Blog

5 reasons to use SQL Common Table Expressions (WITH clause)

After years of working with SQL and optimizing hundreds of poorly written queries, I decided to write this post to encourage the use of what I have found to be one of SQL's most underappreciated features – Common Table Expressions, often known as the 'WITH clause'.

What is a Common Table Expression

In short, CTEs are about taking parts of a SQL query, naming them, and referring to them later – just like in this example:

WITH some_name AS (
SELECT column_1, max(column_2) as max_2
FROM table_1
GROUP BY column_1
)
SELECT t2.column_3, s.max_2
FROM table_2 t2
JOIN some_name s
ON t2.column_1 = s.column_1

If you need more details, have a look here

1. Readability

How many times have you seen queries like this?

SELECT * FROM (
  SELECT
    column_1,
    max(sum_1) as max_1
  FROM (
    SELECT
      column_1,
      column_2,
      sum(column_3) as sum_1
    FROM table
    GROUP BY column_1, column_2
   )
   GROUP BY column_1
) WHERE max_1 < 3

Obviously, somebody else wrote the query, you have to analyze/modify/fix it, and it is much longer and more complicated than the example above (a few extra joins, dozens of columns etc). So:

  1. For a complex query, it takes you some time before you even figure out where the logic starts.
  2. You go crazy scrolling up and down the editor, because the column names are at the top and the corresponding WHERE condition is at the bottom, 500 lines down.

Isn’t this version easier to read?

WITH q_1 AS (
SELECT
  column_1,
  column_2,
  sum(column_3) as sum_1
FROM table
GROUP BY column_1, column_2
),
q_2 AS (
SELECT
  column_1,
  max(sum_1) as max_1
FROM q_1
GROUP BY column_1
)
SELECT *
FROM q_2
WHERE max_1 < 3

2. Logical flow

In the rewritten query with CTEs the flow goes from top to bottom. Furthermore, instead of names like q_1 you can use more meaningful ones like t1_sum (in the first example, the subqueries are not named at all). Together with proper commenting, your SQL code will reflect the business logic much better.

3. Less code

Let’s have a look at the following example:

SELECT
  t1.col_1,
  t2.col_3,
  t3.col_3
FROM
t1
JOIN
(SELECT DISTINCT
 col_2,
 col_3
 FROM xyz
 WHERE col_2 = 2 and col_3 < 15) t2
ON t1.col_2 = t2.col_2
JOIN
(SELECT DISTINCT
 col_2,
 col_3
 FROM xyz
 WHERE col_2 = 2 and col_3 > 22) t3
ON t1.col_2 = t3.col_3

You have probably noticed that parts of the query repeat. These can be extracted into CTEs and the query rewritten like this:

WITH dist AS (
SELECT DISTINCT
  col_2,
  col_3
FROM xyz
WHERE col_2 = 2
),
t2 AS (
  SELECT * FROM dist WHERE col_3 < 15
),
t3 AS (
  SELECT * FROM dist WHERE col_3 > 22
)
SELECT
  t1.col_1,
  t2.col_3,
  t3.col_3
FROM
t1
JOIN t2
ON t1.col_2 = t2.col_2
JOIN t3
ON t1.col_2 = t3.col_3

In this particular example, the savings in code length might not be that spectacular, but as you start rewriting your queries, you’ll discover that many repeating parts can be eliminated. The developers of the database engine you’re working with have probably also noticed this kind of repetition, so please read the next section carefully.

4. Performance and hints

Contrary to what many inexperienced users believe, a SQL statement is not executed ‘as a whole’, but as a series of operations like reads, joins, sorts, unions, etc., executed most often sequentially, sometimes in parallel to each other, and sometimes internally parallelized. How CTEs are processed depends on the specific database engine, but their proper use is a very powerful tool when tuning SQL queries. I will give two examples of how Oracle can take advantage of them:

  1. If a CTE is referenced more than once later in the query, it may get materialized. This means it is saved to a temporary table, so that the operations that produce it are not repeated.
  2. Using hints like MATERIALIZE, it is possible to force saving part of a query to a temporary table. This, in turn, influences all the joins downstream, as the temporary table has no indexes and is hash-joined whenever the join conditions permit.
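
For illustration, here is a rough sketch of how such a hint could look (Oracle syntax; the table and column names are made up):

WITH big_agg AS (
SELECT /*+ MATERIALIZE */
  customer_id,
  sum(amount) as total_amount
FROM transactions
GROUP BY customer_id
)
SELECT o.order_id, b.total_amount
FROM orders o
JOIN big_agg b
ON o.customer_id = b.customer_id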

On the other hand, there are engines like Redshift, which do not materialize CTEs. That should not discourage you from using CTEs to structure your code.

5. CTE recursion

Another database-specific feature is CTE recursion – the ability for a CTE to reference itself and generate new rows based on the rows produced so far. The implementation differs between databases, but if you worked with CONNECT BY queries in Oracle and have now moved to MS SQL Server, you’re likely to run into recursive CTEs. Google for more details.
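
To give you a taste, here is a minimal sketch of a recursive CTE in SQL Server syntax (the employees table and its columns are made up): the first SELECT is the anchor, and the second one references the CTE itself, adding one level of the hierarchy per iteration.

WITH emp_tree AS (
SELECT employee_id, manager_id, 0 as lvl
FROM employees
WHERE manager_id IS NULL
UNION ALL
SELECT e.employee_id, e.manager_id, t.lvl + 1
FROM employees e
JOIN emp_tree t
ON e.manager_id = t.employee_id
)
SELECT * FROM emp_tree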

No more 10-level subqueries

I hope my arguments were convincing and will help you write cleaner, better SQL queries – not only for yourself, but also for those who will be debugging or changing them in the future.

Configuring VPN on Mikrotik RouterOS – Part 1 – PureVPN

This is the first of a series of posts inspired by the huge number of hours I’ve recently spent configuring my Mikrotik router to get VPN working. Although there is quite a lot of information on the web, I ran into a few traps that were not documented. I hope this saves you (or, worst case, future me) some time.

After I followed the guides found on other pages, the connection technically worked, but it was painfully slow and some pages weren’t loading at all. Here I’ll point out what’s important.

Just one more thing – my RouterOS version was 6.33.5.

Step 1 – Interfaces

The first thing to do is to add an interface. I tried some of the available protocols and the results were as follows:

  • PPTP – Worked ok
  • SSTP – Worked ok
  • OpenVPN – did not work due to lack of TLS authentication support
  • Others – did not try

Interfaces

An important decision here is MTU. There are plenty of resources about that on the web, so I’ll just say I set my value at 1472 (1500 – 28), which caused no fragmentation when trying to ping sites from my computers.

Make sure your PPP profile has not been modified. It’s best to create a separate profile for this connection (with default settings – just copy default-encryption) so that you don’t mess it up when setting up your router as a VPN server.
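
If you prefer the terminal to Winbox, a PPTP client interface, for example, can be added roughly like this (the server address, credentials, and the interface and profile names are placeholders – substitute your own PureVPN details):

/interface pptp-client add name=pptp-out1 connect-to=your.purevpn.server user=your_user password=your_password profile=purevpn add-default-route=no max-mtu=1472 max-mru=1472 disabled=no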

Step 2 – NAT

Just add masquerade on the outgoing interface.
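
The equivalent CLI command looks roughly like this (assuming the VPN interface is called pptp-out1):

/ip firewall nat add chain=srcnat out-interface=pptp-out1 action=masquerade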

NAT

Step 3 – Mangle

Setting up mangle consists of two parts.

Mark routing

Routing marking is optional. Use it if you want to route some of your traffic via one connection and the rest through another – for example, your personal WiFi through the VPN and the guest WiFi through the unencrypted connection.

Mangle rules are processed sequentially in the order they appear on the list. If you have multiple mark-routing rules, make sure later rules don’t overwrite the value in an unexpected way. A good practice is to have more specific rules follow the less specific ones.
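
A sketch of a mark-routing rule from the terminal (the source subnet and the mark name are just examples):

/ip firewall mangle add chain=prerouting src-address=192.168.88.0/24 action=mark-routing new-routing-mark=to_vpn passthrough=no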

Change MSS

This is the real magic! 99% of the guides recommend playing with the MTU value if you have connection problems. But I set that down to the minimum value of 512 and it didn’t help either. What solved my connection issues of low speed and some pages not loading at all was this magical MSS rule. Again, the value takes a bit of trial and error – 1400 was a bit too much, so I finally ended up with 1350. But the results were amazing: speed jumped from 2-3 Mbps to 60 Mbps, nearly exhausting my bandwidth. I’m not sure, though, whether this isn’t just a workaround for some RouterOS bug.
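
A rule along these lines should do the trick from the terminal (again assuming the VPN interface is pptp-out1; adjust new-mss to whatever works for your link):

/ip firewall mangle add chain=forward out-interface=pptp-out1 protocol=tcp tcp-flags=syn action=change-mss new-mss=1350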

Mangle

Step 4 – Routes

Finally, if you did not select Add Default Route in the interface setup and want to use some more sophisticated routing rules, you need to configure a route yourself. Don’t forget to use the routing mark.
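
From the terminal, such a route could look like this (assuming the pptp-out1 interface and the to_vpn routing mark from the mangle step):

/ip route add dst-address=0.0.0.0/0 gateway=pptp-out1 routing-mark=to_vpn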

Routing

Wrap up

Hope this little guide was useful. All comments and questions are welcome.

Next part will be on configuring a budget point-to-site SSTP connection to Azure virtual network.

Exploring Me @ Google Fit

Intro

Wearables are hot these days. I have already got used to wearing my Jawbone wristband 24/7, and after a year of using a Pebble I’ll never go back to my old Casio watch. In this post I’ll put down a few words about my experiments with Google’s platform for everyday activity data collected from wearables – Google Fit.

Google Fit is like a huge datastore for readings from sensors, the most notable being, of course, Android-based phones. It’s a datastore that is also pretty open – any application can read and write data in there. The complete data model can be found here. I’ll just briefly point out the key elements:

  • Datastreams – a datastream is basically a source of data. It can be raw data like phone sensor readings or user input. There are also some more sophisticated datastreams, as Google does some processing on the raw data. Each datastream consists of many datapoints, which are values of specific datatypes over time. Unfortunately, there is little documentation on how the values in specific streams are calculated. Only this post by David Schmidt gives some hints on what they mean.
  • Datasets – datasets are just the values of the datapoints over time. You cannot just get all data from a datastream; you have to query for a specific time range. Google expects you to provide the start and end dates as 64-bit integers with nanoseconds since the epoch (see the snippet after this list). This requires some calculations but makes the service future-proof – so far I have found nothing with such precision in my data.
  • Sessions – sessions are supposed to be used to record user activities. You input a start date, end date and activity type to record what you were doing. Personally, I think Google copied the idea of recording activities by pressing the button on the Jawbone band. As of now, my Google Fit app has not shown any sessions even though I have entered a few activities there.
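
Here is a minimal sketch of that timestamp arithmetic in R (the to_nanos helper and the dates are just illustrative; the resulting id is what you pass to the datasets endpoint):

> to_nanos <- function(x) sprintf("%.0f", as.numeric(as.POSIXct(x, tz = "UTC")) * 1e9)
> paste(to_nanos("2014-11-01"), to_nanos("2014-12-01"), sep = "-")
[1] "1414800000000000000-1417392000000000000"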

Getting the data

Having read the documentation I started playing to figure out what Google knows about my living and training habits. The first natural choice to see ‘what’s in the data’ was the Google APIs Explorer. Just a few minutes of work with the tool proved the old saying about data analysis to still be true – 80% of the time is spent on data preparation. Google’s tool wasn’t too friendly with input parameters and would kill the browser whenever the output file was larger than a few hundred kilobytes. I certainly needed another solution.

And I had to develop one. The choice (after a few tries) was a Ruby on Rails application that would speed up data extraction. Getting an application prototype up and running is a pretty smooth process in RoR, though I still had to implement two mechanisms:

  • oauth2 authentication – to get the data from Google Fit you have to impersonate the user
  • Some sort of data extraction interface

So here comes the app.

The UI is based on the most common stack of open source components like Bootstrap, Sass, etc.

Authentication is via Google Account only. The default view for the user is the list of their datastreams (data sources). After logging in, the user must refresh the list manually. As the picture shows, I already have 27 streams in Google Fit, most of them coming from com.google.android.gms, which is the Google Play service running in the background on the phone. The list lets you either get the JSON specification of a datastream or dig deeper into the datastream extraction form.

Submitting the form gives you a nicely formatted JSON file with your datapoints.

Now that I had the data and an easy way to get it out, it was time to dig into it.

Exploring the data

27 datastreams is quite a lot. Some of them overlap or are preprocessed, so browsing through them was the first step before any further analysis. At this point R came in handy. To get the data in and also to structure the code, I decided to bundle it into an R package I called fitness, which can be found on GitHub (ms32035/R-fitness).

So simply type in

devtools::install_github('ms32035/R-fitness')
#And don't forget to load it
library(fitness)

to get the package installed into your R.

After browsing the downloaded JSON files and their respective structures, I decided to go on with the following datastreams:

pruned_distance

The dataset contains the distance in meters traveled in the reading period. The data in this datastream is filtered and preprocessed by Google, so I expect it to be clean and free of outliers.

> distance <- import_distance(path_to_JSON_file)
# Few sample rows
            startTime             endTime      val
1 2014-11-27 08:31:02 2014-11-27 08:38:34 265.9062
2 2014-11-27 08:38:34 2014-11-27 08:42:03 574.2574
3 2014-11-27 08:42:03 2014-11-27 08:45:07 626.4064
4 2014-11-27 08:45:07 2014-11-27 08:47:39 624.4772
5 2014-11-27 08:47:39 2014-11-27 08:50:19 607.6943

merge_location_samples

The dataset contains the recorded cell phone positions. Like the previous stream, it is preprocessed. Despite having both a start and an end time, the values are always single time points.

> location <- import_location(path_to_JSON_file)
# Few sample rows (fake coordinate values - that's sensitive information)
            startTime             endTime      lat      long    alt
1 2014-11-14 19:35:21 2014-11-14 19:35:21 42.60910 -21.3315 31.500
2 2014-11-14 19:37:54 2014-11-14 19:37:54 42.60912 -21.3315 25.505
3 2014-11-14 19:40:25 2014-11-14 19:40:25 42.61017 -21.3314 38.000
4 2014-11-14 19:43:49 2014-11-14 19:43:49 42.61018 -21.3319 30.000
5 2014-11-14 19:49:50 2014-11-14 19:49:50 42.60911 -21.3316 48.000

merge_activity_segments

This is a pretty interesting datastream. The values are of type com.google.activity.segment. The details of how they are calculated aren’t public, but since the source is com.google.android.gms, I guess it is a summary of the activities detected by the phone. The levels of the variable in my data were:

> levels(activity$activity)
[1] "0"  "1"  "3"  "7"  "8"  "24"

which respectively stand for: In vehicle, Biking, Still (not moving), Walking, Running, Dancing. It seems pretty obvious that the values are calculated based on distance/speed. Still, two of them were a bit mysterious to me:

  • Biking – I don’t remember doing any biking in that period. There was just one observation – a false reading or interpretation by Google.
  • Dancing – Can Google really detect dancing from accelerometer data? No, that was my manual input in the Google Fit app. A more interesting question is why Google decided to handle this as a datapoint value and not as a session.

The contingency table for my activities looks like this:

> table(activity$activity)

   0    1    3    7    8   24
 288    1 1017  903    9    1

Importing is identical to before.

> activity <- import_activity(path_to_JSON_file)
# Few sample rows
            startTime             endTime activity
1 2014-01-06 01:46:05 2014-01-06 01:48:06        7
2 2014-01-06 01:48:06 2014-01-06 02:00:24        3
3 2014-01-06 02:00:24 2014-01-06 02:02:23        7
4 2014-01-06 02:02:23 2014-01-06 02:10:34        3
5 2014-01-06 02:10:34 2014-01-06 02:19:39        7

merge_step_deltas

As expected, the datastream gives step counts in the time period. Again, the function to import is provided.

> steps <- import_steps(path_to_JSON_file)
# Few sample rows
            startTime             endTime steps
1 2014-01-06 01:45:18 2014-01-06 01:46:35   151
2 2014-01-06 01:46:35 2014-01-06 01:46:42    12
3 2014-01-06 01:46:42 2014-01-06 01:47:35    49
4 2014-01-06 01:47:51 2014-01-06 01:48:51     1
5 2014-01-06 01:48:51 2014-01-06 01:48:51     1

derive_step_cadence<-raw:com.google.step_count.cumulative:LGE:Nexus 5:XXXXXXXX:Step Counter

The last datastream is very similar to the steps one, except that it provides cadence, which measures the pace in steps per minute.

> cadence <- import_cadence(path_to_JSON_file)
# Few sample rows
            startTime             endTime    cadence
1 2014-01-06 01:45:18 2014-01-06 01:46:35 116.882370
2 2014-01-06 01:46:35 2014-01-06 01:46:42  98.839378
3 2014-01-06 01:46:42 2014-01-06 01:47:35  55.765633
4 2014-01-06 01:47:35 2014-01-06 01:48:51   0.793142
5 2014-01-06 01:48:51 2014-01-06 01:48:51 170.223389

To decide on the above datastreams, I explored all of my data in Google Fit. The streams start in November 2014 and together contain about 300,000 observations – in total, about 60 MB of JSON data. But if you consider that it’s not even been three months since the start of the service, the numbers are a bit more scary:

  • Over 3500 readings per day
  • Over 150 readings per hour
  • Over 2.5 readings per minute

Preparing the data

All of the above data are time series. A simple way to check whether there is a direct link between them is to try to merge them. Below are the observation counts for the original and the merged (inner joined) time series. I tried merging both by start and end dates.

> row_dependencies(distance,location,activity,steps,cadence)

[1] "Distance: 2419"
[1] "Location: 14659"
[1] "Activity: 2219"
[1] "Steps: 11254"
[1] "Cadence: 8383"
[1] "Distance + Location (startTime + endTime): 0"
[1] "Distance + Location (startTime): 2183"
[1] "Distance + Location (endTime): 2150"
[1] "Distance + Activity (startTime): 4"
[1] "Distance + Activity (endTime): 1"
[1] "Location + Activity (startTime): 5"
[1] "Location + Activity (endTime): 5"
[1] "Distance + Steps (startTime): 29"
[1] "Location + Steps (startTime): 45"
[1] "Activity + Steps (startTime): 33"
[1] "Activity + Steps (endTime): 38"
[1] "Steps + Cadence (startTime + endTime): 4698"
[1] "Steps + Cadence (startTime): 4812"
[1] "Steps + Cadence (endTime): 7595"

The conclusions are as follows:

  • Distance can be merged with location without any significant data loss
  • Steps and cadence can be merged by some pretty simple algorithm operating on start and end times
  • To overlay the two merged groups with detected activity some sort of binning will have to be used
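
For reference, a minimal sketch of such merges in base R (inner joins on the data frames imported above):

> dist_loc <- merge(distance, location, by = "startTime")
> steps_cad <- merge(steps, cadence, by = c("startTime", "endTime"))
> nrow(dist_loc)   # should roughly match the counts reported above
> nrow(steps_cad)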

Now that I knew the data very well, I had to ask myself the question: “What do I want to get out of it?”. My three questions were:

  1. At what times of the week do I walk the most?
  2. What are the top 5 places I spend my time at? I mean the 3 other than work and home.
  3. What’s my daily time distribution of activities (walking, running, etc.)?

One other question was “How much time will I spend running in the coming weeks?”. Since it’s February now and I only have data since November, the results wouldn’t make too much sense, as the weather is only going to improve.

Weekly steps heat map

To prepare a heat map I had to assign readings to time bins. If a reading falls entirely within a single time bin, there is no problem. However, if a reading spans multiple bins, the function divides the measured value equally across all of them. The R code for the binning function is:

create_bins <- function(sourceData, startTimeColumn, endTimeColumn, valueColumn, nbins=24) {

  # One row per intra-day bin, one column per weekday
  binValues <- matrix(0, nrow = nbins, ncol = 7)
  binSize <- 60*24 / nbins   # bin length in minutes

  for (i in 1:nrow(sourceData)) {

    # Linear bin index of the reading's start: (weekday - 1) * nbins + bin within the day
    startTime <- sourceData[i, startTimeColumn]
    startDay <- as.numeric(strftime(startTime, '%u'))
    startBinDay <- (as.numeric(difftime(startTime, trunc(startTime, units = "day"), units = "mins")) %/% binSize) + 1
    startBin <- (startDay - 1) * nbins + startBinDay

    # The same for the reading's end
    endTime <- sourceData[i, endTimeColumn]
    endDay <- as.numeric(strftime(endTime, '%u'))
    endBinDay <- (as.numeric(difftime(endTime, trunc(endTime, units = "day"), units = "mins")) %/% binSize) + 1
    endBin <- (endDay - 1) * nbins + endBinDay

    # Spread the measured value equally across all bins the reading spans
    binNum <- endBin - startBin + 1

    for (j in startBin:endBin)
      binValues[j] <- binValues[j] + sourceData[i, valueColumn] / binNum

  }

  return(binValues)

}

A quick check comparing the sums of unbinned and binned steps confirmed that the function works correctly.

> create_bins(steps,'startTime','endTime','steps') -> steps_binned
> sum(steps_binned)
[1] 436204
> sum(steps$steps)
[1] 436204

So here is a nice heat map generated using D3.

Conclusions? My most intensive times are the daily commute to work, lunch times and some weekend activities.

Where do I spend my time

In this part of the analysis I followed the ideas of Jeffrey Evans, with the remark that the literature widely accepts k-means-based methods for spatial data clustering. The clara method proved quite efficient, so I modified his function a bit.

opt_cluster <- function(x, k_max=10, k_chosen=10, ...) {

  if (!require(cluster)) stop("cluster PACKAGE MISSING")

  # Average silhouette width for each candidate number of clusters
  asw <- numeric(k_max)
  for (k in 2:k_max) {
    asw[k] <- clara(x, k, ...)$silinfo$avg.width
  }

  # The k with the widest average silhouette is the recommended one
  k.best <- which.max(asw)
  print(paste("OPTIMAL-K", k.best, sep=": "))

  plot(1:k_max, asw, type="s", main="Clustering Optimization using K-Medoids",
       xlab="K (number of clusters)", ylab = "mean silhouette width")
  axis(1, k.best, paste("best", k.best, sep="\n"), col="red", col.axis="red")

  # Return the clustering for the number of clusters actually chosen by the user
  return(clara(x, k_chosen, ...))

}

While the function recommended 3 clusters, I decided to override the recommendation. I won't post the specific locations here because that data is a bit too sensitive, but it is pretty easy to explain what happened in the results. I traveled a lot around the world in that period. Places I normally go to often got grouped with the location of my home, and the two other places were locations I traveled to (very precise, by the way).

Optimal clusters

The clustering code was

cl <- opt_cluster(location[,c('lat','long')],k_max=40,k_chosen=23)
location_clustered <- cbind(location, cluster=cl$clustering)

An aggregation of the point counts gave me:

> sort(table(location_clustered[,'cluster']),decreasing = TRUE)

  18   16   20   17    2    7    4   22    1   19    6    3   13   10   14    5    8   15   23   11    9   12   21 
7697 4088  768  542  255  232  202  126  115  111   79   77   50   48   46   37   33   33   33   26   25   18   18 

Where the clusters are:

  • 18 - Home
  • 16 - Work
  • 20 - My parents' home
  • 17 - A point in the center of my daily commute between work and home - that's scary
  • 2 - A hotel I stayed in during vacation. That's a pretty surprising place, because I stayed at other hotels for longer. This might be related to WiFi readings.

Daily activities distribution

Finally, let's calculate the percentages of time I spent on particular activities ...

> activity_duration <- cbind(activity, duration = as.numeric(difftime(activity$endTime, activity$startTime, units = "mins")))
> total_duration <- aggregate(duration ~ activity, activity_duration, sum)
> duration_percentage <- cbind(total_duration, percentage = round(total_duration[,'duration'] / sum(total_duration[,'duration']), 4))
> duration_percentage
  activity  duration percentage
1        0  7656.567     0.0703
2        1     1.350     0.0000
3        3 97242.733     0.8924
4        7  3926.300     0.0360
5        8    16.750     0.0002
6       24   120.000     0.0011

... and put it on a nice D3 histogram.

Not too much running. I need to improve.

Wrap up

Google Fit is a service that offers a broad set of data that could be the basis for a lot of interesting research. It's still a new service – the span of the data is pretty short, the documentation is incomplete, and there is no friendly way to get the data out.

Powered by:

  • Ruby on Rails
  • R
  • D3

Something nice for all SAS & statistics lovers

It’s been quite a while since I last released any of my projects. Today I do it again.

What is it? An add-in for SAS Enterprise Guide adding support for PROC LIFEREG.

For whom?

  • People working with this topic every day – it may make your work a bit easier
  • All SAS Enterprise Guide users – it demonstrates what add-ins can do

You can find it here.

Have fun.