Configuring VPN on Mikrotik RouterOS – Part 1 – PureVPN

This is the first of a series of posts inspired by the huge number of hours I’ve recently spent configuring my Mikrotik router to get VPN working. Although there is quite a lot of information on the web, I ran into a few traps that were not documented. I hope this saves you (or, worst case, future me) some time.

After I followed the guides found on other pages, the connection technically worked, but it was painfully slow and some pages weren’t loading at all. Here I’ll point out what’s important.

Just one more thing – my RouterOS version was 6.33.5.

Step 1 – Interfaces

The first thing to do is add an interface. I tried some of the available protocols, with the following results:

  • PPTP – Worked ok
  • SSTP – Worked ok
  • OpenVPN – did not work due to lack of TLS authentication support
  • Others – did not try


An important decision here is the MTU. There are plenty of resources about it on the web, so I’ll just say I set my value to 1472 (1500 – 28), which caused no fragmentation when pinging sites from my computers.

Make sure your PPP profile has not been modified. It’s best to create a separate profile for this connection (with default settings – just copy default-encryption) so that you don’t mess it up when setting up your router as a VPN server.
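As a sketch, the CLI equivalent of the above could look like this. The server address, credentials and interface name are placeholders, and I’m assuming the PPTP client here:

```
# Copy default-encryption so the stock profiles stay untouched
/ppp profile add name=purevpn copy-from=default-encryption
# Hypothetical server and credentials; MTU/MRU set to 1472 as discussed
/interface pptp-client add name=pptp-out1 connect-to=vpn.example.com \
    user=myuser password=mypass profile=purevpn \
    max-mtu=1472 max-mru=1472 disabled=no
```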

Step 2 – NAT

Just add masquerade on the outgoing interface.
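In CLI form, assuming the VPN interface from step 1 is named pptp-out1:

```
/ip firewall nat add chain=srcnat out-interface=pptp-out1 action=masquerade
```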


Step 3 – Mangle

Setting up mangle consists of two parts.

Mark routing

Routing marking is optional. Use it if you want to route some of your traffic via one connection and the rest through another – for example, your personal WiFi through the VPN and the guest WiFi through the unencrypted connection.

Mangle rules are processed sequentially, in the order they appear on the list. If you have multiple mark-routing rules, make sure later rules don’t overwrite the value in an unexpected way. Best practice is to have the more specific rules follow the less specific ones.
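A sketch of such a rule, assuming the personal WiFi lives on 192.168.1.0/24 and the routing mark is called via-vpn (both names are mine, not from any guide):

```
# Mark traffic from the personal WiFi subnet for routing through the VPN
/ip firewall mangle add chain=prerouting src-address=192.168.1.0/24 \
    action=mark-routing new-routing-mark=via-vpn passthrough=no
```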

Change MSS

This is the real magic! 99% of the guides recommend playing with the MTU value if you have connection problems. But I set it down to the minimum value of 512 and it didn’t help either. What solved my connection issues of low speed and some pages not loading at all was this magical MSS rule. Again, the value takes a bit of trial and error – 1400 was a bit too much, so I finally ended up with 1350. But the results were amazing – speed jumped from 2-3 Mbps to 60 Mbps, nearly exhausting my bandwidth. I’m not sure, though, whether this isn’t just a workaround for some RouterOS bug.
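The rule can be sketched like this (interface name assumed as before). It clamps the MSS of TCP SYN packets leaving through the tunnel, so both ends negotiate segments that fit inside the tunnel MTU:

```
/ip firewall mangle add chain=forward protocol=tcp tcp-flags=syn \
    tcp-mss=1351-65535 out-interface=pptp-out1 \
    action=change-mss new-mss=1350
```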


Step 4 – Routes

Finally, if you did not select Add Default Route in the interface setup and want to use some more sophisticated routing rules, you need to configure a route yourself. Don’t forget to use the routing mark.
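For example (interface and mark names are the ones assumed earlier):

```
# Default route for traffic carrying the via-vpn routing mark
/ip route add dst-address=0.0.0.0/0 gateway=pptp-out1 routing-mark=via-vpn
```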


Wrap up

Hope this little guide was useful. All comments and questions are welcome.

Next part will be on configuring a budget point-to-site SSTP connection to Azure virtual network.

Exploring Me @ Google Fit


Wearables are hot these days. I’ve already got used to wearing my Jawbone wristband 24/7, and after a year of using Pebble I’ll never go back to my old Casio watch. In this post I’ll put down a few words about my play with Google’s platform for everyday activity data collected from wearables – Google Fit.

Google Fit is like a huge datastore for readings from sensors, the most notable being, of course, Android-based phones. It’s a datastore that is also pretty open – any application can read and write data there. The complete data model can be found here. I’ll just briefly point out the key elements:

  • Datastreams – a datastream is basically a source of data. It can be raw data like phone sensor readings or user input. There are also some more sophisticated datastreams, as Google does some processing on the raw ones. Each datastream consists of many datapoints, which are values of specific datatypes over time. Unfortunately, there is little documentation on how the values in specific streams are calculated. Only this post by David Schmidt gives some hints on what they mean.
  • Datasets – datasets are just the values of datapoints over time. You cannot simply get all the data from a datastream; you have to query for a specific time range. Google expects you to provide the start and end dates as 64-bit integers with nanoseconds from epoch. This requires some calculations but makes the service future-proof. So far I have found nothing with such precision in my data.
  • Sessions – sessions are supposed to be used to record user activities. You input a start date, end date and activity type to record what you were doing. Personally, I think the Google guys copied the idea of recording activities by pressing the button on the Jawbone band. As of now my Google Fit app does not show any sessions, even though I entered a few activities there.
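The nanosecond timestamps mentioned above can be computed like this in R. Note that R doubles cannot represent nanoseconds-since-epoch exactly, so a safe sketch is to build the value as a string (the dates here are just examples):

```r
# Whole seconds since epoch fit in an integer; append nine zeros
# to get nanoseconds without losing precision in a double
start_ns <- paste0(as.integer(as.POSIXct("2014-11-01", tz = "UTC")), "000000000")
end_ns   <- paste0(as.integer(as.POSIXct("2014-12-01", tz = "UTC")), "000000000")
```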

Getting the data

Having read the documentation, I started playing to figure out what Google knows about my living and training habits. The first natural choice to see what’s in the data was Google APIs Explorer. Just a few minutes of work with the tool proved the old saying about data analysis to still be true – 80% of the time is spent on data preparation. Google’s tool wasn’t too friendly with input parameters and would kill the browser whenever the output file was larger than a few hundred kilobytes. Clearly I needed another solution.

And I had to develop one. The choice (after a few tries) was a Ruby on Rails application that would speed up data extraction. Getting an application prototype up and running is a pretty smooth process in RoR, though I still had to implement two mechanisms:

  • oauth2 authentication – to get the data from Google Fit you have to impersonate the user
  • Some sort of data extraction interface

So here comes the app.

The UI is based on the most common stack of open source components like Bootstrap, Sass, etc.

Authentication is via Google Account only. The default view for the user is the list of his datastreams (datasources). After logging in, the user must refresh the list manually. As the picture shows, I already have 27 streams in Google Fit, most of them produced by the Google Play services process running in the background on the phone. The list lets you either get the JSON specification of a datastream or dig deeper into the datastream extraction form.

Submitting the form gives you a nicely formatted JSON file with your datapoints.

Now that I had the data and an easy way to get it out, it was time to dig into it.

Exploring the data

27 datastreams is quite a lot. Some of them overlap or are preprocessed, so browsing through them was the first step for any further analysis. Here R came in handy. To get the data in and structure the code, I decided to bundle it into an R package I called Fitness. It can be found at:

So simply type in

#And don't forget to load it

to get the package installed into your R.

After browsing the downloaded JSON files and their respective structures, I decided to go on with the following datastreams:


The dataset contains the distance in meters traveled in the reading period. The data in this datastream is filtered and preprocessed by Google, so I expect it to be clean and free of outliers.

> distance <- import_distance(path_to_JSON_file)
# Few sample rows
            startTime             endTime      val
1 2014-11-27 08:31:02 2014-11-27 08:38:34 265.9062
2 2014-11-27 08:38:34 2014-11-27 08:42:03 574.2574
3 2014-11-27 08:42:03 2014-11-27 08:45:07 626.4064
4 2014-11-27 08:45:07 2014-11-27 08:47:39 624.4772
5 2014-11-27 08:47:39 2014-11-27 08:50:19 607.6943


The dataset contains data about recorded cell phone positions. Like the previous stream, it is preprocessed. Despite having both start and end times, the values are always single time points.

> location <- import_location(path_to_JSON_file)
# Few sample rows (fake coordinate values - that's sensitive information)
            startTime             endTime      lat      long    alt
1 2014-11-14 19:35:21 2014-11-14 19:35:21 42.60910 -21.3315 31.500
2 2014-11-14 19:37:54 2014-11-14 19:37:54 42.60912 -21.3315 25.505
3 2014-11-14 19:40:25 2014-11-14 19:40:25 42.61017 -21.3314 38.000
4 2014-11-14 19:43:49 2014-11-14 19:43:49 42.61018 -21.3319 30.000
5 2014-11-14 19:49:50 2014-11-14 19:49:50 42.60911 -21.3316 48.000


This is a pretty interesting datastream. The details of how the values are calculated aren’t public, but I guess it is a summary of activities detected on the phone. The levels of my variable were:

> levels(activity$activity)
[1] "0"  "1"  "3"  "7"  "8"  "24"

which respectively stand for: In vehicle, Biking, Still (not moving), Walking, Running, Dancing. It seems pretty obvious that the values are calculated based on distance / speed. Still, two of them were a bit mysterious to me:

  • Biking – I don’t remember doing any biking in the time period. There was just one observation – a false reading or interpretation by Google.
  • Dancing – Can Google really detect dancing from accelerometer data? No, that was my manual input in the Google Fit app. A more interesting question is why Google decided to handle this as a datapoint value and not as a session.

The contingency table for my activities looks like this:

> table(activity$activity)

   0    1    3    7    8   24
 288    1 1017  903    9    1

Importing is the same as before.

> activity <- import_activity(path_to_JSON_file)
# Few sample rows
            startTime             endTime activity
1 2014-01-06 01:46:05 2014-01-06 01:48:06        7
2 2014-01-06 01:48:06 2014-01-06 02:00:24        3
3 2014-01-06 02:00:24 2014-01-06 02:02:23        7
4 2014-01-06 02:02:23 2014-01-06 02:10:34        3
5 2014-01-06 02:10:34 2014-01-06 02:19:39        7


As expected, the datastream gives step counts in the time period. Again, an import function is provided.

> steps <- import_steps(path_to_JSON_file)
# Few sample rows
            startTime             endTime steps
1 2014-01-06 01:45:18 2014-01-06 01:46:35   151
2 2014-01-06 01:46:35 2014-01-06 01:46:42    12
3 2014-01-06 01:46:42 2014-01-06 01:47:35    49
4 2014-01-06 01:47:51 2014-01-06 01:48:51     1
5 2014-01-06 01:48:51 2014-01-06 01:48:51     1

derive_step_cadence< 5:XXXXXXXX:Step Counter

The last datastream is very similar to the steps one, except that it provides cadence, which measures pace in steps per minute.

> cadence <- import_cadence(path_to_JSON_file)
# Few sample rows
            startTime             endTime    cadence
1 2014-01-06 01:45:18 2014-01-06 01:46:35 116.882370
2 2014-01-06 01:46:35 2014-01-06 01:46:42  98.839378
3 2014-01-06 01:46:42 2014-01-06 01:47:35  55.765633
4 2014-01-06 01:47:35 2014-01-06 01:48:51   0.793142
5 2014-01-06 01:48:51 2014-01-06 01:48:51 170.223389

To decide on the above datastreams I explored all my data on Google Fit. The streams start in November 2014 and together have about 300 000 observations. In total that’s about 60 MB of JSON data. But if you consider that it’s not even been three months since the start of the service, the numbers are a bit scarier:

  • Over 3500 readings per day
  • Over 150 readings per hour
  • Over 2.5 readings per minute

Preparing the data

All the above data are time series. A simple way to check whether there is a simple link between them is to just try to merge them. Below are the observation counts for the original and merged (inner-joined) time series. I tried merging both by start and end dates.

> row_dependencies(distance,location,activity,steps,cadence)

[1] "Distance: 2419"
[1] "Location: 14659"
[1] "Activity: 2219"
[1] "Steps: 11254"
[1] "Cadence: 8383"
[1] "Distance + Location (startTime + endTime): 0"
[1] "Distance + Location (startTime): 2183"
[1] "Distance + Location (endTime): 2150"
[1] "Distance + Activity (startTime): 4"
[1] "Distance + Activity (endTime): 1"
[1] "Location + Activity (startTime): 5"
[1] "Location + Activity (endTime): 5"
[1] "Distance + Steps (startTime): 29"
[1] "Location + Steps (startTime): 45"
[1] "Activity + Steps (startTime): 33"
[1] "Activity + Steps (endTime): 38"
[1] "Steps + Cadence (startTime + endTime): 4698"
[1] "Steps + Cadence (startTime): 4812"
[1] "Steps + Cadence (endTime): 7595"

The conclusions are as follows:

  • Distance can be merged with location without any significant data loss
  • Steps and cadence can be merged by some pretty simple algorithm operating on start and end times
  • To overlay the two merged groups with detected activity some sort of binning will have to be used
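As a sketch, the first two conclusions translate into merges like the following (column names as produced by the import functions above; the joined data frame names are mine):

```r
# Distance and location join cleanly on the start time alone
dist_loc <- merge(distance, location, by = "startTime")
# Steps and cadence share most of their intervals, so join on both ends
steps_cad <- merge(steps, cadence, by = c("startTime", "endTime"))
```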

Now that I knew the data very well, I asked myself: “What do I want to get out of it?”. My 3 questions were:

  1. At what times of the week do I walk the most?
  2. What are the top 5 places I spend my time at? I mean the three other than work and home.
  3. What’s my daily distribution of activities? (Walking, running, etc.)

One other question was “How much time will I spend running in the next weeks?”. Since it’s February now and I only have data since November, the results wouldn’t make too much sense, as the weather is only going to improve.

Weekly steps heat map

To prepare a heat map I had to assign readings to time bins. If a reading falls entirely within a single time bin there is no problem. However, if a reading spans multiple bins, the function divides the measured value equally across all of them. The R code for the binning function is:

create_bins <- function(sourceData, startTimeColumn, endTimeColumn, valueColumn, nbins=24) {

  # one column per weekday, nbins rows per day
  binValues <- matrix(0, nrow = nbins, ncol = 7)
  binSize <- 60 * 24 / nbins

  for (i in 1:nrow(sourceData)) {

    startTime <- sourceData[i, startTimeColumn]
    startDay <- as.numeric(strftime(startTime, '%u'))
    startBinDay <- (as.numeric(difftime(startTime, trunc(startTime, units = "day"), units = "mins")) %/% binSize) + 1
    startBin <- (startDay - 1) * nbins + startBinDay

    endTime <- sourceData[i, endTimeColumn]
    endDay <- as.numeric(strftime(endTime, '%u'))
    endBinDay <- (as.numeric(difftime(endTime, trunc(endTime, units = "day"), units = "mins")) %/% binSize) + 1
    endBin <- (endDay - 1) * nbins + endBinDay

    binNum <- endBin - startBin + 1

    # split the value equally across all bins the reading spans
    for (j in startBin:endBin)
      binValues[j] <- binValues[j] + sourceData[i, valueColumn] / binNum
  }

  return(binValues)
}
A quick check comparing the sums of the binned and unbinned steps showed the function works correctly.

> create_bins(steps,'startTime','endTime','steps') -> steps_binned
> sum(steps_binned)
[1] 436204
> sum(steps$steps)
[1] 436204

So here is a nice heat map generated using D3.

Conclusions? My most intensive times are the daily commute to work, lunchtimes and some weekend activities.

Where do I spend my time

In this part of the analysis I followed the ideas of Jeffrey Evans, with the remark that the literature widely accepts k-medoids-based methods for spatial data clustering. The clara method proved quite efficient, so I modified his function a bit.

opt_cluster <- function(x, k_max=10, k_chosen=10, ...) {
  if (!require(cluster)) stop("cluster PACKAGE MISSING")
  asw <- numeric(k_max)
  for (k in 2:k_max) {
    asw[k] <- clara(x, k, ...)$silinfo$avg.width
  }
  k.best <- which.max(asw)
  print(paste("OPTIMAL-K", k.best, sep = ": "))
  plot(1:k_max, asw, type = "s", main = "Clustering Optimization using K-Medoids",
       xlab = "K (number of clusters)", ylab = "mean silhouette width")
  axis(1, k.best, paste("best", k.best, sep = "\n"), col = "red", col.axis = "red")
  return(clara(x, k_chosen, ...))
}

While the function recommended 3 clusters, I decided to override the recommendation. I won’t post specific locations here because that data is a bit too sensitive, but it is pretty easy to explain what happened in the results. I traveled a lot around the world in that period. Places I often go to got grouped around the location of my home, and the other places were locations I traveled to (very precise, by the way).

Optimal clusters

The clustering code was

cl <- opt_cluster(location[,c('lat','long')],k_max=40,k_chosen=23)
location_clustered <- cbind(location, cluster=cl$clustering)

Aggregating the point counts gave me:

> sort(table(location_clustered[,'cluster']),decreasing = TRUE)

  18   16   20   17    2    7    4   22    1   19    6    3   13   10   14    5    8   15   23   11    9   12   21 
7697 4088  768  542  255  232  202  126  115  111   79   77   50   48   46   37   33   33   33   26   25   18   18 

Where the clusters are:

  • 18 – Home
  • 16 – Work
  • 20 – My parent’s home
  • 17 – A point in the center of my daily commute between home and work – that’s scary
  • 2 – A hotel I stayed in during vacations. That’s a pretty surprising place, because I stayed at other hotels longer. This might be related to WiFi readings.

Daily activities distribution

Finally, let’s calculate the percentages of time I spent on particular activities …

> activity_duration <- cbind(activity,duration= as.numeric(difftime(activity$endTime,activity$startTime,units = "mins")))
> total_durations <- aggregate(duration ~ activity, activity_duration, sum)
> duration_percentage <- cbind(total_durations, percentage = round(total_durations[,'duration'] / sum(total_durations[,'duration']),4))
> duration_percentage
  activity  duration percentage
1        0  7656.567     0.0703
2        1     1.350     0.0000
3        3 97242.733     0.8924
4        7  3926.300     0.0360
5        8    16.750     0.0002
6       24   120.000     0.0011

… and put it on a nice D3 histogram.

Not too much running. I need to improve.

Wrap up

Google Fit is a service that offers a broad set of data that could be the basis for a lot of interesting research. It’s still a new service – the span of the data is pretty short, the documentation is incomplete and there is no friendly way to get the data out.

Powered by:

  • Ruby on Rails
  • R
  • D3

Something nice for all SAS & statistics lovers

It’s been quite a while since I last released any of my projects. Today I do it again.

What is it? An add-in for SAS Enterprise Guide that adds support for Proc Lifereg.

For whom?

  • People working with this topic every day – it may make your work a bit easier
  • All SAS Enterprise Guide users – demonstrates what add-ins can do

You can find it here.

Have fun.