Intro
Wearables are hot these days. I have already got used to wearing my Jawbone wristband 24/7, and after a year with a Pebble I’ll never go back to my old Casio watch. In this post I’ll put down a few words about my experiments with Google’s platform for everyday activity data collected from wearables – Fit.
Google Fit is essentially a huge datastore for sensor readings, the most notable source being, of course, Android based phones. It’s also a pretty open datastore – any application can read and write data there. The complete data model can be found here. I’ll just briefly point out the key elements:
- Datastreams – a datastream is basically a source of data. It can be raw data like phone sensor readings or user input. There are also some more sophisticated datastreams, as Google does some processing of its own. Each datastream consists of many datapoints, which are values of specific datatypes over time. Unfortunately, there is little documentation on how the values in specific streams are calculated. Only this post by David Schmidt gives some hints on what they mean.
- Datasets – datasets are just the values of the datapoints over time. You cannot simply get all the data from a datastream; you have to query for a specific time range. Google expects you to provide the start and end dates as 64 bit integers holding nanoseconds since epoch (see the small sketch after this list). This requires some calculation but makes the service future proof. So far I have found nothing with such precision in my data.
- Sessions – sessions are meant to record user activities. You input a start date, an end date and an activity type to record what you were doing. Personally, I think the Google guys copied the idea of recording activities by pressing the button on the Jawbone band. As of now my Google Fit app has not shown any sessions, even though I have input a few activities in there.
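Just to make the nanosecond requirement concrete, here is a quick R sketch (the helper name is mine, not part of any Google library) that turns a date into the string the dataset queries expect:

to_nanos <- function(datetime) {
  # Seconds since epoch for the given timestamp
  secs <- as.numeric(as.POSIXct(datetime, tz = "UTC"))
  # Append nine zeros instead of multiplying by 1e9,
  # so we don't lose precision on such large numbers
  paste0(format(secs, scientific = FALSE), "000000000")
}
to_nanos("2014-11-01 00:00:00")
# [1] "1414800000000000000"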
Getting the data
Having read the documentation, I started playing around to figure out what Google knows about my living and training habits. The natural first choice to see what’s in the data was the Google APIs Explorer. Just a few minutes with the tool proved the old saying about data analysis still true – 80% of the time is spent on data preparation. Google’s tool wasn’t too friendly with input parameters and would kill the browser whenever the output was larger than a few hundred kilobytes. I certainly needed another solution.
So I had to develop one. The choice (after a few tries) was a Ruby on Rails application to speed up data extraction. Getting an application prototype up and running is a pretty smooth process in RoR, though I still had to implement two mechanisms:
- OAuth2 authentication – to get the data from Google Fit you have to impersonate the user
- Some sort of data extraction interface
So here comes the app.
The UI is based on the most common stack of open source components like Bootstrap, Sass, etc.
Authentication is via Google Account only. The default view for the user is the list of their datastreams (datasources); after logging in, the user must refresh the list manually. As the picture shows, I already have 27 streams in Google Fit, most of them coming from com.google.android.gms, which is the Google Play service running in the background on the phone. The list lets you either get the JSON specification of a datastream or dig deeper into the datastream extraction form.
Submitting the form gives you a nicely formatted JSON file with your datapoints.
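For the curious, behind that form there is essentially a single call to the Fit REST API. A rough R equivalent (just a sketch – the token and datastream id below are placeholders, and to_nanos is the little helper from the beginning of the post):

library(httr)

access_token  <- "ya29.PLACEHOLDER"    # obtained through the OAuth2 flow
datastream_id <- "YOUR_DATASTREAM_ID"  # one of the ids from the datastreams list

url <- sprintf(
  "https://www.googleapis.com/fitness/v1/users/me/dataSources/%s/datasets/%s-%s",
  URLencode(datastream_id, reserved = TRUE),
  to_nanos("2014-11-01 00:00:00"),
  to_nanos("2015-02-01 00:00:00"))

resp <- GET(url, add_headers(Authorization = paste("Bearer", access_token)))
writeLines(content(resp, as = "text", encoding = "UTF-8"), "datapoints.json")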
Now that I had the data and an easy way to get it out, it was time to dig into it.
Exploring the data
27 datastreams is quite a lot. Some of them overlap or are preprocessed, so browsing through them was the first step of any further analysis. This is where R came in handy. To get the data in and also structure the code, I decided to bundle it into an R package I called Fitness. It can be found at:
So simply type in
devtools::install_github('ms32035/R-fitness')
# And don't forget to load it
library(fitness)
to get the package installed in your R environment.
After browsing the downloaded JSON files and their respective structures, I decided to go on with the following datastreams:
pruned_distance
The dataset contains the distance in meters traveled during the reading period. The data in this datastream is filtered and preprocessed by Google, so I expect it to be clean and free of outliers.
> distance <- import_distance(path_to_JSON_file)
# Few sample rows
            startTime             endTime      val
1 2014-11-27 08:31:02 2014-11-27 08:38:34 265.9062
2 2014-11-27 08:38:34 2014-11-27 08:42:03 574.2574
3 2014-11-27 08:42:03 2014-11-27 08:45:07 626.4064
4 2014-11-27 08:45:07 2014-11-27 08:47:39 624.4772
5 2014-11-27 08:47:39 2014-11-27 08:50:19 607.6943
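A minimal sanity check on the imported data frame is easy to run if you want to verify that yourself:

summary(distance$val)     # distribution of per-reading distances, in meters
sum(is.na(distance$val))  # should be 0 if the stream really is clean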
merge_location_samples
The dataset contains the recorded cell phone positions. Like the previous stream, it is also preprocessed. Despite having both start and end times, the values are always single time points.
> location <- import_location(path_to_JSON_file)
# Few sample rows (fake coordinate values - that's sensitive information)
            startTime             endTime      lat     long    alt
1 2014-11-14 19:35:21 2014-11-14 19:35:21 42.60910 -21.3315 31.500
2 2014-11-14 19:37:54 2014-11-14 19:37:54 42.60912 -21.3315 25.505
3 2014-11-14 19:40:25 2014-11-14 19:40:25 42.61017 -21.3314 38.000
4 2014-11-14 19:43:49 2014-11-14 19:43:49 42.61018 -21.3319 30.000
5 2014-11-14 19:49:50 2014-11-14 19:49:50 42.60911 -21.3316 48.000
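The single-time-point claim is easy to verify on the imported data frame:

# Should return TRUE if start and end times really always coincide
all(location$startTime == location$endTime)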
merge_activity_segments
This is a pretty interesting datastream. The values are com.google.activity.segment. The details of how they are calculated aren’t public, but since the source is com.google.android.gms, I guess it is a summary of the activities detected on the phone. The levels of my variable were:
> levels(activity$activity)
[1] "0"  "1"  "3"  "7"  "8"  "24"
which respectively stand for: In vehicle, Biking, Still (not moving), Walking, Running, Dancing. It seems pretty obvious that the values are calculated based on distance / speed. Still, two of them were a bit mysterious to me:
- Biking – I don’t remember doing any biking in that period. There was just one observation – a false reading or interpretation by Google.
- Dancing – Can Google really detect dancing from accelerometer data? No, that was my manual input in the Google Fit app. A more interesting question is why Google decided to handle this as a datapoint value and not as a session.
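For readability in the later steps, the numeric codes can be mapped to labels with a small sketch like this (the label vector is my own, based on the list above):

activity_labels <- c("0" = "In vehicle", "1" = "Biking", "3" = "Still",
                     "7" = "Walking", "8" = "Running", "24" = "Dancing")
activity$label <- factor(activity_labels[as.character(activity$activity)])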
The contingency table for my activities looks like this:
> table(activity$activity)
   0    1    3    7    8   24 
 288    1 1017  903    9    1 
Importing works the same as before.
> activity <- import_activity(path_to_JSON_file)
# Few sample rows
            startTime             endTime activity
1 2014-01-06 01:46:05 2014-01-06 01:48:06        7
2 2014-01-06 01:48:06 2014-01-06 02:00:24        3
3 2014-01-06 02:00:24 2014-01-06 02:02:23        7
4 2014-01-06 02:02:23 2014-01-06 02:10:34        3
5 2014-01-06 02:10:34 2014-01-06 02:19:39        7
merge_step_deltas
As expected, the datastream gives step counts for the reading period. Again, an import function is provided.
> steps <- import_steps(path_to_JSON_file)
# Few sample rows
            startTime             endTime steps
1 2014-01-06 01:45:18 2014-01-06 01:46:35   151
2 2014-01-06 01:46:35 2014-01-06 01:46:42    12
3 2014-01-06 01:46:42 2014-01-06 01:47:35    49
4 2014-01-06 01:47:51 2014-01-06 01:48:51     1
5 2014-01-06 01:48:51 2014-01-06 01:48:51     1
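With the deltas in a data frame, daily totals are a one-liner – a quick sketch:

daily_steps <- aggregate(list(steps = steps$steps),
                         by = list(day = as.Date(steps$startTime)),
                         FUN = sum)
head(daily_steps)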
derive_step_cadence<-raw:com.google.step_count.cumulative:LGE:Nexus 5:XXXXXXXX:Step Counter
The last datastream is very similar to the step counts, except that it provides cadence, which measures pace in steps per minute.
> cadence <- import_cadence(path_to_JSON_file)
# Few sample rows
            startTime             endTime    cadence
1 2014-01-06 01:45:18 2014-01-06 01:46:35 116.882370
2 2014-01-06 01:46:35 2014-01-06 01:46:42  98.839378
3 2014-01-06 01:46:42 2014-01-06 01:47:35  55.765633
4 2014-01-06 01:47:35 2014-01-06 01:48:51   0.793142
5 2014-01-06 01:48:51 2014-01-06 01:48:51 170.223389
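Since both streams come from the same step counter, cadence should roughly equal steps divided by the interval length in minutes. A quick cross-check on the rows that share timestamps (a sketch using the data frames imported above; zero-length intervals will show up as Inf):

sc <- merge(steps, cadence, by = c("startTime", "endTime"))
sc$minutes <- as.numeric(difftime(sc$endTime, sc$startTime, units = "mins"))
sc$implied <- sc$steps / sc$minutes   # should be close to sc$cadence
head(sc[, c("steps", "cadence", "implied")])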
To decide on the datastreams above, I explored all my data in Google Fit. The streams start in November 2014 and together hold about 300 000 observations. In total that’s about 60 MB of JSON data. But if you consider that it’s not even been three months since the start of the service, the numbers get a bit scarier:
- Over 3500 readings per day
- Over 150 readings per hour
- Over 2.5 readings per minute
Preparing the data
All of the above data are time series. The simplest way to check whether there is a direct link between them is to just try to merge them. Below are the observation counts for the original and merged (inner joined) time series. I tried merging by start dates, by end dates, and by both.
> row_dependencies(distance,location,activity,steps,cadence)
[1] "Distance: 2419"
[1] "Location: 14659"
[1] "Activity: 2219"
[1] "Steps: 11254"
[1] "Cadence: 8383"
[1] "Distance + Location (startTime + endTime): 0"
[1] "Distance + Location (startTime): 2183"
[1] "Distance + Location (endTime): 2150"
[1] "Distance + Activity (startTime): 4"
[1] "Distance + Activity (endTime): 1"
[1] "Location + Activity (startTime): 5"
[1] "Location + Activity (endTime): 5"
[1] "Distance + Steps (startTime): 29"
[1] "Location + Steps (startTime): 45"
[1] "Activity + Steps (startTime): 33"
[1] "Activity + Steps (endTime): 38"
[1] "Steps + Cadence (startTime + endTime): 4698"
[1] "Steps + Cadence (startTime): 4812"
[1] "Steps + Cadence (endTime): 7595"
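row_dependencies is a helper from the package; each of the pair counts it prints essentially corresponds to an inner join, along the lines of this sketch:

# Distance rows that share a start time with a location row
nrow(merge(distance, location, by = "startTime"))
# Steps rows that match cadence on both timestamps
nrow(merge(steps, cadence, by = c("startTime", "endTime")))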
The conclusions are:
- Distance can be merged with location without any significant data loss
- Steps and cadence can be merged by some pretty simple algorithm operating on start and end times
- To overlay the two merged groups with detected activity some sort of binning will have to be used
Now that I knew the data very well, I had to ask myself the question “What do I want to get out of it?”. My 3 questions were:
- What are the times of the week when I walk the most?
- What are the top 5 places I spend my time at? I mean the 3 other than work and home.
- What’s my daily time distribution of activities? (Walking, running, etc)
One other question was “How much time will I spend running in the coming weeks?”. Since it’s February now and I only have data since November, the results wouldn’t make too much sense, as the weather is only going to improve.
Weekly steps heat map
To prepare a heat map I had to assign the readings to time bins. If a reading falls entirely within a single time bin there is no problem. However, if a reading spans multiple bins, the function divides the measured value equally across all of them. The R code for the binning function is:
create_bins <- function(sourceData, startTimeColumn, endTimeColumn, valueColumn, nbins = 24) {
  # One column per day of the week, one row per time bin
  binValues <- matrix(0, nrow = nbins, ncol = 7)
  binSize <- 60 * 24 / nbins  # bin length in minutes
  for (i in 1:nrow(sourceData)) {
    startTime <- sourceData[i, startTimeColumn]
    startDay <- as.numeric(strftime(startTime, '%u'))
    startBinDay <- (as.numeric(difftime(startTime, trunc(startTime, units = "day"), units = "mins")) %/% binSize) + 1
    startBin <- (startDay - 1) * nbins + startBinDay
    endTime <- sourceData[i, endTimeColumn]
    endDay <- as.numeric(strftime(endTime, '%u'))
    endBinDay <- (as.numeric(difftime(endTime, trunc(endTime, units = "day"), units = "mins")) %/% binSize) + 1
    endBin <- (endDay - 1) * nbins + endBinDay
    # Split the value equally across all bins the reading spans
    binNum <- endBin - startBin + 1
    for (j in startBin:endBin)
      binValues[j] <- binValues[j] + sourceData[i, valueColumn] / binNum
  }
  return(binValues)
}
A quick check comparing the sums of the unbinned and binned steps proved the function to work correctly.
> create_bins(steps,'startTime','endTime','steps') -> steps_binned
> sum(steps_binned)
[1] 436204
> sum(steps$steps)
[1] 436204
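The D3 part is out of scope for this post, but reshaping the 24x7 matrix into the flat structure a chart typically consumes is straightforward – a sketch assuming the jsonlite package (the file name is arbitrary):

library(jsonlite)

heat <- data.frame(
  day   = rep(1:7, each = nrow(steps_binned)),    # matrix columns are days of the week
  bin   = rep(1:nrow(steps_binned), times = 7),   # matrix rows are the time bins
  steps = as.vector(steps_binned)                 # column-major order, matching the above
)
writeLines(toJSON(heat), "steps_heatmap.json")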
So here is a nice heat map generated using D3.
Conclusions? My most intensive times are the daily commute to work, lunch breaks and some weekend activities.
Where do I spend my time
In this part of the analysis I followed the ideas of Jeffrey Evans, with the remark that the literature widely accepts k-means based methods for spatial data clustering. The clara method proved quite efficient, so I modified his function a bit.
opt_cluster <- function(x, k_max = 10, k_chosen = 10, ...) {
  if (!require(cluster)) stop("cluster PACKAGE MISSING")
  asw <- numeric(k_max)
  for (k in 2:k_max) {
    asw[k] <- clara(x, k, ...)$silinfo$avg.width
  }
  k.best <- which.max(asw)
  print(paste("OPTIMAL-K", k.best, sep = ": "))
  plot(1:k_max, asw, type = "s",
       main = "Clustering Optimization using K-Mediods",
       xlab = "K (number of clusters)",
       ylab = "mean silhouette width")
  axis(1, k.best, paste("best", k.best, sep = "\n"), col = "red", col.axis = "red")
  return(clara(x, k_chosen, ...))
}
While the function recommended 3 clusters, I decided to override the recommendation. I won’t post specific locations here because that data is a bit too sensitive, but it is pretty easy to explain what happened in the results. I traveled a lot around the world in that period. The places I normally go to got grouped around the location of my home, and the two other clusters were locations I traveled to (very precise, by the way).
The clustering code was
cl <- opt_cluster(location[,c('lat','long')], k_max = 40, k_chosen = 23)
location_clustered <- cbind(location, cluster = cl$clustering)
An aggregation of the point counts gave me:
> sort(table(location_clustered[,'cluster']), decreasing = TRUE)
  18   16  20  17   2   7   4  22   1  19  6  3 13 10 14  5  8 15 23 11  9 12 21
7697 4088 768 542 255 232 202 126 115 111 79 77 50 48 46 37 33 33 33 26 25 18 18
Where the clusters are:
- 18 – Home
- 16 – Work
- 20 – My parent’s home
- 17 – A point in the middle of my daily commute between home and work – that’s scary
- 2 – A hotel I stayed in during a vacation. That’s a pretty surprising entry, because I stayed at other hotels longer. It might be related to WiFi readings.
Daily activities distribution
Finally, let’s calculate the percentages of time I spent on particular activities …
> activity_duration <- cbind(activity, duration = as.numeric(difftime(activity$endTime, activity$startTime, units = "mins")))
> total_duration <- aggregate(duration ~ activity, activity_duration, sum)
> duration_percentage <- cbind(total_duration, percentage = round(total_duration[,'duration'] / sum(total_duration[,'duration']), 4))
> duration_percentage
  activity  duration percentage
1        0  7656.567     0.0703
2        1     1.350     0.0000
3        3 97242.733     0.8924
4        7  3926.300     0.0360
5        8    16.750     0.0002
6       24   120.000     0.0011
… and put it on a nice D3 histogram.
Not too much running. I need to improve.
Wrap up
Google Fit is a service that offers a broad set of data that could be the basis for a lot of interesting research. It’s still a new service – the span of the data is pretty short, the documentation is incomplete and there is no friendly way to get the data out.
Powered by:
- Ruby on Rails
- R
- D3