Intro
Wearables are hot these days. I have already got used to wearing my Jawbone wristband 24/7, and after a year with a Pebble I’ll never go back to my old Casio watch. In this post I’ll put down a few words about my experiments with Google’s platform for everyday activity data collected from wearables – Fit.
Google Fit is essentially a huge datastore for sensor readings, the most notable source being, of course, Android based phones. It’s also a pretty open datastore – any application can read and write data there. The complete data model can be found here. I’ll just briefly point out the key elements:
- Datastreams – a datastream is basically a source of data. It can be raw data like phone sensor readings or user input. There are also some more sophisticated datastreams, as Google does some processing of its own. Each datastream consists of many datapoints, which are values of specific datatypes over time. Unfortunately, there is little documentation on how the values in specific streams are calculated. Only this post by David Schmidt gives some hints on what they mean.
- Datasets – datasets are just the values of the datapoints over time. You cannot simply get all the data from a datastream; you have to query for a specific time range. Google expects you to provide the start and end dates as 64 bit integers holding nanoseconds since epoch (see the small sketch after this list). This requires some calculation but makes the service future proof. So far I have found nothing with such precision in my data.
- Sessions – sessions are meant to record user activities. You input a start date, an end date and an activity type to record what you were doing. Personally, I think the Google guys copied the idea of recording activities by pressing the button on the Jawbone band. As of now my Google Fit app has not shown any sessions, even though I have input a few activities in there.
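Just to make the nanosecond requirement concrete, here is a quick R sketch (the helper name is mine, not part of any Google library) that turns a date into the string the dataset queries expect:

to_nanos <- function(datetime) {
  # Seconds since epoch for the given timestamp
  secs <- as.numeric(as.POSIXct(datetime, tz = "UTC"))
  # Append nine zeros instead of multiplying by 1e9,
  # so we don't lose precision on such large numbers
  paste0(format(secs, scientific = FALSE), "000000000")
}
to_nanos("2014-11-01 00:00:00")
# [1] "1414800000000000000"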
Getting the data
Having read the documentation, I started playing around to figure out what Google knows about my living and training habits. The natural first choice to see what’s in the data was the Google APIs Explorer. Just a few minutes with the tool proved the old saying about data analysis still true – 80% of the time is spent on data preparation. Google’s tool wasn’t too friendly with input parameters and would kill the browser whenever the output was larger than a few hundred kilobytes. I certainly needed another solution.
So I had to develop one. The choice (after a few tries) was a Ruby on Rails application to speed up data extraction. Getting an application prototype up and running is a pretty smooth process in RoR, though I still had to implement two mechanisms:
- OAuth2 authentication – to get the data from Google Fit you have to impersonate the user
- Some sort of data extraction interface
So here comes the app.
The UI is based on the most common stack of open source components like Bootstrap, Sass, etc.
Authentication is via Google Account only. The default view for the user is the list of their datastreams (datasources); after logging in, the user must refresh the list manually. As the picture shows, I already have 27 streams in Google Fit, most of them coming from com.google.android.gms, which is the Google Play service running in the background on the phone. The list lets you either get the JSON specification of a datastream or dig deeper into the datastream extraction form.
Submitting the form gives you a nicely formatted JSON file with your datapoints.
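For the curious, behind that form there is essentially a single call to the Fit REST API. A rough R equivalent (just a sketch – the token and datastream id below are placeholders, and to_nanos is the little helper from the beginning of the post):

library(httr)

access_token  <- "ya29.PLACEHOLDER"    # obtained through the OAuth2 flow
datastream_id <- "YOUR_DATASTREAM_ID"  # one of the ids from the datastreams list

url <- sprintf(
  "https://www.googleapis.com/fitness/v1/users/me/dataSources/%s/datasets/%s-%s",
  URLencode(datastream_id, reserved = TRUE),
  to_nanos("2014-11-01 00:00:00"),
  to_nanos("2015-02-01 00:00:00"))

resp <- GET(url, add_headers(Authorization = paste("Bearer", access_token)))
writeLines(content(resp, as = "text", encoding = "UTF-8"), "datapoints.json")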
Now that I had the data and an easy way to get it out, it was time to dig into it.
Exploring the data
27 datastreams is quite a lot. Some of them overlap or are preprocessed, so browsing through them was the first step of any further analysis. This is where R came in handy. To get the data in and also structure the code, I decided to bundle it into an R package I called Fitness. It can be found at:
So simply type in
devtools::install_github('ms32035/R-fitness')
# And don't forget to load it
library(fitness)
to get the package installed in your R environment.
After browsing the downloaded JSON files and their respective structures, I decided to go on with the following datastreams:
pruned_distance
The dataset contains the distance in meters traveled during the reading period. The data in this datastream is filtered and preprocessed by Google, so I expect it to be clean and free of outliers.
> distance <- import_distance(path_to_JSON_file)
# Few sample rows
            startTime             endTime      val
1 2014-11-27 08:31:02 2014-11-27 08:38:34 265.9062
2 2014-11-27 08:38:34 2014-11-27 08:42:03 574.2574
3 2014-11-27 08:42:03 2014-11-27 08:45:07 626.4064
4 2014-11-27 08:45:07 2014-11-27 08:47:39 624.4772
5 2014-11-27 08:47:39 2014-11-27 08:50:19 607.6943
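A minimal sanity check on the imported data frame is easy to run if you want to verify that yourself:

summary(distance$val)     # distribution of per-reading distances, in meters
sum(is.na(distance$val))  # should be 0 if the stream really is clean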
merge_location_samples
The dataset contains the recorded cell phone positions. Like the previous stream, it is also preprocessed. Despite having both start and end times, the values are always single time points.
> location <- import_location(path_to_JSON_file)
# Few sample rows (fake coordinate values - that's sensitive information)
            startTime             endTime      lat     long    alt
1 2014-11-14 19:35:21 2014-11-14 19:35:21 42.60910 -21.3315 31.500
2 2014-11-14 19:37:54 2014-11-14 19:37:54 42.60912 -21.3315 25.505
3 2014-11-14 19:40:25 2014-11-14 19:40:25 42.61017 -21.3314 38.000
4 2014-11-14 19:43:49 2014-11-14 19:43:49 42.61018 -21.3319 30.000
5 2014-11-14 19:49:50 2014-11-14 19:49:50 42.60911 -21.3316 48.000
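The single-time-point claim is easy to verify on the imported data frame:

# Should return TRUE if start and end times really always coincide
all(location$startTime == location$endTime)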
merge_activity_segments
This is a pretty interesting datastream. The values are com.google.activity.segment. The details of how they are calculated aren’t public, but since the source is com.google.android.gms, I guess it is a summary of the activities detected on the phone. The levels of my variable were:
> levels(activity$activity)
[1] "0"  "1"  "3"  "7"  "8"  "24"
which respectively stand for: In vehicle, Biking, Still (not moving), Walking, Running, Dancing. It seems pretty obvious that the values are calculated based on distance / speed. Still, two of them were a bit mysterious to me:
- Biking – I don’t remember doing any biking in that period. There was just one observation – a false reading or interpretation by Google.
- Dancing – Can Google really detect dancing from accelerometer data? No, that was my manual input in the Google Fit app. A more interesting question is why Google decided to handle this as a datapoint value and not as a session.
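For readability in the later steps, the numeric codes can be mapped to labels with a small sketch like this (the label vector is my own, based on the list above):

activity_labels <- c("0" = "In vehicle", "1" = "Biking", "3" = "Still",
                     "7" = "Walking", "8" = "Running", "24" = "Dancing")
activity$label <- factor(activity_labels[as.character(activity$activity)])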
The contingency table for my activities looks like this:
> table(activity$activity)
   0    1    3    7    8   24 
 288    1 1017  903    9    1 
Importing works the same as before.
> activity <- import_activity(path_to_JSON_file)
# Few sample rows
            startTime             endTime activity
1 2014-01-06 01:46:05 2014-01-06 01:48:06        7
2 2014-01-06 01:48:06 2014-01-06 02:00:24        3
3 2014-01-06 02:00:24 2014-01-06 02:02:23        7
4 2014-01-06 02:02:23 2014-01-06 02:10:34        3
5 2014-01-06 02:10:34 2014-01-06 02:19:39        7
merge_step_deltas
As expected, the datastream gives step counts for the reading period. Again, an import function is provided.
> steps <- import_steps(path_to_JSON_file)
# Few sample rows
            startTime             endTime steps
1 2014-01-06 01:45:18 2014-01-06 01:46:35   151
2 2014-01-06 01:46:35 2014-01-06 01:46:42    12
3 2014-01-06 01:46:42 2014-01-06 01:47:35    49
4 2014-01-06 01:47:51 2014-01-06 01:48:51     1
5 2014-01-06 01:48:51 2014-01-06 01:48:51     1
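With the deltas in a data frame, daily totals are a one-liner – a quick sketch:

daily_steps <- aggregate(list(steps = steps$steps),
                         by = list(day = as.Date(steps$startTime)),
                         FUN = sum)
head(daily_steps)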
derive_step_cadence<-raw:com.google.step_count.cumulative:LGE:Nexus 5:XXXXXXXX:Step Counter
The last datastream is very similar to the step counts, except that it provides cadence, which measures pace in steps per minute.
> cadence <- import_cadence(path_to_JSON_file)
# Few sample rows
            startTime             endTime    cadence
1 2014-01-06 01:45:18 2014-01-06 01:46:35 116.882370
2 2014-01-06 01:46:35 2014-01-06 01:46:42  98.839378
3 2014-01-06 01:46:42 2014-01-06 01:47:35  55.765633
4 2014-01-06 01:47:35 2014-01-06 01:48:51   0.793142
5 2014-01-06 01:48:51 2014-01-06 01:48:51 170.223389
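Since both streams come from the same step counter, cadence should roughly equal steps divided by the interval length in minutes. A quick cross-check on the rows that share timestamps (a sketch using the data frames imported above; zero-length intervals will show up as Inf):

sc <- merge(steps, cadence, by = c("startTime", "endTime"))
sc$minutes <- as.numeric(difftime(sc$endTime, sc$startTime, units = "mins"))
sc$implied <- sc$steps / sc$minutes   # should be close to sc$cadence
head(sc[, c("steps", "cadence", "implied")])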
To decide on the datastreams above, I explored all my data in Google Fit. The streams start in November 2014 and together hold about 300 000 observations. In total that’s about 60 MB of JSON data. But if you consider that it’s not even been three months since the start of the service, the numbers get a bit scarier:
- Over 3500 readings per day
- Over 150 readings per hour
- Over 2.5 readings per minute
Preparing the data
All of the above data are time series. The simplest way to check whether there is a direct link between them is to just try to merge them. Below are the observation counts for the original and merged (inner joined) time series. I tried merging by start dates, by end dates, and by both.
> row_dependencies(distance,location,activity,steps,cadence)
[1] "Distance: 2419"
[1] "Location: 14659"
[1] "Activity: 2219"
[1] "Steps: 11254"
[1] "Cadence: 8383"
[1] "Distance + Location (startTime + endTime): 0"
[1] "Distance + Location (startTime): 2183"
[1] "Distance + Location (endTime): 2150"
[1] "Distance + Activity (startTime): 4"
[1] "Distance + Activity (endTime): 1"
[1] "Location + Activity (startTime): 5"
[1] "Location + Activity (endTime): 5"
[1] "Distance + Steps (startTime): 29"
[1] "Location + Steps (startTime): 45"
[1] "Activity + Steps (startTime): 33"
[1] "Activity + Steps (endTime): 38"
[1] "Steps + Cadence (startTime + endTime): 4698"
[1] "Steps + Cadence (startTime): 4812"
[1] "Steps + Cadence (endTime): 7595"
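row_dependencies is a helper from the package; each of the pair counts it prints essentially corresponds to an inner join, along the lines of this sketch:

# Distance rows that share a start time with a location row
nrow(merge(distance, location, by = "startTime"))
# Steps rows that match cadence on both timestamps
nrow(merge(steps, cadence, by = c("startTime", "endTime")))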
The conclusions are:
- Distance can be merged with location without any significant data loss
- Steps and cadence can be merged by some pretty simple algorithm operating on start and end times
- To overlay the two merged groups with detected activity some sort of binning will have to be used
Now that I knew the data very well, I had to ask myself the question “What do I want to get out of it?”. My 3 questions were:
- What are the times of the week when I walk the most?
- What are the top 5 places I spend my time at? I mean the 3 other than work and home.
- What’s my daily time distribution of activities? (Walking, running, etc)
One other question was “How much time will I spend running in the coming weeks?”. Since it’s February now and I only have data since November, the results wouldn’t make too much sense, as the weather is only going to improve.
Weekly steps heat map
To prepare a heat map I had to assign the readings to time bins. If a reading falls entirely within a single time bin there is no problem. However, if a reading spans multiple bins, the function divides the measured value equally across all of them. The R code for the binning function is:
create_bins <- function(sourceData, startTimeColumn, endTimeColumn, valueColumn, nbins = 24) {
  # One column per day of the week, one row per time bin
  binValues <- matrix(0, nrow = nbins, ncol = 7)
  binSize <- 60 * 24 / nbins  # bin length in minutes
  for (i in 1:nrow(sourceData)) {
    startTime <- sourceData[i, startTimeColumn]
    startDay <- as.numeric(strftime(startTime, '%u'))
    startBinDay <- (as.numeric(difftime(startTime, trunc(startTime, units = "day"), units = "mins")) %/% binSize) + 1
    startBin <- (startDay - 1) * nbins + startBinDay
    endTime <- sourceData[i, endTimeColumn]
    endDay <- as.numeric(strftime(endTime, '%u'))
    endBinDay <- (as.numeric(difftime(endTime, trunc(endTime, units = "day"), units = "mins")) %/% binSize) + 1
    endBin <- (endDay - 1) * nbins + endBinDay
    # Split the value equally across all bins the reading spans
    binNum <- endBin - startBin + 1
    for (j in startBin:endBin)
      binValues[j] <- binValues[j] + sourceData[i, valueColumn] / binNum
  }
  return(binValues)
}
A quick check comparing the sums of the unbinned and binned steps proved the function to work correctly.
> create_bins(steps,'startTime','endTime','steps') -> steps_binned
> sum(steps_binned)
[1] 436204
> sum(steps$steps)
[1] 436204
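The D3 part is out of scope for this post, but reshaping the 24x7 matrix into the flat structure a chart typically consumes is straightforward – a sketch assuming the jsonlite package (the file name is arbitrary):

library(jsonlite)

heat <- data.frame(
  day   = rep(1:7, each = nrow(steps_binned)),    # matrix columns are days of the week
  bin   = rep(1:nrow(steps_binned), times = 7),   # matrix rows are the time bins
  steps = as.vector(steps_binned)                 # column-major order, matching the above
)
writeLines(toJSON(heat), "steps_heatmap.json")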
So here is a nice heat map generated using D3.
Conclusions? My most intensive times are the daily commute to work, lunch breaks and some weekend activities.
Where do I spend my time
In this part of the analysis I followed the ideas of Jeffrey Evans, with the remark that the literature widely accepts k-means based methods for spatial data clustering. The clara method proved quite efficient, so I modified his function a bit.
opt_cluster <- function(x, k_max = 10, k_chosen = 10, ...) {
  if (!require(cluster)) stop("cluster PACKAGE MISSING")
  asw <- numeric(k_max)
  for (k in 2:k_max) {
    asw[k] <- clara(x, k, ...)$silinfo$avg.width
  }
  k.best <- which.max(asw)
  print(paste("OPTIMAL-K", k.best, sep = ": "))
  plot(1:k_max, asw, type = "s",
       main = "Clustering Optimization using K-Mediods",
       xlab = "K (number of clusters)",
       ylab = "mean silhouette width")
  axis(1, k.best, paste("best", k.best, sep = "\n"), col = "red", col.axis = "red")
  return(clara(x, k_chosen, ...))
}
While the function recommended 3 clusters, I decided to override the recommendation. I won’t post specific locations here because that data is a bit too sensitive, but it is pretty easy to explain what happened in the results. I traveled a lot around the world in that period. The places I normally go to got grouped around the location of my home, and the two other clusters were locations I traveled to (very precise, by the way).
The clustering code was
cl <- opt_cluster(location[,c('lat','long')], k_max = 40, k_chosen = 23)
location_clustered <- cbind(location, cluster = cl$clustering)
An aggregation of the point counts gave me:
> sort(table(location_clustered[,'cluster']), decreasing = TRUE)
  18   16  20  17   2   7   4  22   1  19  6  3 13 10 14  5  8 15 23 11  9 12 21
7697 4088 768 542 255 232 202 126 115 111 79 77 50 48 46 37 33 33 33 26 25 18 18
Where the clusters are:
- 18 – Home
- 16 – Work
- 20 – My parent’s home
- 17 – A point in the middle of my daily commute between home and work – that’s scary
- 2 – A hotel I stayed in during a vacation. That’s a pretty surprising entry, because I stayed at other hotels longer. It might be related to WiFi readings.
Daily activities distribution
Finally, let’s calculate the percentages of time I spent on particular activities …
> activity_duration <- cbind(activity, duration = as.numeric(difftime(activity$endTime, activity$startTime, units = "mins")))
> total_duration <- aggregate(duration ~ activity, activity_duration, sum)
> duration_percentage <- cbind(total_duration, percentage = round(total_duration[,'duration'] / sum(total_duration[,'duration']), 4))
> duration_percentage
  activity  duration percentage
1        0  7656.567     0.0703
2        1     1.350     0.0000
3        3 97242.733     0.8924
4        7  3926.300     0.0360
5        8    16.750     0.0002
6       24   120.000     0.0011
… and put it on a nice D3 histogram.
Not too much running. I need to improve.
Wrap up
Google Fit is a service that offers a broad set of data that could be the basis for a lot of interesting research. It’s still a new service – the span of the data is pretty short, the documentation is incomplete and there is no friendly way to get the data out.
Powered by:
- Ruby on Rails
- R
- D3