I. Introduction

Insights into my fitness journey (2012 - 17)

I have been an avid runner and fitness enthusiast for the past 5 years, and a fun part of that journey has been discovering how much we can learn about ourselves by exploring the data we capture.

I started capturing data about myself around 2012, when wearable devices first appeared, and have always wanted to carry out an analysis to gain insights into my running patterns, my strength training schedules, my diet logs, and any other metadata I can collect about myself, such as heart rate, cadence for runs, and step counts.

About Quantified Self: While learning about ways to explore the data you capture, I came across the term Quantified Self and have been following the blog ever since. It is amazing how capturing and analyzing data about yourself can help you make data-driven decisions to improve your lifestyle, and many people are joining this movement to bring data-driven change to their lives.

There are several questions I would like to answer in this project's analysis.

I take inspiration from the Quantified Self running archives and will try to analyze all of these aspects of my data.

Where can the reader find the data? I would like to share the data captured about all my activities in case someone wants to analyze it. The data used for this project is hosted on Github at this link. The code file has also been hosted on Github and can be accessed at this link. You can always mail me at anujk3@gmail.com for the most up-to-date data.

II. Team

I will be working by myself on my dataset and gaining insights through the analysis I do. The stages of the project have been highlighted below:

  • I had to go to the activities page on Garmin Connect and download around 22 pages of data one by one, each in CSV format. Sadly, Garmin does not offer an API like Fitbit does, which makes it difficult to gather data for analysis.
  • After getting the data from Garmin, the next step was combining all the CSV files and converting them into a dataframe that could be analyzed further. All the processing was done in Python using pandas.
  • The initial processing involved cleaning the data: removing extraneous activities like walking and swimming, which I haven't tracked much, correcting data errors, and converting datetime fields for ease of datetime analysis.
  • After the initial cleaning, I focused on data transformations, adding derived columns that would be useful for exploratory data analysis.
  • The next part focused on an analysis of missing values, followed by univariate analysis of the distributions of the continuous columns and multivariate analysis of the continuous variables in the dataset.
  • The final part was figuring out the important questions I can answer using the dataset and working on specific problems to come up with relevant visualizations that explain them.

Programming Language Used:

Data analysis is something I am really passionate about, and Python has been my go-to language for it. Through the course Exploratory Data Analysis and Visualizations, Prof. Joyce Robbins helped me realize how R can also help in quickly analyzing data; the best part of R for me was getting to learn ggplot.

I still prefer Python as my language for data analysis, but I hacked my way into using ggplot within the Python analysis: most of the dataframe work is done in Python, with plots made in ggplot, all within a single Python notebook.

I will also be working on interactive visualizations using D3.js, and, wherever possible, matplotlib and seaborn within Python.
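To make that workflow concrete, here is a minimal sketch of the rpy2 pattern used throughout this notebook to hand a pandas DataFrame to ggplot; `df` and `someColumn` are placeholder names, and the actual setup appears in the next section.

%reload_ext rpy2.ipython

%%R -i df -w 900 -h 480 -u px
library(ggplot2)
ggplot(df, aes(x = someColumn)) + geom_histogram()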

III. Analysis of Data Quality

i. Import Statements and Setup

In [1]:
import math
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# rpy2 bridges pandas DataFrames to R, so later cells can use %%R with ggplot
from rpy2.robjects import pandas2ri
pandas2ri.activate()

%reload_ext rpy2.ipython
%matplotlib inline

warnings.filterwarnings('ignore')
plt.rcParams["figure.figsize"] = (20, 20)
pd.options.display.max_rows = 999
sns.set_style("white")

ii. Load Data

In [3]:
path = "data/a"

frame = pd.DataFrame()
list_ = []

for i in range(1, 27):
    file_path = path + str(i)+ ".csv"
    #print(file_path)
    df = pd.read_csv(file_path, index_col=None, header=0, skiprows=2)
    list_.append(df)
    frame = pd.concat(list_)
    
frame.reset_index(inplace=True)
frame.drop("index", axis=1, inplace=True)
frame.drop(" Favorite", axis=1, inplace=True)
frame.drop("Unnamed: 11", axis=1, inplace=True)
In [4]:
#frame.columns
In [5]:
# Assign short, consistent column names
col_names = ['activityType', 'startTime', 'activityTime', 'actDistance', 'elevationGain', 'avgSpeed', 'avgHR', 'maxHR', 
            'steps', 'calories']
frame.columns = col_names

iii. Data Cleaning

Data cleaning was performed in the following steps, to ease the process of creating the visualizations later.

Ignore the swimming activity (only 1 row)

In [6]:
frame.drop(450, axis=0, inplace=True)  # row 450 holds the lone swimming activity

Convert Elliptical to Cardio

In [7]:
frame[frame["activityType"] == "Elliptical"]
frame.loc[384, "activityType"] = "Cardio"

Ignore Walking as an activity

In [8]:
frame.drop(frame[frame["activityType"] == "Walking"].index, axis=0, inplace=True)

Ignore Lap Swimming as an activity, as it has only 3 recorded activities

In [9]:
frame[frame["activityType"] == "Lap Swimming"].index
frame.drop(frame[frame["activityType"] == "Lap Swimming"].index, axis=0, inplace=True)

Activity type Strength Training was recorded as "Other" on older Garmin devices

In [10]:
# Drop a few miscellaneous rows first; the remaining "Other" activities are strength training sessions
frame.drop([214, 221, 223, 242], axis=0, inplace=True)
frame.activityType = frame.activityType.str.replace("Other", "Strength Training")

Activity types Indoor Rowing and Indoor Cycling are also folded into Cardio

In [11]:
frame.activityType = frame.activityType.str.replace("Indoor Rowing", "Cardio")
frame.activityType = frame.activityType.str.replace("Indoor Cycling", "Cardio")

Correcting Data Errors and Data Types

In [12]:
frame.loc[frame["activityType"] == "Strength Training", "actDistance"] = np.NaN
frame.loc[frame["activityType"] == "Strength Training", "elevationGain"] = np.NaN
frame.loc[frame["activityType"] == "Strength Training", "avgSpeed"] = np.NaN
frame.loc[frame["activityType"] == "Strength Training", "steps"] = np.NaN
frame.loc[frame["activityType"] == "Cardio", "actDistance"] = np.NaN
frame.loc[frame["activityType"] == "Cardio", "elevationGain"] = np.NaN
frame.loc[frame["activityType"] == "Cardio", "avgSpeed"] = np.NaN
frame.loc[frame["activityType"] == "Cardio", "steps"] = np.NaN
In [13]:
frame["avgHR"] = frame["avgHR"].replace(['--', '0'], np.NaN)
frame["maxHR"] = frame["maxHR"].replace(['--', '0'], np.NaN)
frame["steps"] = frame["steps"].replace(['--', '0'], np.NaN)
frame["steps"] = frame.steps.str.replace(",", "")
frame['avgHR'] = frame.avgHR.astype(float)
frame['maxHR'] = frame.maxHR.astype(float)
frame['steps'] = frame.steps.astype(float)
In [14]:
frame["elevationGain"] = frame["elevationGain"].replace(['--', '0', 0], np.NaN)
frame.loc[[456, 495], "elevationGain"] = 328
frame["elevationGain"] = frame.elevationGain.str.replace(",", "")
frame['elevationGain'] = frame.elevationGain.astype(float)
frame["avgSpeed"] = frame.avgSpeed.str.replace(":", ".")
frame["avgSpeed"] = frame.avgSpeed.str.replace("--.--", '0.0')
frame['avgSpeed'] = frame.avgSpeed.astype(float)
frame['actDistance'] = frame.actDistance.astype(float)
frame["calories"] = frame.calories.apply(lambda x: str(x).replace(",", ""))
frame['calories'] = frame.calories.astype(float)
In [15]:
frame.drop(frame[frame["actDistance"] < 0.01].index, axis=0, inplace=True)
frame.reset_index(inplace=True)
frame.drop("index", axis=1, inplace=True)
In [16]:
running_data = frame.copy()
running_data.to_csv("vis1.csv")

iv. Exploring the data and the datatypes

Initial Peek into the data

In [17]:
running_data.head()
Out[17]:
activityType startTime activityTime actDistance elevationGain avgSpeed avgHR maxHR steps calories
0 Strength Training Thu, 6 Apr 2017 10:43 PM 1:09:02 NaN NaN NaN 119.0 163.0 NaN 597.0
1 Strength Training Tue, 4 Apr 2017 10:36 PM 54:37 NaN NaN NaN 117.0 155.0 NaN 474.0
2 Strength Training Sun, 2 Apr 2017 10:46 PM 1:12:20 NaN NaN NaN 116.0 159.0 NaN 601.0
3 Strength Training Tue, 28 Mar 2017 11:02 PM 42:11 NaN NaN NaN 99.0 142.0 NaN 257.0
4 Strength Training Mon, 27 Mar 2017 11:17 PM 45:42 NaN NaN NaN 120.0 165.0 NaN 412.0

v. Data Transformations

Creating Derived Fields that can be used for the visualizations. The transformations have been highlighted.

Converting startTime to pandas datetime format to use the built-in datetime functions

In [18]:
running_data["startTime"] = pd.to_datetime(running_data["startTime"])
#running_data.head()

Adding the month and day of the week for every activity to the DataFrame

In [19]:
running_data["activityMonth"] = running_data.startTime.dt.month
running_data["activityDay"] = running_data.startTime.dt.dayofweek
In [20]:
running_data["activityMonthName"] = running_data.activityMonth.map({1:"January", 2:"February", 3:"March", 4:"April",
                                                                   5:"May", 6:"June", 7:"July", 8:"August", 
                                                                   9:"September", 10:"October", 11:"November", 12:"December"})
In [21]:
running_data["activityDayName"] = running_data.activityDay.map({0:"Monday", 1:"Tuesday", 2:"Wednesday", 3:"Thursday",
                                                                   4:"Friday", 5:"Saturday", 6:"Sunday"})

Obtaining Activity Levels using the average HR field for all the workouts

In [22]:
# Heart rate zones used to label each activity:
#   < 114     - Very Light
#   114 - 133 - Light
#   133 - 152 - Moderate
#   152 - 171 - Hard
#   >= 171    - Very Hard

def get_hr_zones(avgHR):
    if math.isnan(avgHR):
        return "Not Recorded"
    if avgHR < 114:
        return "Very Light"
    elif avgHR < 133:
        return "Light"
    elif avgHR < 152:
        return "Moderate"
    elif avgHR < 171:
        return "Hard"
    else:
        return "Very Hard"
In [23]:
running_data["activityLevel"] = running_data.avgHR.apply(get_hr_zones)

Calculating total minutes of an activity from the data

In [24]:
def getMinutes(activityTime):
    # Convert "MM:SS" or "H:MM:SS" strings to minutes. The seconds are kept
    # as the digits after the decimal point (54:37 -> 54.37), an
    # approximation rather than an exact base-60 conversion.
    curr_time = activityTime.split(":")
    if len(curr_time) == 2:
        final_time = curr_time[0] + "." + curr_time[1]
    else:
        mins = int(curr_time[0]) * 60 + int(curr_time[1])
        final_time = str(mins) + "." + curr_time[2]
    return final_time
In [25]:
running_data["activityMins"] = running_data.activityTime.apply(getMinutes)
running_data["activityMins"] = running_data["activityMins"].astype("float")
In [26]:
running_data.drop("activityTime", axis=1, inplace=True)
#running_data.head()
In [27]:
#running_data.head()
running_data.to_csv("vis2.csv")

Checking final data types

In [28]:
running_data.dtypes
Out[28]:
activityType                 object
startTime            datetime64[ns]
actDistance                 float64
elevationGain               float64
avgSpeed                    float64
avgHR                       float64
maxHR                       float64
steps                       float64
calories                    float64
activityMonth                 int64
activityDay                   int64
activityMonthName            object
activityDayName              object
activityLevel                object
activityMins                float64
dtype: object

Creating lists of the continuous and categorical variables

In [29]:
categorical_vars = running_data.describe(include=["object"]).columns  # object-typed columns
continuous_vars = running_data.describe().columns                     # numeric columns

vi. Summary of Continuous Variables and Missing Data Analysis

The summary of continuous variables gives, for each variable, the count of activities where a value is present. It helps us gauge the number of missing values per variable and begin the analysis of missing data.

In [30]:
running_data.describe()
Out[30]:
actDistance elevationGain avgSpeed avgHR maxHR steps calories activityMonth activityDay activityMins
count 211.000000 80.000000 211.000000 336.000000 336.000000 82.000000 494.000000 494.000000 494.000000 494.000000
mean 5.018436 208.725000 9.970711 123.291667 157.639881 8519.463415 585.524291 6.447368 3.192308 63.260020
std 4.160373 284.234659 3.127073 17.777179 14.814820 8494.509896 455.932193 3.712304 2.070135 42.252763
min 0.240000 3.000000 0.000000 77.000000 81.000000 404.000000 0.000000 1.000000 0.000000 1.270000
25% 2.470000 51.500000 8.580000 111.000000 150.000000 3783.500000 244.500000 3.000000 1.000000 28.437500
50% 3.550000 112.500000 9.470000 121.000000 159.000000 4753.000000 482.500000 7.000000 3.000000 58.560000
75% 6.300000 257.500000 11.005000 135.000000 168.000000 9205.000000 816.500000 10.000000 5.000000 91.367500
max 26.340000 1943.000000 43.370000 168.000000 188.000000 42808.000000 3362.000000 12.000000 6.000000 280.590000

Looking at the counts for all the continuous variables, we observe that out of 494 activities, 211 had distance recorded, 80 had elevation gain recorded, average speed was recorded for 211 activities, and average and max heart rate were recorded for 336 activities.

Calories, activity month, activity day and activity minutes were recorded for all the activities.

Some of the reasons for missing data are as follows:

  1. During the span of 5 years, I have gone through 4 wearable devices. Initially I used the Garmin Vivofit tracker for all the runs, which did not have a heart rate monitor.
  2. After that I bought the Garmin Forerunner 920 (inbuilt GPS), which came with a heart rate strap and also calculated elevation gain and many more parameters for all the runs. I do admit I was often too lazy to use the heart rate strap, as it needed to be washed after every run.
  3. After that I used the Garmin Vivofit2 tracker, which had an inbuilt heart rate monitor but lacked GPS, so runs were still tracked by the Forerunner 920, and activities tracked with the Vivofit2 lack heart rate information.
  4. Currently I am using the Garmin Vivoactive HR, which has a wrist HR monitor as well as inbuilt GPS, but being in graduate school, the frequency of my runs has gone down significantly.

All the variables in the analysis are important, but let's start with an initial missing data analysis with visualizations.
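Before the R visualizations, the missing counts can also be cross-checked directly in pandas; this one-liner (not in the original notebook) reproduces the table that the VIM plot prints further below:

running_data[continuous_vars].isnull().sum().sort_values(ascending=False)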

Visualization: Missing Data 1

In [31]:
running_data_continuous = running_data[running_data.describe().columns].copy()
In [32]:
%%R -i running_data_continuous -w 900 -h 480 -u px

library(vcd)
library(dplyr)
library(readr)
library(grid) # for gpar
library(RColorBrewer)
library(scales)
library(knitr)
library(mi)

image(missing_data.frame(running_data_continuous))

rm(list = ls())
NOTE: The following pairs of variables appear to have the same missingness pattern.
 Please verify whether they are in fact logically distinct variables.
     [,1]    [,2]   
[1,] "avgHR" "maxHR"

Visualization: Missing Data 2

In [33]:
%%R -i running_data_continuous -w 900 -h 480 -u px

library(vcd)
library(dplyr)
library(grid) # for gpar
library(RColorBrewer)
library("VIM")
library("mice")
library(lattice)

aggr_plot <- aggr(running_data_continuous, col=c('skyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(running_data_continuous), cex.axis=.7, gap=3, ylab=c("Heatmap of missing data"), combined= TRUE)
rm(list = ls())
 Variables sorted by number of missings: 
      Variable Count
 elevationGain   414
         steps   412
   actDistance   283
      avgSpeed   283
         avgHR   158
         maxHR   158
      calories     0
 activityMonth     0
   activityDay     0
  activityMins     0

Looking at the above visualization, the following patterns emerge:

  1. Average and maximum heart rate values are missing for the activities tracked with a fitness tracker that had no heart rate tracking ability.
  2. Statistics like elevation gain were only tracked when the heart rate strap was attached, which, being lazy, I didn't use often; this leads to the highest number of missing values, which is why elevation gain will not be considered an important feature in the analysis.
  3. The first-generation fitness trackers (Garmin Vivofit 1 and 2), even though they could record running distance based on the number of steps taken, still did not record the other metrics like average speed and heart rate.
  4. Running distance values are missing for all Strength Training workouts. Where a value had been recorded for this field during a strength session (from the steps taken), it was ignored and replaced with NaN in my analysis.
  5. Steps were recorded only by the most recent fitness trackers, and therefore have a lot of missing values. For activities not involving running, steps were not considered and were replaced with NaN.

From the above analysis, the most important features for further analysis are Activity Minutes, Activity Month, Activity Day, Calories Burnt, Average Heart Rate, Maximum Heart Rate and Activity Distance. The remaining features, namely Average Speed, Steps and Elevation Gain, have a higher number of missing values and will not be treated as important in the analysis that follows, even though Average Speed and Steps could help me analyze the runs and find some correlations.

Visualizing Distribution of Continuous Variables

In [34]:
_ = running_data.hist(column=continuous_vars, figsize = (16,16))

vii. Visualizing Distribution of Categorical Variables

In [35]:
print(categorical_vars)
Index(['activityType', 'activityMonthName', 'activityDayName',
       'activityLevel'],
      dtype='object')
In [36]:
# Count plots of categorical variables

fig, axes = plt.subplots(4, 3, figsize=(16, 16))
plt.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=0.7, hspace=0.3)

for i, ax in enumerate(axes.ravel()):
    if i > 3:
        ax.set_visible(False)
        continue
    sns.countplot(y = categorical_vars[i], data=running_data, ax=ax)

viii. Multivariate Distributions for Continuous Data

In [37]:
_ = pd.scatter_matrix(running_data, alpha=0.2, figsize=(30, 30))

Analysis of the above visualization:

A scatterplot matrix is a great way to roughly determine whether there are linear correlations between multiple variables, and it is particularly helpful for pinpointing pairs of variables with similar relationships across the dataset. Some observations (a numeric cross-check follows below):

  • Calories burnt show an approximately linear correlation with the minutes of an activity and with the steps taken.
  • Heteroskedasticity is observed between calories and activityMins, and between activityMins and elevationGain.
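These visual impressions can be quantified with a plain Pearson correlation matrix over the continuous columns; a quick sketch, not in the original notebook:

running_data[continuous_vars].corr()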

IV. Executive Summary

I have been an avid runner and fitness enthusiast for the past 5 years, and have been capturing data about my training activities. One of the main motivations for this analysis was to gain insights into my running patterns, my strength training schedules, and anything else the data from my training sessions can reveal.

During the span of five years, I have gone through 4 wearable devices. Initially I used the Garmin Vivofit tracker for all the runs, which did not have a heart rate monitor. After that I used the Garmin Forerunner 920 (inbuilt GPS), which came with a heart rate strap and also calculated elevation gain and many more parameters for all the runs. The third tracker was the Garmin Vivofit2, which had an inbuilt heart rate monitor but lacked GPS, so runs were still tracked by the Forerunner 920, and activities tracked with the Vivofit2 lack heart rate information. Currently I am using the Garmin Vivoactive HR, which has a wrist HR monitor as well as inbuilt GPS.

In this summary, I will highlight the following main insights from the analysis:

  • At what hour of the day do I perform my activities, and how has that varied over the last 5 years?
  • How has the total activity time varied for every month of the past 5 years?
  • How has the running mileage varied over the last 5 years, and how does it compare to the activity times?

For the analysis of the hour of the day at which I perform my activities, I created the following interactive visualization using D3.js: Visualization Link (inspired by learnings from the course Storytelling with Data) Code


About the visualization: In this visualization, created using D3.js, the activities are plotted as a scatter plot over the time range 2012 to 2017. Since the activities span 5 years, I decided not to keep tick labels on the x-axis, but added a date range slider below it for interactivity (a date range can be chosen). The y-axis shows the time of day (0-24 hours), and the activities are plotted as points. Interactive buttons can be used to select any single activity type, like running or cycling, so that only its patterns are shown for a given date range.
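For reference, the hour-of-day data behind this scatter can be derived from the cleaned frame along the following lines; this is a sketch, with `d3_points` and the output filename being hypothetical, and the actual export used for the D3 page may differ:

d3_points = running_data[["startTime", "activityType"]].copy()
d3_points["hourOfDay"] = (running_data.startTime.dt.hour
                          + running_data.startTime.dt.minute / 60.0)
d3_points.to_csv("activity_hours.csv", index=False)  # hypothetical export name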

The main insights that can be obtained from the visualization are as follows:

  • Most of the activity sessions over the past 5 years have been in the evening, which clearly shows that I am not really a morning person.
  • Most of the morning sessions were runs: the practice runs done while training for marathons.
  • Since moving to New York in August 2016 (the rightmost part of the graph), I have mostly been doing strength sessions late at night, around 22:00, plus some really late night cycle rides along the riverside, which are clearly visible in the visualization. The increase in cycling sessions after moving to New York can be attributed to my annual Citibike membership and its daily use during the winter vacation after the Fall 2016 semester.
  • My fitness tracking journey started with mostly tracking runs (the leftmost part of the chart consists mostly of green dots), and as I bought better fitness trackers I slowly started capturing more data about the varied activities I did.
  • Looking at this visualization, I also realize that recently I have been missing the feel-good factor after my long runs, and need to get back to running soon. It is clearly visible that I have mostly been focusing on strength training sessions lately (the rightmost pink clusters).

To understand how the total activity time varied across specific months over the last 5 years, I created the following D3 visualization: Visualization Link (inspired by learnings from the course Storytelling with Data) Code


About the visualization: In this visualization, created using D3.js, the total aggregated minutes across activity types are plotted on the y-axis, showing the variation across months for the years 2012-2016. The trend lines are annotated for every year in the analysis. Hovering over the visualization highlights the closest data point and annotates it with the minutes spent in that month. I am using curved lines, which help convey the underlying variations in the data.
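The underlying series is a straightforward aggregation of activity minutes by year and month; a sketch of how it can be computed (the actual export code for the D3 page may differ):

activity_by_month = (running_data
                     .groupby([running_data.startTime.dt.year, "activityMonth"])["activityMins"]
                     .sum())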

The main insights that can be obtained from the visualization are as follows:

  • Tracking of workout sessions began in 2012, with about 200 minutes of activity captured during October 2012, but I was not yet much into tracking at that time.
  • During 2013, I tracked only my running activities, so most of the activity minutes for the months of 2013 come from runs.
  • During 2014, my running journey took off: I joined the Gurgaon Road Runners group and got inspired to track my runs more seriously. Even though I was still only capturing runs that year, I logged a decent running time during most months from March to May. Time spent running dropped because of the extreme heat in India during June and July, and then I started training for a half marathon, so the activity minutes picked up again in the closing months of 2014.
  • In 2015, I ran a half marathon in January and then had a cool-down period. Thanks to the more advanced fitness trackers bought during 2015, from June onwards I started capturing my strength training sessions along with the running sessions. I also began preparing for a half marathon in December and a full marathon in January, which is why the recorded activity minutes are high for the second half of 2015.
  • In 2016, I was running regularly and also training regularly in the gym to improve my runs, which is why the recorded activity is so high for that year. I ran my first full marathon in January 2016, hence the high activity recorded for that month. The reduction in activity minutes as 2016 progressed can be attributed to joining graduate school in NY, where I mostly recorded strength training sessions.

Even though the activity minutes reduced as 2016 progressed, I was still capturing all my training sessions other than running, and have decent activity minutes recorded. How much graduate school really impacted my running schedule will be clearly visible in the following visualization of just the running kilometers tracked for the years 2014-2016.

Finally, let's look at the running kilometers tracked for every month of the years 2014-2016. The result is shown in the following visualization created using D3.js: Visualization Link (inspired by learnings from the course Storytelling with Data) Code


About the visualization: In this visualization, created using D3.js, the running mileage is plotted for every month of the years 2014-2016, with mileage on the y-axis and months on the x-axis. I am only showing these years because they are when I was consistently capturing my running activities. The trend lines are annotated for every year. Hovering over the visualization highlights the closest data point and annotates it with the total kilometers run that month. As before, I am using curved lines, which help convey the underlying variations in the data.

The main insights that can be obtained from the visualization are as follows:

  • For 2014, the running mileage mostly followed the running season in India. The season starts around August and finishes with the Delhi Marathon and the Mumbai Marathon in December and January respectively, which is why more kilometers were run during those times of the year.
  • For 2015, the period after the Mumbai Marathon in January is when you rest and condition your body for the next running season. Even though the previous plot shows very high recorded activity time for 2015, the actual running mileage stays low until the season begins around August-September. As I was preparing for my first full marathon in January 2016, the higher kilometers in November and December 2015 are due to the long practice runs for it.
  • For 2016, the running kilometers are high in January, when I was doing practice runs and also ran my first full marathon. After that, I kept to about one long run a week for conditioning, alongside the post-season rest. In July-August 2016 I prepared for and made the move to NY for graduate school. During my first months in NY I was still logging kilometers, thanks to Central Park being near my house and favorable weather, but as the semester proceeded, the workload increased and temperatures turned extreme, so the monthly mileage dropped to less than 10 km, which is insignificant compared to previous years' peak running months, i.e., November and December. But I hope to be back to running soon; doing this analysis made me realize how important running is to my happiness.

We will now proceed to the section where we try to extract more trends from the data, analyze the distributions of the important variables in the dataset, and draw lessons from all the visualizations created from the data.

V. Main Data Analysis and Visualizations

In this section, we will analyze the important features identified through the data quality analysis, namely Activity Minutes, Activity Month, Activity Day and Calories Burnt.

i. Analysis of the feature - Activity Minutes

a. Distribution histogram with overlaid density plots, at varying binwidths

In [41]:
%%R -i running_data -w 900 -h 480 -u px

require("ggplot2")

g1 <- ggplot(running_data, aes(x=activityMins)) + 
    geom_histogram(aes(y=..count..),      # histogram of counts on the y-axis
                   binwidth=1,
                   colour="black", fill="white") +
    geom_density(aes(y=..count..), alpha=.2, fill="#FF6666") +  # overlay a transparent density estimate, scaled to counts
    ylab("Count") +
    xlab("Activity Minutes") +
    ggtitle("Activity Minutes overlayed with density estimate : Binwidth 1") 
g1
In [42]:
%%R -i running_data -w 900 -h 480 -u px

require("ggplot2")

multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  library(grid)

  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)

  numPlots = length(plots)

  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                    ncol = cols, nrow = ceiling(numPlots/cols))
  }

 if (numPlots==1) {
    print(plots[[1]])

  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))

    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))

      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}
g1 <- ggplot(running_data, aes(x=activityMins)) + 
    geom_histogram(aes(y=..count..),      # histogram of counts on the y-axis
                   binwidth=1,
                   colour="black", fill="white") +
    geom_density(aes(y=..count..), alpha=.2, fill="#FF6666") +  # overlay a transparent density estimate, scaled to counts
    ylab("Count") +
    xlab("Activity Minutes") +
    ggtitle("Activity Minutes overlayed with density estimate : Binwidth 1")

g2 <- ggplot(running_data, aes(x=activityMins)) + 
    geom_histogram(aes(y=..count..),      # histogram of counts on the y-axis
                   binwidth=2,
                   colour="black", fill="white") +
    geom_density(aes(y=..count..), alpha=.2, fill="#FF6666") +  # overlay a transparent density estimate, scaled to counts
    ylab("Count") +
    xlab("Activity Minutes") +
    ggtitle("Activity Minutes overlayed with density estimate : Binwidth 2")

g3 <- ggplot(running_data, aes(x=activityMins)) + 
    geom_histogram(aes(y=..count..),      # histogram of counts on the y-axis
                   binwidth=4,
                   colour="black", fill="white") +
    geom_density(aes(y=..count..), alpha=.2, fill="#FF6666") +  # overlay a transparent density estimate, scaled to counts
    ylab("Count") +
    xlab("Activity Minutes") +
    ggtitle("Activity Minutes overlayed with density estimate : Binwidth 4")

g4 <- ggplot(running_data, aes(x=activityMins)) + 
    geom_histogram(aes(y=..count..),      # histogram of counts on the y-axis
                   binwidth=6,
                   colour="black", fill="white") +
    geom_density(aes(y=..count..), alpha=.2, fill="#FF6666") +  # overlay a transparent density estimate, scaled to counts
    ylab("Count") +
    xlab("Activity Minutes") +
    ggtitle("Activity Minutes overlayed with density estimate : Binwidth 6")

multiplot(g1, g2, g3, g4, cols=2)

Analysis of the above visualization:

  1. The plot above shows the distribution of activity minutes on the x-axis, with the count of activities at each duration on the y-axis.
  2. The distribution is clearly multi-modal and skewed to the right. Varying the binwidths produces similar shapes, confirming the underlying pattern of the distribution.
  3. Reasons for the right skew include the practice runs while training for my first full marathon in 2016, the marathon itself, and instances when I forgot to stop the timer while recording activities.
  4. The activity minutes data was mostly clean, except for instances when I forgot to turn off the fitness tracker, resulting in abnormally long workouts; those were handled during the data cleaning process. (A pure-Python version of this plot is sketched below.)
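The same histogram-plus-density view can also be reproduced without leaving Python, using the seaborn already imported above. This sketch uses distplot, which was current in the seaborn of this notebook's era (newer seaborn versions replace it with histplot/displot):

_ = sns.distplot(running_data["activityMins"], bins=100, kde=True)
plt.xlabel("Activity Minutes")
plt.ylabel("Density")
plt.title("Activity Minutes with density estimate (seaborn)")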

b. Distribution of Activity Minutes across various Activities

In [43]:
%%R -i running_data -w 900 -h 480 -u px

require("ggplot2")
require("viridis")

g1 <- ggplot(running_data, aes(x=activityMins, fill=activityType)) + 
    geom_density(alpha=0.5, adjust = 1, na.rm = TRUE) +   # Overlay with transparent density plot
    ylab("Density") +
    xlab("Activity Minutes") + scale_fill_viridis(discrete=TRUE) +
    ggtitle("Distribution of Activity Minutes for the different Activity Types")
g1