I. Introduction

Insights into my fitness journey (2012 - 17)

I have been an avid runner and fitness enthusiast for the past 5 years, and a fun part of that journey has been discovering how much we can learn about ourselves by exploring the data we capture.

I started capturing data about myself around 2012, when wearable devices first appeared, and have always wanted to carry out an analysis to gain insights into my running patterns, my strength training schedules, my diet logs, and any other metadata I can collect about myself, such as heart rate, cadence for runs, and step counts.

About Quantified Self: While learning about ways to explore the data you capture, I came across the term Quantified Self and have been following the blog ever since. It is amazing how capturing and analyzing data about yourself can help you make data-driven decisions to improve your lifestyle, and many people are joining this movement to bring data-driven change to their lives.

There are several questions I would like to answer in this project's analysis.

I take inspiration from the Quantified Self running archives and will try to analyze all of these aspects of my data.

Where can the reader find the data? I would like to share the data captured about all my activities in case someone wants to analyze it. The data used for this project is hosted on Github at this link. The code file has also been hosted on Github and can be accessed at this link. You can always mail me at anujk3@gmail.com for the most up-to-date data.

II. Team

I will be working by myself on my dataset and gaining insights through the analysis I do. The stages of the project have been highlighted below:

  • I had to go to the activities page on Garmin Connect and download around 22 pages of data one by one, each in CSV format. Sadly, Garmin does not offer an API like Fitbit does, which makes it difficult to gather data for analysis.
  • After getting the data from Garmin, the next step was combining all the CSV files and converting them into a dataframe that could be analyzed further. All the processing was done in Python using pandas.
  • The initial processing involved cleaning the data: removing extraneous activities like walking and swimming, which I haven't tracked much, correcting data errors, and converting datetime fields for ease of datetime analysis.
  • After the initial cleaning, I focused on data transformations, adding derived columns that would be useful for exploratory data analysis.
  • The next part focused on an analysis of missing values, followed by univariate analysis of the distributions of the continuous columns and multivariate analysis of the continuous variables in the dataset.
  • The final part was figuring out the important questions I can answer using the dataset and working on specific problems to come up with relevant visualizations that explain them.

Programming Language Used:

Data analysis is something I am really passionate about, and Python has been my go-to language for it. Through the course Exploratory Data Analysis and Visualizations, Prof. Joyce Robbins helped me realize how R can also help in quickly analyzing data; the best part of R for me was getting to learn ggplot.

I still prefer Python as my language for data analysis, but I hacked my way into using ggplot within the Python analysis: most of the dataframe work is done in Python, with plots made in ggplot, all within a single Python notebook.

I will also be working on interactive visualizations using D3.js, and, wherever possible, matplotlib and seaborn within Python.
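To make that workflow concrete, here is a minimal sketch of the rpy2 pattern used throughout this notebook to hand a pandas DataFrame to ggplot; `df` and `someColumn` are placeholder names, and the actual setup appears in the next section.

%reload_ext rpy2.ipython

%%R -i df -w 900 -h 480 -u px
library(ggplot2)
ggplot(df, aes(x = someColumn)) + geom_histogram()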

III. Analysis of Data Quality

i. Import Statements and Setup

In [1]:
import math
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# rpy2 bridges pandas DataFrames to R, so later cells can use %%R with ggplot
from rpy2.robjects import pandas2ri
pandas2ri.activate()

%reload_ext rpy2.ipython
%matplotlib inline

warnings.filterwarnings('ignore')
plt.rcParams["figure.figsize"] = (20, 20)
pd.options.display.max_rows = 999
sns.set_style("white")

ii. Load Data

In [3]:
path = "data/a"

frame = pd.DataFrame()
list_ = []

for i in range(1, 27):
    file_path = path + str(i)+ ".csv"
    #print(file_path)
    df = pd.read_csv(file_path, index_col=None, header=0, skiprows=2)
    list_.append(df)
    frame = pd.concat(list_)
    
frame.reset_index(inplace=True)
frame.drop("index", axis=1, inplace=True)
frame.drop(" Favorite", axis=1, inplace=True)
frame.drop("Unnamed: 11", axis=1, inplace=True)
In [4]:
#frame.columns
In [5]:
# Assign short, consistent column names
col_names = ['activityType', 'startTime', 'activityTime', 'actDistance', 'elevationGain', 'avgSpeed', 'avgHR', 'maxHR', 
            'steps', 'calories']
frame.columns = col_names

iii. Data Cleaning

Data cleaning was performed in the following steps, to ease the process of creating the visualizations later.

Ignore the swimming activity (only 1 row)

In [6]:
frame.drop(450, axis=0, inplace=True)  # row 450 holds the lone swimming activity

Convert Elliptical to Cardio

In [7]:
frame[frame["activityType"] == "Elliptical"]
frame.loc[384, "activityType"] = "Cardio"

Ignore Walking as an activity

In [8]:
frame.drop(frame[frame["activityType"] == "Walking"].index, axis=0, inplace=True)

Ignore Lap Swimming as an activity, as it has only 3 recorded activities

In [9]:
frame[frame["activityType"] == "Lap Swimming"].index
frame.drop(frame[frame["activityType"] == "Lap Swimming"].index, axis=0, inplace=True)

Activity type Strength Training was recorded as "Other" on older Garmin devices

In [10]:
# Drop a few miscellaneous rows first; the remaining "Other" activities are strength training sessions
frame.drop([214, 221, 223, 242], axis=0, inplace=True)
frame.activityType = frame.activityType.str.replace("Other", "Strength Training")

Activity types Indoor Rowing and Indoor Cycling are also folded into Cardio

In [11]:
frame.activityType = frame.activityType.str.replace("Indoor Rowing", "Cardio")
frame.activityType = frame.activityType.str.replace("Indoor Cycling", "Cardio")

Correcting Data Errors and Data Types

In [12]:
frame.loc[frame["activityType"] == "Strength Training", "actDistance"] = np.NaN
frame.loc[frame["activityType"] == "Strength Training", "elevationGain"] = np.NaN
frame.loc[frame["activityType"] == "Strength Training", "avgSpeed"] = np.NaN
frame.loc[frame["activityType"] == "Strength Training", "steps"] = np.NaN
frame.loc[frame["activityType"] == "Cardio", "actDistance"] = np.NaN
frame.loc[frame["activityType"] == "Cardio", "elevationGain"] = np.NaN
frame.loc[frame["activityType"] == "Cardio", "avgSpeed"] = np.NaN
frame.loc[frame["activityType"] == "Cardio", "steps"] = np.NaN
In [13]:
frame["avgHR"] = frame["avgHR"].replace(['--', '0'], np.NaN)
frame["maxHR"] = frame["maxHR"].replace(['--', '0'], np.NaN)
frame["steps"] = frame["steps"].replace(['--', '0'], np.NaN)
frame["steps"] = frame.steps.str.replace(",", "")
frame['avgHR'] = frame.avgHR.astype(float)
frame['maxHR'] = frame.maxHR.astype(float)
frame['steps'] = frame.steps.astype(float)
In [14]:
frame["elevationGain"] = frame["elevationGain"].replace(['--', '0', 0], np.NaN)
frame.loc[[456, 495], "elevationGain"] = 328
frame["elevationGain"] = frame.elevationGain.str.replace(",", "")
frame['elevationGain'] = frame.elevationGain.astype(float)
frame["avgSpeed"] = frame.avgSpeed.str.replace(":", ".")
frame["avgSpeed"] = frame.avgSpeed.str.replace("--.--", '0.0')
frame['avgSpeed'] = frame.avgSpeed.astype(float)
frame['actDistance'] = frame.actDistance.astype(float)
frame["calories"] = frame.calories.apply(lambda x: str(x).replace(",", ""))
frame['calories'] = frame.calories.astype(float)
In [15]:
frame.drop(frame[frame["actDistance"] < 0.01].index, axis=0, inplace=True)
frame.reset_index(inplace=True)
frame.drop("index", axis=1, inplace=True)
In [16]:
running_data = frame.copy()
running_data.to_csv("vis1.csv")

iv. Exploring the data and the datatypes

Initial Peek into the data

In [17]:
running_data.head()
Out[17]:
activityType startTime activityTime actDistance elevationGain avgSpeed avgHR maxHR steps calories
0 Strength Training Thu, 6 Apr 2017 10:43 PM 1:09:02 NaN NaN NaN 119.0 163.0 NaN 597.0
1 Strength Training Tue, 4 Apr 2017 10:36 PM 54:37 NaN NaN NaN 117.0 155.0 NaN 474.0
2 Strength Training Sun, 2 Apr 2017 10:46 PM 1:12:20 NaN NaN NaN 116.0 159.0 NaN 601.0
3 Strength Training Tue, 28 Mar 2017 11:02 PM 42:11 NaN NaN NaN 99.0 142.0 NaN 257.0
4 Strength Training Mon, 27 Mar 2017 11:17 PM 45:42 NaN NaN NaN 120.0 165.0 NaN 412.0

v. Data Transformations

Creating Derived Fields that can be used for the visualizations. The transformations have been highlighted.

Converting startTime to pandas datetime format to use the built-in datetime functions

In [18]:
running_data["startTime"] = pd.to_datetime(running_data["startTime"])
#running_data.head()

Adding the month and day of the week for every activity to the DataFrame

In [19]:
running_data["activityMonth"] = running_data.startTime.dt.month
running_data["activityDay"] = running_data.startTime.dt.dayofweek
In [20]:
running_data["activityMonthName"] = running_data.activityMonth.map({1:"January", 2:"February", 3:"March", 4:"April",
                                                                   5:"May", 6:"June", 7:"July", 8:"August", 
                                                                   9:"September", 10:"October", 11:"November", 12:"December"})
In [21]:
running_data["activityDayName"] = running_data.activityDay.map({0:"Monday", 1:"Tuesday", 2:"Wednesday", 3:"Thursday",
                                                                   4:"Friday", 5:"Saturday", 6:"Sunday"})

Obtaining Activity Levels using the average HR field for all the workouts

In [22]:
# Heart rate zones used to label each activity:
#   < 114     - Very Light
#   114 - 133 - Light
#   133 - 152 - Moderate
#   152 - 171 - Hard
#   >= 171    - Very Hard

def get_hr_zones(avgHR):
    if math.isnan(avgHR):
        return "Not Recorded"
    if avgHR < 114:
        return "Very Light"
    elif avgHR < 133:
        return "Light"
    elif avgHR < 152:
        return "Moderate"
    elif avgHR < 171:
        return "Hard"
    else:
        return "Very Hard"
In [23]:
running_data["activityLevel"] = running_data.avgHR.apply(get_hr_zones)

Calculating total minutes of an activity from the data

In [24]:
def getMinutes(activityTime):
    # Convert "MM:SS" or "H:MM:SS" strings to minutes. The seconds are kept
    # as the digits after the decimal point (54:37 -> 54.37), an
    # approximation rather than an exact base-60 conversion.
    curr_time = activityTime.split(":")
    if len(curr_time) == 2:
        final_time = curr_time[0] + "." + curr_time[1]
    else:
        mins = int(curr_time[0]) * 60 + int(curr_time[1])
        final_time = str(mins) + "." + curr_time[2]
    return final_time
In [25]:
running_data["activityMins"] = running_data.activityTime.apply(getMinutes)
running_data["activityMins"] = running_data["activityMins"].astype("float")
In [26]:
running_data.drop("activityTime", axis=1, inplace=True)
#running_data.head()
In [27]:
#running_data.head()
running_data.to_csv("vis2.csv")

Checking final data types

In [28]:
running_data.dtypes
Out[28]:
activityType                 object
startTime            datetime64[ns]
actDistance                 float64
elevationGain               float64
avgSpeed                    float64
avgHR                       float64
maxHR                       float64
steps                       float64
calories                    float64
activityMonth                 int64
activityDay                   int64
activityMonthName            object
activityDayName              object
activityLevel                object
activityMins                float64
dtype: object

Creating lists of the continuous and categorical variables

In [29]:
categorical_vars = running_data.describe(include=["object"]).columns  # object-typed columns
continuous_vars = running_data.describe().columns                     # numeric columns

vi. Summary of Continuous Variables and Missing Data Analysis

The summary of continuous variables gives, for each variable, the count of activities where a value is present. It helps us gauge the number of missing values per variable and begin the analysis of missing data.

In [30]:
running_data.describe()
Out[30]:
actDistance elevationGain avgSpeed avgHR maxHR steps calories activityMonth activityDay activityMins
count 211.000000 80.000000 211.000000 336.000000 336.000000 82.000000 494.000000 494.000000 494.000000 494.000000
mean 5.018436 208.725000 9.970711 123.291667 157.639881 8519.463415 585.524291 6.447368 3.192308 63.260020
std 4.160373 284.234659 3.127073 17.777179 14.814820 8494.509896 455.932193 3.712304 2.070135 42.252763
min 0.240000 3.000000 0.000000 77.000000 81.000000 404.000000 0.000000 1.000000 0.000000 1.270000
25% 2.470000 51.500000 8.580000 111.000000 150.000000 3783.500000 244.500000 3.000000 1.000000 28.437500
50% 3.550000 112.500000 9.470000 121.000000 159.000000 4753.000000 482.500000 7.000000 3.000000 58.560000
75% 6.300000 257.500000 11.005000 135.000000 168.000000 9205.000000 816.500000 10.000000 5.000000 91.367500
max 26.340000 1943.000000 43.370000 168.000000 188.000000 42808.000000 3362.000000 12.000000 6.000000 280.590000

Looking at the counts for all the continuous variables, we observe that out of 494 activities, 211 had distance recorded, 80 had elevation gain recorded, average speed was recorded for 211 activities, and average and max heart rate were recorded for 336 activities.

Calories, activity month, activity day and activity minutes were recorded for all the activities.

Some of the reasons for missing data are as follows:

  1. During the span of 5 years, I have gone through 4 wearable devices. Initially I used the Garmin Vivofit tracker for all the runs, which did not have a heart rate monitor.
  2. After that I bought the Garmin Forerunner 920 (inbuilt GPS), which came with a heart rate strap and also calculated elevation gain and many more parameters for all the runs. I do admit I was often too lazy to use the heart rate strap, as it needed to be washed after every run.
  3. After that I used the Garmin Vivofit2 tracker, which had an inbuilt heart rate monitor but lacked GPS, so runs were still tracked by the Forerunner 920, and activities tracked with the Vivofit2 lack heart rate information.
  4. Currently I am using the Garmin Vivoactive HR, which has a wrist HR monitor as well as inbuilt GPS, but being in graduate school, the frequency of my runs has gone down significantly.

All the variables in the analysis are important, but let's start with an initial missing data analysis with visualizations.
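Before the R visualizations, the missing counts can also be cross-checked directly in pandas; this one-liner (not in the original notebook) reproduces the table that the VIM plot prints further below:

running_data[continuous_vars].isnull().sum().sort_values(ascending=False)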

Visualization: Missing Data 1

In [31]:
running_data_continuous = running_data[running_data.describe().columns].copy()
In [32]:
%%R -i running_data_continuous -w 900 -h 480 -u px

library(vcd)
library(dplyr)
library(readr)
library(grid) # for gpar
library(RColorBrewer)
library(scales)
library(knitr)
library(mi)

image(missing_data.frame(running_data_continuous))

rm(list = ls())
NOTE: The following pairs of variables appear to have the same missingness pattern.
 Please verify whether they are in fact logically distinct variables.
     [,1]    [,2]   
[1,] "avgHR" "maxHR"

Visualization: Missing Data 2

In [33]:
%%R -i running_data_continuous -w 900 -h 480 -u px

library(vcd)
library(dplyr)
library(grid) # for gpar
library(RColorBrewer)
library("VIM")
library("mice")
library(lattice)

aggr_plot <- aggr(running_data_continuous, col=c('skyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(running_data_continuous), cex.axis=.7, gap=3, ylab=c("Heatmap of missing data"), combined= TRUE)
rm(list = ls())
 Variables sorted by number of missings: 
      Variable Count
 elevationGain   414
         steps   412
   actDistance   283
      avgSpeed   283
         avgHR   158
         maxHR   158
      calories     0
 activityMonth     0
   activityDay     0
  activityMins     0

Looking at the above visualization, the following patterns emerge:

  1. Average and maximum heart rate values are missing for the activities tracked with a fitness tracker that had no heart rate tracking ability.
  2. Statistics like elevation gain were only tracked when the heart rate strap was attached, which, being lazy, I didn't use often; this leads to the highest number of missing values, which is why elevation gain will not be considered an important feature in the analysis.
  3. The first-generation fitness trackers (Garmin Vivofit 1 and 2), even though they could record running distance based on the number of steps taken, still did not record the other metrics like average speed and heart rate.
  4. Running distance values are missing for all Strength Training workouts. Where a value had been recorded for this field during a strength session (from the steps taken), it was ignored and replaced with NaN in my analysis.
  5. Steps were recorded only by the most recent fitness trackers, and therefore have a lot of missing values. For activities not involving running, steps were not considered and were replaced with NaN.

From the above analysis, the most important features for further analysis are Activity Minutes, Activity Month, Activity Day, Calories Burnt, Average Heart Rate, Maximum Heart Rate and Activity Distance. The remaining features, namely Average Speed, Steps and Elevation Gain, have a higher number of missing values and will not be treated as important in the analysis that follows, even though Average Speed and Steps could help me analyze the runs and find some correlations.

Visualizing Distribution of Continuous Variables

In [34]:
_ = running_data.hist(column=continuous_vars, figsize = (16,16))

vii. Visualizing Distribution of Categorical Variables

In [35]:
print(categorical_vars)
Index(['activityType', 'activityMonthName', 'activityDayName',
       'activityLevel'],
      dtype='object')
In [36]:
# Count plots of categorical variables

fig, axes = plt.subplots(4, 3, figsize=(16, 16))
plt.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=0.7, hspace=0.3)

for i, ax in enumerate(axes.ravel()):
    if i > 3:
        ax.set_visible(False)
        continue
    sns.countplot(y = categorical_vars[i], data=running_data, ax=ax)

viii. Multivariate Distributions for Continuous Data

In [37]:
_ = pd.scatter_matrix(running_data, alpha=0.2, figsize=(30, 30))

Analysis of the above visualization:

A scatterplot matrix is a great way to roughly determine whether there are linear correlations between multiple variables, and it is particularly helpful for pinpointing pairs of variables with similar relationships across the dataset. Some observations (a numeric cross-check follows below):

  • Calories burnt show an approximately linear correlation with the minutes of an activity and with the steps taken.
  • Heteroskedasticity is observed between calories and activityMins, and between activityMins and elevationGain.
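These visual impressions can be quantified with a plain Pearson correlation matrix over the continuous columns; a quick sketch, not in the original notebook:

running_data[continuous_vars].corr()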

IV. Executive Summary

I have been an avid runner and fitness enthusiast for the past 5 years, and have been capturing data about my training activities. One of the main motivations for this analysis was to gain insights into my running patterns, my strength training schedules, and anything else the data from my training sessions can reveal.

During the span of five years, I have gone through 4 wearable devices. Initially I used the Garmin Vivofit tracker for all the runs, which did not have a heart rate monitor. After that I used the Garmin Forerunner 920 (inbuilt GPS), which came with a heart rate strap and also calculated elevation gain and many more parameters for all the runs. The third tracker was the Garmin Vivofit2, which had an inbuilt heart rate monitor but lacked GPS, so runs were still tracked by the Forerunner 920, and activities tracked with the Vivofit2 lack heart rate information. Currently I am using the Garmin Vivoactive HR, which has a wrist HR monitor as well as inbuilt GPS.

In this summary, I will highlight the following main insights from the analysis:

  • At what hour of the day do I perform my activities, and how has that varied over the last 5 years?
  • How has the total activity time varied for every month of the past 5 years?
  • How has the running mileage varied over the last 5 years, and how does it compare to the activity times?

For the analysis of the hour of the day at which I perform my activities, I created the following interactive visualization using D3.js: Visualization Link (inspired by learnings from the course Storytelling with Data) Code


About the visualization: In this visualization, created using D3.js, the activities are plotted as a scatter plot over the time range 2012 to 2017. Since the activities span 5 years, I decided not to keep tick labels on the x-axis, but added a date range slider below it for interactivity (a date range can be chosen). The y-axis shows the time of day (0-24 hours), and the activities are plotted as points. Interactive buttons can be used to select any single activity type, like running or cycling, so that only its patterns are shown for a given date range.
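For reference, the hour-of-day data behind this scatter can be derived from the cleaned frame along the following lines; this is a sketch, with `d3_points` and the output filename being hypothetical, and the actual export used for the D3 page may differ:

d3_points = running_data[["startTime", "activityType"]].copy()
d3_points["hourOfDay"] = (running_data.startTime.dt.hour
                          + running_data.startTime.dt.minute / 60.0)
d3_points.to_csv("activity_hours.csv", index=False)  # hypothetical export name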

The main insights that can be obtained from the visualization are as follows:

  • Most of the activity sessions over the past 5 years have been in the evening, which clearly shows that I am not really a morning person.
  • Most of the morning sessions were runs: the practice runs done while training for marathons.
  • Since moving to New York in August 2016 (the rightmost part of the graph), I have mostly been doing strength sessions late at night, around 22:00, plus some really late night cycle rides along the riverside, which are clearly visible in the visualization. The increase in cycling sessions after moving to New York can be attributed to my annual Citibike membership and its daily use during the winter vacation after the Fall 2016 semester.
  • My fitness tracking journey started with mostly tracking runs (the leftmost part of the chart consists mostly of green dots), and as I bought better fitness trackers I slowly started capturing more data about the varied activities I did.
  • Looking at this visualization, I also realize that recently I have been missing the feel-good factor after my long runs, and need to get back to running soon. It is clearly visible that I have mostly been focusing on strength training sessions lately (the rightmost pink clusters).

To understand how the total activity time varied across specific months over the last 5 years, I created the following D3 visualization: Visualization Link (inspired by learnings from the course Storytelling with Data) Code


About the visualization: In this visualization, created using D3.js, the total aggregated minutes across activity types are plotted on the y-axis, showing the variation across months for the years 2012-2016. The trend lines are annotated for every year in the analysis. Hovering over the visualization highlights the closest data point and annotates it with the minutes spent in that month. I am using curved lines, which help convey the underlying variations in the data.
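The underlying series is a straightforward aggregation of activity minutes by year and month; a sketch of how it can be computed (the actual export code for the D3 page may differ):

activity_by_month = (running_data
                     .groupby([running_data.startTime.dt.year, "activityMonth"])["activityMins"]
                     .sum())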

The main insights that can be obtained from the visualization are as follows:

  • Tracking of workout sessions began in 2012, with about 200 minutes of activity captured during October 2012, but I was not yet much into tracking at that time.
  • During 2013, I tracked only my running activities, so most of the activity minutes for the months of 2013 come from runs.
  • During 2014, my running journey took off: I joined the Gurgaon Road Runners group and got inspired to track my runs more seriously. Even though I was still only capturing runs that year, I logged a decent running time during most months from March to May. Time spent running dropped because of the extreme heat in India during June and July, and then I started training for a half marathon, so the activity minutes picked up again in the closing months of 2014.
  • In 2015, I ran a half marathon in January and then had a cool-down period. Thanks to the more advanced fitness trackers bought during 2015, from June onwards I started capturing my strength training sessions along with the running sessions. I also began preparing for a half marathon in December and a full marathon in January, which is why the recorded activity minutes are high for the second half of 2015.
  • In 2016, I was running regularly and also training regularly in the gym to improve my runs, which is why the recorded activity is so high for that year. I ran my first full marathon in January 2016, hence the high activity recorded for that month. The reduction in activity minutes as 2016 progressed can be attributed to joining graduate school in NY, where I mostly recorded strength training sessions.

Even though the activity minutes reduced as 2016 progressed, I was still capturing all my training sessions other than running, and have decent activity minutes recorded. How much graduate school really impacted my running schedule will be clearly visible in the following visualization of just the running kilometers tracked for the years 2014-2016.

Finally, let's look at the running kilometers tracked for every month of the years 2014-2016. The result is shown in the following visualization created using D3.js: Visualization Link (inspired by learnings from the course Storytelling with Data) Code


About the visualization: In this visualization, created using D3.js, the running mileage is plotted for every month of the years 2014-2016, with mileage on the y-axis and months on the x-axis. I am only showing these years because they are when I was consistently capturing my running activities. The trend lines are annotated for every year. Hovering over the visualization highlights the closest data point and annotates it with the total kilometers run that month. As before, I am using curved lines, which help convey the underlying variations in the data.

The main insights that can be obtained from the visualization are as follows:

  • For 2014, the running mileage mostly followed the running season in India. The season starts around August and finishes with the Delhi Marathon and the Mumbai Marathon in December and January respectively, which is why more kilometers were run during those times of the year.
  • For 2015, the period after the Mumbai Marathon in January is when you rest and condition your body for the next running season. Even though the previous plot shows very high recorded activity time for 2015, the actual running mileage stays low until the season begins around August-September. As I was preparing for my first full marathon in January 2016, the higher kilometers in November and December 2015 are due to the long practice runs for it.
  • For 2016, the running kilometers are high in January, when I was doing practice runs and also ran my first full marathon. After that, I kept to about one long run a week for conditioning, alongside the post-season rest. In July-August 2016 I prepared for and made the move to NY for graduate school. During my first months in NY I was still logging kilometers, thanks to Central Park being near my house and favorable weather, but as the semester proceeded, the workload increased and temperatures turned extreme, so the monthly mileage dropped to less than 10 km, which is insignificant compared to previous years' peak running months, i.e., November and December. But I hope to be back to running soon; doing this analysis made me realize how important running is to my happiness.

We will now proceed to the section where we try to extract more trends from the data, analyze the distributions of the important variables in the dataset, and draw lessons from all the visualizations created from the data.

V. Main Data Analysis and Visualizations

In this section, we will analyze the important features identified through the data quality analysis, namely Activity Minutes, Activity Month, Activity Day and Calories Burnt.

i. Analysis of the feature - Activity Minutes

a. Distribution histogram with overlaid density plots, at varying binwidths

In [41]:
%%R -i running_data -w 900 -h 480 -u px

require("ggplot2")

g1 <- ggplot(running_data, aes(x=activityMins)) + 
    geom_histogram(aes(y=..count..),      # histogram of counts on the y-axis
                   binwidth=1,
                   colour="black", fill="white") +
    geom_density(aes(y=..count..), alpha=.2, fill="#FF6666") +  # overlay a transparent density estimate, scaled to counts
    ylab("Count") +
    xlab("Activity Minutes") +
    ggtitle("Activity Minutes overlayed with density estimate : Binwidth 1") 
g1
In [42]:
%%R -i running_data -w 900 -h 480 -u px

require("ggplot2")

multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  library(grid)

  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)

  numPlots = length(plots)

  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                    ncol = cols, nrow = ceiling(numPlots/cols))
  }

 if (numPlots==1) {
    print(plots[[1]])

  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))

    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))

      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}
g1 <- ggplot(running_data, aes(x=activityMins)) + 
    geom_histogram(aes(y=..count..),      # histogram of counts on the y-axis
                   binwidth=1,
                   colour="black", fill="white") +
    geom_density(aes(y=..count..), alpha=.2, fill="#FF6666") +  # overlay a transparent density estimate, scaled to counts
    ylab("Count") +
    xlab("Activity Minutes") +
    ggtitle("Activity Minutes overlayed with density estimate : Binwidth 1")

g2 <- ggplot(running_data, aes(x=activityMins)) + 
    geom_histogram(aes(y=..count..),      # histogram of counts on the y-axis
                   binwidth=2,
                   colour="black", fill="white") +
    geom_density(aes(y=..count..), alpha=.2, fill="#FF6666") +  # overlay a transparent density estimate, scaled to counts
    ylab("Count") +
    xlab("Activity Minutes") +
    ggtitle("Activity Minutes overlayed with density estimate : Binwidth 2")

g3 <- ggplot(running_data, aes(x=activityMins)) + 
    geom_histogram(aes(y=..count..),      # histogram of counts on the y-axis
                   binwidth=4,
                   colour="black", fill="white") +
    geom_density(aes(y=..count..), alpha=.2, fill="#FF6666") +  # overlay a transparent density estimate, scaled to counts
    ylab("Count") +
    xlab("Activity Minutes") +
    ggtitle("Activity Minutes overlayed with density estimate : Binwidth 4")

g4 <- ggplot(running_data, aes(x=activityMins)) + 
    geom_histogram(aes(y=..count..),      # histogram of counts on the y-axis
                   binwidth=6,
                   colour="black", fill="white") +
    geom_density(aes(y=..count..), alpha=.2, fill="#FF6666") +  # overlay a transparent density estimate, scaled to counts
    ylab("Count") +
    xlab("Activity Minutes") +
    ggtitle("Activity Minutes overlayed with density estimate : Binwidth 6")

multiplot(g1, g2, g3, g4, cols=2)

Analysis of the above visualization:

  1. The plot above shows the distribution of activity minutes on the x-axis, with the count of activities at each duration on the y-axis.
  2. The distribution is clearly multi-modal and skewed to the right. Varying the binwidths produces similar shapes, confirming the underlying pattern of the distribution.
  3. Reasons for the right skew include the practice runs while training for my first full marathon in 2016, the marathon itself, and instances when I forgot to stop the timer while recording activities.
  4. The activity minutes data was mostly clean, except for instances when I forgot to turn off the fitness tracker, resulting in abnormally long workouts; those were handled during the data cleaning process. (A pure-Python version of this plot is sketched below.)
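The same histogram-plus-density view can also be reproduced without leaving Python, using the seaborn already imported above. This sketch uses distplot, which was current in the seaborn of this notebook's era (newer seaborn versions replace it with histplot/displot):

_ = sns.distplot(running_data["activityMins"], bins=100, kde=True)
plt.xlabel("Activity Minutes")
plt.ylabel("Density")
plt.title("Activity Minutes with density estimate (seaborn)")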

b. Distribution of Activity Minutes across various Activities

In [43]:
%%R -i running_data -w 900 -h 480 -u px

require("ggplot2")
require("viridis")

g1 <- ggplot(running_data, aes(x=activityMins, fill=activityType)) + 
    geom_density(alpha=0.5, adjust = 1, na.rm = TRUE) +   # Overlay with transparent density plot
    ylab("Density") +
    xlab("Activity Minutes") + scale_fill_viridis(discrete=TRUE) +
    ggtitle("Distribution of Activity Minutes for the different Activity Types")
g1