A while ago I discovered that you can export all of your data in Apple Health. This includes things like sleep metrics, heart rates, exercise routes, steps per day, etc. - pretty much anything that is logged by an Apple Watch (or input by another app). I was curious to see how I could play around with it in R!

Export your Health Data:

For obvious reasons, I won't be sharing my own health export, but it it's pretty easy to get yours provided you have Apple Health set up. Just go into the app, then tap your profile picture, and at the bottom there'll be an option to export:

It'll take a while to do this, but once it's ready you can email it, upload to cloud storage, airdrop it, all the normal things. You'll need to unzip the export.zip file and place those files and folders where you can get to them.

The main file we'll be using is in xml format, something I haven't had to use much though it's not too bad to work with. The main problem I've encountered is that when reading xml files into R the uncompressed size is quite large and since R stores objects in its environment in memory, you might have to implement some memory-management techniques. For me, however, 16gb of RAM was more than enough.

Read in the XML file:

In R, we'll use the following libraries:

library(XML)
library(data.table)
library(lubridate)
library(ggplot2)
library(stringr)
library(leaflet)

The first thing will be to import the xml file:

# import
xml_dat = xmlParse("Data/apple_health_export/export.xml")

This will a few seconds. The xml file by itself was ~ 500 mb but when read in like this it takes nearly 3 gb of my machine's RAM. There are three main types of data we'll be exploring from here: workouts, records, and activity summaries. I'm splitting them here and doing some date transformations with lubridate.

# split records & clean times
health_dat = as.data.table(XML:::xmlAttrsToDataFrame(xml_dat["//Record"]))
health_dat[, startDate := ymd_hms(startDate, tz = "America/Denver")]
health_dat[, endDate := ymd_hms(endDate, tz = "America/Denver")]

# split workouts & clean times
workout_dat = as.data.table(XML:::xmlAttrsToDataFrame(xml_dat["//Workout"]))
workout_dat[, startDate := ymd_hms(startDate, tz = "America/Denver")]
workout_dat[, endDate := ymd_hms(endDate, tz = "America/Denver")]

# split summaries & clean times
summary_dat = as.data.table(XML:::xmlAttrsToDataFrame(xml_dat["//ActivitySummary"]))
summary_dat[, dateComponents := ymd(dateComponents)]

Metrics Available:

Which metrics I record will likely differ from yours, but you can look at which ones you have access to with these commands:

> unique(health_dat$type)
 [1] "HKQuantityTypeIdentifierDietaryWater"                  
 [2] "HKQuantityTypeIdentifierBodyMassIndex"                 
 [3] "HKQuantityTypeIdentifierHeight"                        
 [4] "HKQuantityTypeIdentifierBodyMass"                      
 [5] "HKQuantityTypeIdentifierHeartRate"                     
 [6] "HKQuantityTypeIdentifierRespiratoryRate"               
 [7] "HKQuantityTypeIdentifierStepCount"                     
 [8] "HKQuantityTypeIdentifierDistanceWalkingRunning"        
 [9] "HKQuantityTypeIdentifierBasalEnergyBurned"             
[10] "HKQuantityTypeIdentifierActiveEnergyBurned"            
[11] "HKQuantityTypeIdentifierFlightsClimbed"                
[12] "HKQuantityTypeIdentifierDietaryFatTotal"               
[13] "HKQuantityTypeIdentifierDietaryFatSaturated"           
[14] "HKQuantityTypeIdentifierDietaryCholesterol"            
[15] "HKQuantityTypeIdentifierDietarySodium"                 
[16] "HKQuantityTypeIdentifierDietaryCarbohydrates"          
[17] "HKQuantityTypeIdentifierDietaryFiber"                  
[18] "HKQuantityTypeIdentifierDietarySugar"                  
[19] "HKQuantityTypeIdentifierDietaryEnergyConsumed"         
[20] "HKQuantityTypeIdentifierDietaryProtein"                
[21] "HKQuantityTypeIdentifierDietaryPotassium"              
[22] "HKQuantityTypeIdentifierInhalerUsage"                  
[23] "HKQuantityTypeIdentifierAppleExerciseTime"             
[24] "HKQuantityTypeIdentifierDietaryCaffeine"               
[25] "HKQuantityTypeIdentifierRestingHeartRate"              
[26] "HKQuantityTypeIdentifierVO2Max"                        
[27] "HKQuantityTypeIdentifierWalkingHeartRateAverage"       
[28] "HKQuantityTypeIdentifierEnvironmentalAudioExposure"    
[29] "HKQuantityTypeIdentifierHeadphoneAudioExposure"        
[30] "HKQuantityTypeIdentifierWalkingDoubleSupportPercentage"
[31] "HKQuantityTypeIdentifierSixMinuteWalkTestDistance"     
[32] "HKQuantityTypeIdentifierAppleStandTime"                
[33] "HKQuantityTypeIdentifierWalkingSpeed"                  
[34] "HKQuantityTypeIdentifierWalkingStepLength"             
[35] "HKQuantityTypeIdentifierWalkingAsymmetryPercentage"    
[36] "HKQuantityTypeIdentifierStairAscentSpeed"              
[37] "HKQuantityTypeIdentifierStairDescentSpeed"             
[38] "HKDataTypeSleepDurationGoal"                           
[39] "HKQuantityTypeIdentifierAppleWalkingSteadiness"        
[40] "HKQuantityTypeIdentifierHeartRateRecoveryOneMinute"    
[41] "HKQuantityTypeIdentifierEnvironmentalSoundReduction"   
[42] "HKCategoryTypeIdentifierSleepAnalysis"                 
[43] "HKCategoryTypeIdentifierAppleStandHour"                
[44] "HKCategoryTypeIdentifierMindfulSession"                
[45] "HKCategoryTypeIdentifierAudioExposureEvent"            
[46] "HKQuantityTypeIdentifierHeartRateVariabilitySDNN"

And for workouts:

> unique(workout_dat$workoutActivityType)
[1] "HKWorkoutActivityTypeYoga"       "HKWorkoutActivityTypeWalking"   
[3] "HKWorkoutActivityTypeElliptical" "HKWorkoutActivityTypeCycling"   
[5] "HKWorkoutActivityTypeHiking"     "HKWorkoutActivityTypeRunning"

Each of these has a fairly self-explanatory name which is handy!

Example: Mindfulness Practice

Let's say I wanted to look at data from my mindfulness practice. I'll subset by the correct type as listed above:

# subset to mindfulness records
mindful_dat = health_dat[type == "HKCategoryTypeIdentifierMindfulSession"]

Using head(), I can see what the only meaningful information is the duration of the session which I can calculate using lubridate's difftime() function. I'm also adding a session ID variable.

# only meaningful item is duration
mindful_dat = mindful_dat[, .(startDate, endDate)]
mindful_dat[, sessionLength := difftime(endDate, startDate, units = "mins")]
mindful_dat[, sessionLength := as.numeric(sessionLength, units = "mins")]
mindful_dat[, sessionID := 1:.N]

Let's take a look:

> head(mindful_dat)
             startDate             endDate sessionLength sessionID
1: 2021-04-05 11:33:23 2021-04-05 11:34:23       1.00000         1
2: 2021-04-05 14:24:49 2021-04-05 14:29:49       5.00000         2
3: 2021-04-05 22:17:18 2021-04-05 22:18:18       1.00000         3
4: 2021-04-06 14:56:49 2021-04-06 15:08:20      11.51667         4
5: 2021-04-07 09:35:53 2021-04-07 09:48:58      13.08333         5
6: 2021-04-11 18:46:13 2021-04-11 18:51:13       5.00000         6

With this, I can definitely make a plot of my session lengths over time... but that's not very interesting. Instead, I'll match up this data with my heart rate data to see what my heart rate looks like during practice!

# subset to heart rate records
heart_dat = health_dat[type == "HKQuantityTypeIdentifierHeartRate"]
heart_dat = heart_dat[, .(startDate, endDate, HR = as.numeric(value))]

Now I'll loop through my data and get the heart rates before, during, and after each of the sessions to see if there's a difference.

# subset further to records within ± 30 minutes of session, not including workouts
hr_mind = data.table()
for (record in 1:nrow(mindful_dat)) {
  # subset mindful record
  mindful_record = mindful_dat[record]
  
  # add ± 30 minute variables
  mindful_record[, startDateMinus := startDate - minutes(30)]
  mindful_record[, endDatePlus := endDate + minutes(30)]
  plusminus = c(mindful_record$startDateMinus, mindful_record$endDatePlus)
  actual = c(mindful_record$startDate, mindful_record$endDate)
  
  # subset heart rate records
  hr_record = heart_dat[startDate %between% plusminus | endDate %between% plusminus]
  hr_record[, duringSession := ifelse(startDate %between% actual |
                                        endDate %between% actual, 1, 0)]
  
  # remove HR's during workout
  for (workout in 1:nrow(workout_dat)) {
    
    # subset workout record
    workout_record = workout_dat[workout]
    
    # find interval
    workout_interval = c(workout_record$startDate, workout_record$endDate)
    
    # label in hr_record
    hr_record[, duringWorkout := ifelse(startDate %between% workout_interval |
                                          endDate %between% workout_interval, 1, 0)]
  }
  hr_record = hr_record[duringWorkout == 0]
  hr_record[, duringWorkout := NULL]
  
  # label session
  hr_record[, sessionID := mindful_record$sessionID]
  
  # save
  hr_mind = rbind(hr_mind, hr_record)
}

Now I have a data.table of heart rates, each with indicators as to whether they were during a session or not.

> head(hr_mind)
             startDate             endDate HR duringSession sessionID
1: 2021-04-05 11:05:32 2021-04-05 11:05:32 96             0         1
2: 2021-04-05 11:10:19 2021-04-05 11:10:19 96             0         1
3: 2021-04-05 11:13:39 2021-04-05 11:13:39 92             0         1
4: 2021-04-05 11:12:59 2021-04-05 11:12:59 94             0         1
5: 2021-04-05 11:21:16 2021-04-05 11:21:16 94             0         1
6: 2021-04-05 11:21:50 2021-04-05 11:21:50 96             0         1

From here, I'll look at a boxplot comparing my heart rates both during and outside of a session:

# convert duringSession to factor
hr_mind[, duringSession := as.factor(duringSession)]

# look at boxplot
ggplot(hr_mind, aes(x = duringSession, y = HR, fill = duringSession)) +
  geom_boxplot() +
  theme_minimal()

This reminds me of a 2-sample t-test, let's remove the outliers and do that!

# remove outliers
hr_mind = hr_mind[,.SD[HR < quantile(HR, probs = 0.95)], by = duringSession]

# look at boxplot (again)
ggplot(hr_mind, aes(x = duringSession, y = HR, fill = duringSession)) +
  geom_boxplot() +
  theme_minimal()

# lm
mod = lm(HR ~ duringSession, data = hr_mind)
summary(mod)

> summary(mod)

Call:
lm(formula = HR ~ duringSession, data = hr_mind)

Residuals:
    Min      1Q  Median      3Q     Max 
-41.911 -14.911  -3.911  11.743  43.089 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)     84.911      0.476 178.377  < 2e-16 ***
duringSession   -7.654      1.183  -6.469 1.27e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 18.32 on 1765 degrees of freedom
Multiple R-squared:  0.02316,   Adjusted R-squared:  0.02261 
F-statistic: 41.85 on 1 and 1765 DF,  p-value: 1.273e-10

Well there's statistical evidence to suggest that my heart rates during and outside mindfulness sessions is different, though practically it might be that great of a difference. It is, however, interesting that the variation in my heart rate is much smaller when practicing.

Example: Hiking Routes

When exporting data, you'll also have a folder with gpx files that contain workout route information. I'm going to grab the file for my most recent hike and plot it:

# get hikes
hikes = workout_dat[workoutActivityType == "HKWorkoutActivityTypeHiking"]

# get most recent hike date
hike_date = as_date(hikes[endDate == max(endDate)]$endDate)

# fetch .gpx file
route_list = list.files("Data/apple_health_export/workout-routes/")
route = str_which(route_list, paste0(hike_date))
filename = paste0("Data/apple_health_export/workout-routes/",route_list[route])
hike_gpx = htmlTreeParse(file = filename, useInternalNodes = TRUE)

# get coords & elevation
coords = xpathSApply(hike_gpx, path = "//trkpt", fun = xmlAttrs)
elevation = xpathSApply(hike_gpx, path = "//trkpt/ele", fun = xmlValue)

# create data.table
hike_route = data.table(
  LAT = as.numeric(coords["lat",]),
  LON = as.numeric(coords["lon",]),
  ELEVATION = as.numeric(elevation)
)

# leaflet
leaflet() %>%
  addTiles() %>%
  addPolylines(data = hike_route,
               lat = ~ LAT,
               lng = ~ LON,
               color = "#AE2573")

My most recent hike was to the Hidden Falls in Curt Gowdy State Park here in Wyoming. Let's see the plot:

We can zoom in on the plot to see the level of detail the gpx files contain:

Neat! We can also look at the elevation change during the hike:

# ggplot elevation
hike_route[, TIME := 1:.N]
ggplot(hike_route, aes(x = TIME, y = ELEVATION)) + 
  geom_line(linewidth = 2) +
  theme_minimal()

Example: Standing Hours

Finally, to work with the summary records, let's look at a plot of my standing hours over time. This time, no data manipulation is really required:

# plot
ggplot(summary_dat, aes(x = dateComponents, y = as.numeric(appleStandHours))) +
  geom_point(color = "#7cb7a3", size = 2) + 
  geom_smooth(color = "#AE2573", fill = "#AE2573") +
  xlab("Date") +
  ylab("Standing Hours") +
  theme_minimal()

Summary

At any rate, there's a bunch of fun and interesting things to look at in the Apple Health data exports and this post has showcased a few of those things. I hope you have fun exploring!

All code can be found on this GitHub Gist.

Garrett C.

Apple Watch Data in R