Multiple Linear Regression Tutorial

This document will serve as an introduction to building multiple linear regression models between reference grade data and low-cost sensor data.

For the purpose of this tutorial, we will need the packages lubridate, tidyverse (which includes the packages dplyr, stringr, readr, purr, tibble and ggplot2), caTools and SimDesign. You can install packages by typing in the r-console install.packages(“package”).

Loading required libraries

library(tidyverse)
library(lubridate)
library(SimDesign)
library(caTools)

Loading and Cleaning Data

We will begin with a folder of multiple .csv files containing the purple air data. We will first set our working directory to this folder in order to load the files.

Load in Data

As we have many files to load and they are all in one folder that includes files of other file types, we must be careful with how we load the csv files. To do this we will load all of the csv files into a list, select the columns of interest, use string scraping methods to correctly format the data and then combine all the files into one dataset for analysis.

Here we are loading all of the files into a list and also creating a vector of strings that list the columns that I am interested in for analysis. Purple Air’s have two sensors, “a” and “b”. I am interested in \(PM_{2.5}\) data, so I will load data from both sensors. For this data, I am interested in the values with the “atm” correction factor. I will also load the columns that will be used in my regression analysis.

Note : There are many files, so it may take a while to load all of them

# Set the Working Directory
setwd("C:/Users/cmcfa/OneDrive/Desktop/2020 Research/Accra/aqm")

#unpack all csv files into a list
purplefile <- list.files(pattern = ".csv")

#list columns of interest
impcolum <- c("UTCDateTime","current_temp_f", "current_humidity","current_dewpoint_f", "pressure","pm2_5_atm","pm2_5_atm_b")

Next I will begin to build a data frame to combine all of these files into one data frame object. I will do this by creating an empty matrix with the number of columns that I will want in my file data frame. I will convert this matrix to a data frame and then name the columns appropriately.

Note: I am making this document in rmarkdown, which requires you to reset the working directory for each code chunk. In an actual rscript, you would not need to set the working directory repeatedly, unless you intend to change it.

# Set the Working Directory
setwd("C:/Users/cmcfa/OneDrive/Desktop/2020 Research/Accra/aqm")
#make an empty data frame to store values that we are interested in
purple <- data.frame(matrix(ncol = 7))
colnames(purple) <- impcolum

#Loop through list of csv files
for(i in purplefile){
  file <- read_csv(i, col_names = TRUE, quote = "") #read in each file, including column names, note in this file, character strings are not delimited by any special character so quote is set to an empty string ""
  file2 <- subset(file, select = impcolum) #select columns of interest from each file
  purple <- rbind(purple, file2) #add subsetted data frame to data frame we initialized before hand
}

#remove null values from data frame                     
purple <- na.omit(purple)

Clean Data

Now that we have a single data frame with all of my data, we must make sure it is in a format suitable for analysis. For this we will use text scraping methods. The code shown below works in this particular instance, but these steps are very dependent on how your code is structured. The goal is to find and exploit patterns in the text in your data. I find the following link very helpful: https://stringr.tidyverse.org/

This “cheat sheet” is also very helpful:

When exploring the data we see that there are many random characters in the date column. Our goal is to remove those characters and then reformat the date column into a format suitable for analysis.

#replace random letters in date column with spaces 
purple$UTCDateTime <- gsub("[A-Za-z]", " ", purple$UTCDateTime)

#Extract characters in date format

#values in brackets  represent range of possible values, values in braces represent number of values to search for
#desired pattern for dates:
pattern <- "[0-9]{4}[/][0-9]{2}[/][0-9]{2}[ ][0-9]{2}[:][0-9]{2}[:][0-9]{2}" 
purple$UTCDateTime <- regmatches(purple$UTCDateTime, regexpr(pattern, purple$UTCDateTime)) 

#convert column to date data type, note time zone is UTC
purple$UTCDateTime <- as.POSIXct(purple$UTCDateTime, format = "%Y/%m/%d %H:%M:%OS", tz = "UTC")

#For these purposes, I am only interested in values after March 2020 by subsetting the data frame using indexing
purple <- purple[purple$UTCDateTime >= as.POSIXct("2020-03-01", tz = "UTC"),]

Next, I made some preliminary plots of the data to inspect for outliars. We will discuss plotting later in this tutorial, but here is the code that I used to remove outliars.

We then can export the data for later use.

#I preliminarily plotted the date vs pm2.5 data to visulaize outliars
#remove rows 2020-06-23 22:09:31, 2020-09-04 00:02:00 as outliars. Determined by inspection of values >400 that did not correlate with adjacent values

#It may also be worth performing premlinary checks on temperature and RH outliars

purple <- purple[purple$UTCDateTime != as.POSIXct("2020-06-23 22:09:31", tz = "UTC"),] 
purple <- purple[purple$UTCDateTime != as.POSIXct("2020-09-04 00:02:00", tz = "UTC"),]

#export to a single csv for later use
write.csv(purple, "C:/Users/cmcfa/OneDrive/Desktop/2020 Research/Accra/Purpleair_clean_02_15_22.csv", row.names = FALSE)

Building Multiple Linear Regression Models

To build these models I need both the low cost sensor data and the reference grade data. To begin, I will import the reference grade data. It is also important to pay attention to time zones. The Embassy data is in local time, which just so happens to be GMT/UTC, which is the same as the Purple Air data,

Note: The reference grade data is already on an hourly level, where as the Purple Air Data is taken in approximately 90 second intervals.

I will also reload the clean Purple Air Data. Although this data frame was created above, I am sourcing froma previously existing rscript, which is why this step may appear to be redundant.

Load in Reference Grade Data

#set the working directory to the location of the Embassy data
setwd("C:/Users/cmcfa/OneDrive/Desktop/2020 Research/Accra/")

#load in data
purpledata <- read.csv("Purpleair_clean_11_18_20.csv")
embassydata <- read.csv("Accra_PM2.5_2020_YTD.csv")

#make sure dates are in a date data type
purpledata$date <- as.POSIXct(purpledata$date, format = "%Y-%m-%d %H:%M:%OS", tz = "UTC")
embassydata$date <- as.POSIXct(embassydata$Date..LT., format = "%Y-%m-%d %I:%M %p", tz = "UTC")

# I only want embassy dates that align with Purple Air data-- past march 2020 and before september 4th 2020
embassydata <- filter(embassydata, 
                      between(embassydata$date, 
                              as.POSIXct("2020-03-02",tz = "UTC"),
                              as.POSIXct("2020-09-04", tz = "UTC")))

#only keep columns relevant to MLR model
embassydata <- embassydata[,c("date", "Raw.Conc.")]
embassydata <- embassydata[embassydata$Raw.Conc. > 0 ,]

For the purposes of this tutorial, we will be creating a daily averaged MLR model. Yoyu could easily create and hourly model by using the hourly Embassy data and hour averaging the Purple Air data.

Here we will format the data into a format suitable for analysis and then day average the data.

Note: In the case of day averaging I round to the “floor” meaning, any time past midnight is associated with the day it is stamped with. If I wanted to hour average the data I would need to be careful. The hourly Embassy data begins at 1:00 AM. This means the values are averaged to the ceiling, so I would want to ceiling average the hourly Purple Air values. This could be done by replacing the term floor with ceiling in the code below.

Day Average Data

#Average the two Purple Air Sensors
purpledata$pm2_5avg <- ave(purpledata$pm2_5_atm, purpledata$pm2_5_atm_b)

#put temperature in celsius
purpledata$temp_C <- (purpledata$current_temp_f - 32)*(5/9)
purpledata$dewpoint_C <- (purpledata$current_dewpoint_f - 32)*(5/9)

#day average data
purpleday <- 
  purpledata %>% #%>% is called the "pipe" and acts as a way to compose functions within each other
  
  group_by(date = floor_date(date, "day")) %>% #if I wanted to do hourly, I would use ceiling_date(date, "hour"). This line of code creates a column date by grouping all date values who have the same floor_date values
  
  dplyr::summarise( #summarise creates a new data frame with the information listed in its parentheses
    purpleair_pm2.5 = mean(pm2_5avg, na.rm = TRUE), #average these values within these groups, the na.rm term removes any NA values from being considered in the average. If an NA term is included, the mean will return as an NA value
    temp = mean(temp_C, na.rm = TRUE), 
    RH = mean(current_humidity, na.rm = TRUE),
    dewpoint = mean(dewpoint_C, na.rm = TRUE))

embassyday <- 
  embassydata %>%
  group_by(date = floor_date(date, "day")) %>% 
  dplyr::summarise(
    embassy_pm2.5 = mean(Raw.Conc., na.rm = TRUE))

#merge the two day averaged data frames into one based on the dates that they have in common
purple_embassy_day <- merge(purpleday, embassyday, by = "date")

#For some reason the Embassy values reports negative PM2.5 values, these are assumed to be outliars so we will remove them
purple_embassy_day <- purple_embassy_day[purple_embassy_day$embassy_pm2.5 > 0 ,]

Build Models

Now we will actually build the models. We will begin by splitting the data into test and training data sets. This is to prevent over-fitting of the model. By having a separate testing dataset, we can get more accurate model summary statistics. We will use a 75/25 split, which is industry standard. As splitting the data is a random process, we must set a “seed”, which is a basis for the randomization of the process which will allow the results to be split the same each time and for the model to be reproducible.

#set the seed so the results are reproducible
set.seed(70) 

#split the sample into test and training sets
sample = sample.split(purple_embassy_day$purpleair_pm2.5, SplitRatio = .75)
train = subset(purple_embassy_day, sample == TRUE)
test  = subset(purple_embassy_day, sample == FALSE)

#We will create a MLR model excluding dew point for the purpose of this tutorial. The lm function takes in the formula for the regression as y ~ dependent variables in addition to the dataframe to pull from
model <- lm(embassy_pm2.5 ~ purpleair_pm2.5 + temp + RH, data = train)
model_summary <- summary(model)

#view model summary-- note the statistics here are applied to the training set only
model_summary

## 
## Call:
## lm(formula = embassy_pm2.5 ~ purpleair_pm2.5 + temp + RH, data = train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.156 -1.692 -0.306  1.148 18.519 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -47.10795   15.07404  -3.125  0.00217 ** 
## purpleair_pm2.5   0.54421    0.03246  16.765  < 2e-16 ***
## temp              1.52128    0.25218   6.032 1.44e-08 ***
## RH                0.12442    0.13125   0.948  0.34482    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.099 on 136 degrees of freedom
## Multiple R-squared:  0.6747, Adjusted R-squared:  0.6675 
## F-statistic: 94.02 on 3 and 136 DF,  p-value: < 2.2e-16

#we can also view the coefficients alone with the expression
coefficients(model)

##     (Intercept) purpleair_pm2.5            temp              RH 
##     -47.1079497       0.5442138       1.5212789       0.1244192

Evaluate Model with Summary Statistics

We can then apply the model to the test data frame and calculate statistics like R, \(R^2\) and MAE. It is also good to calculate statistics on the raw data in order to establish a baseline for comparision.

#apply the model to the test dataset (good for reporting statistics)
test$pm2.5_c <- predict(model, test)

#apply to the entire dataset (good for plotting since the test dataframe is so small )
purple_embassy_day$pm2.5_c<- predict(model, purple_embassy_day)

#ca
model_stats <- data.frame(
  pearsoncorrtest = cor(test$pm2.5_c, test$embassy_pm2.5, use = "everything", method = "pearson"),  #R value for test data
  pearsoncorrall = cor(purple_embassy_day$pm2.5_c, purple_embassy_day$embassy_pm2.5, use = "everything", method = "pearson"), #r value for the entire data frame
  pearsoncorrraw = cor(purple_embassy_day$purpleair_pm2.5 , purple_embassy_day$embassy_pm2.5, use = "everything", method = "pearson"), #r value for the entire raw data 
                       
  R2_test = (cor(test$pm2.5_c, test$embassy_pm2.5, use = "everything", method = "pearson"))^2, #R^2 for test data
  R2_all = (cor(purple_embassy_day$pm2.5_c, purple_embassy_day$embassy_pm2.5, use = "everything", method = "pearson"))^2, #R^2 value for the entire data frame
  R2_raw = (cor(purple_embassy_day$purpleair_pm2.5, purple_embassy_day$embassy_pm2.5, use = "everything", method = "pearson"))^2, #R^2 value for the raw data
  
  MAE_test = MAE(test$pm2.5_c, test$embassy_pm2.5), #MAE for test data
  MAE_all = MAE(purple_embassy_day$pm2.5_c, purple_embassy_day$embassy_pm2.5),
  MAE_raw = MAE(purple_embassy_day$purpleair_pm2.5, purple_embassy_day$embassy_pm2.5)
  )

#print the statistics
model_stats

##   pearsoncorrtest pearsoncorrall pearsoncorrraw   R2_test    R2_all    R2_raw
## 1       0.8424771      0.8264668      0.6891771 0.7097676 0.6830473 0.4749651
##   MAE_test  MAE_all  MAE_raw
## 1 2.514547 2.237679 6.498295

You can also further evaluate dta by plotting residuals of different variables. Residuals of models should be randomly distributed. If they are not, it implies there may be another relationship present that is not caputred by the model. This suggests a potential need to explore other models.

Plot Data

We will create a general plot of corrected data versus raw data. Note the theme we are using is one that I have custom made. R has many premade themes available which may be easiest for beginners. These themes can be found here: https://ggplot2.tidyverse.org/reference/ggtheme.html

To make plots I use ggplot2. In order to access the fonts I like, I load additional libraries. This step is optional.

Ggplot works by creating a base ggplot object and then adding on different plot features

#libraries for extra fonts
library(ggpubr)
library(extrafont)
extrafont::loadfonts(device="win")
font = "Calibri"

#General theme I use for my plots. Information on how I developed this theme can be found here: https://ggplot2.tidyverse.org/reference/theme.html

#I would also encourage to read about the role of "aesthetics" in ggplot: https://ggplot2.tidyverse.org/articles/ggplot2-specs.html

theme_ <- theme(legend.key = element_rect(fill = "white"), 
        legend.text = element_text(family = font, size = 12),
        legend.text.align = 0,
        plot.title = element_text(hjust = 0.5), 
        legend.title = element_blank(),
        legend.position = "top",
        legend.direction = "horizontal", 
        axis.text.x = element_text(family = font, size = 12),
        axis.text.y = element_text(family = font, size = 12),
        axis.title.x = element_text(family = font, face = "bold", size = 13),
        axis.title.y = element_text(family = font, face = "bold",size = 13),
        axis.line = element_blank(),
        axis.ticks = element_line(color = "gray90"),
        panel.background = element_blank(),
        panel.grid.major = element_line(color = "gray90"),
        panel.grid.minor = element_blank(),
        strip.background = element_blank())

plot <- ggplot(data = purple_embassy_day) + #set dataframe to use in plot, note you could leave this blank and include a data = "" in each geom object you add
  
  geom_line(aes(x = date, y = purpleair_pm2.5, color = "Purple Air")) + #create a line fo the raw purple air data. By including the name in the aesthetics, you create a dummy variable that you can assign a color to in the legend
  
  geom_line(aes(x = date, y = embassy_pm2.5, color = "Embassy")) + #embassy data
  
  geom_line(aes(x = date, y = pm2.5_c, color = "Corrected Purple Air ")) + #corrected data
  
  labs(x = "Date (2020)" , y = expression(bold(paste(PM[2.5]~ "(", mu, "g/m"^"3",")")))) + #set x and y labels. In order to get subscripts and greek letters you must use mathematical expressions. Information on this can be found here:
  #https://astrostatistics.psu.edu/su07/R/html/grDevices/html/plotmath.html
  
  scale_colour_manual(values = c("orange", "mediumturquoise", "purple4")) + #this assigns colors to the aesthetics you set, it uses alphabetical order 
  
  ggtitle(label = expression(bold(paste("Calibrated Purple Air ", PM[2.5])))) +
  
  scale_x_datetime(breaks = seq(as.POSIXct("2020-03-01"), as.POSIXct("2020-10-01"), by = "1 month"), date_labels = "%b", expand = expansion(mult = c(0.005,0.01))) +  #set x axis to display dates by one month at a time. %b denotes letter type abbreviation of dates, the expand term formats how the x axis positions itself next to the y-axis (refer to ggplot documentation)
    theme_

show(plot)

#saves the last plot shown to working directory
ggsave("pm2.5FATM_pa_emb_dailyMLR_02_15.png", width = 8, height = 7)