This document will serve as an introduction to building multiple linear regression models between reference grade data and low-cost sensor data.
For the purpose of this tutorial, we will need the packages lubridate, tidyverse (which includes the packages dplyr, stringr, readr, purr, tibble and ggplot2), caTools and SimDesign. You can install packages by typing in the r-console install.packages(“package”).
library(tidyverse)
library(lubridate)
library(SimDesign)
library(caTools)
We will begin with a folder of multiple .csv files containing the purple air data. We will first set our working directory to this folder in order to load the files.
As we have many files to load and they are all in one folder that includes files of other file types, we must be careful with how we load the csv files. To do this we will load all of the csv files into a list, select the columns of interest, use string scraping methods to correctly format the data and then combine all the files into one dataset for analysis.
Here we are loading all of the files into a list and also creating a vector of strings that list the columns that I am interested in for analysis. Purple Air’s have two sensors, “a” and “b”. I am interested in \(PM_{2.5}\) data, so I will load data from both sensors. For this data, I am interested in the values with the “atm” correction factor. I will also load the columns that will be used in my regression analysis.
Note : There are many files, so it may take a while to load all of them
# Set the Working Directory
setwd("C:/Users/cmcfa/OneDrive/Desktop/2020 Research/Accra/aqm")
#unpack all csv files into a list
purplefile <- list.files(pattern = ".csv")
#list columns of interest
impcolum <- c("UTCDateTime","current_temp_f", "current_humidity","current_dewpoint_f", "pressure","pm2_5_atm","pm2_5_atm_b")
Next I will begin to build a data frame to combine all of these files into one data frame object. I will do this by creating an empty matrix with the number of columns that I will want in my file data frame. I will convert this matrix to a data frame and then name the columns appropriately.
Note: I am making this document in rmarkdown, which requires you to reset the working directory for each code chunk. In an actual rscript, you would not need to set the working directory repeatedly, unless you intend to change it.
# Set the Working Directory
setwd("C:/Users/cmcfa/OneDrive/Desktop/2020 Research/Accra/aqm")
#make an empty data frame to store values that we are interested in
purple <- data.frame(matrix(ncol = 7))
colnames(purple) <- impcolum
#Loop through list of csv files
for(i in purplefile){
file <- read_csv(i, col_names = TRUE, quote = "") #read in each file, including column names, note in this file, character strings are not delimited by any special character so quote is set to an empty string ""
file2 <- subset(file, select = impcolum) #select columns of interest from each file
purple <- rbind(purple, file2) #add subsetted data frame to data frame we initialized before hand
}
#remove null values from data frame
purple <- na.omit(purple)
Now that we have a single data frame with all of my data, we must make sure it is in a format suitable for analysis. For this we will use text scraping methods. The code shown below works in this particular instance, but these steps are very dependent on how your code is structured. The goal is to find and exploit patterns in the text in your data. I find the following link very helpful: https://stringr.tidyverse.org/
This “cheat sheet” is also very helpful:
When exploring the data we see that there are many random characters in the date column. Our goal is to remove those characters and then reformat the date column into a format suitable for analysis.
#replace random letters in date column with spaces
purple$UTCDateTime <- gsub("[A-Za-z]", " ", purple$UTCDateTime)
#Extract characters in date format
#values in brackets represent range of possible values, values in braces represent number of values to search for
#desired pattern for dates:
pattern <- "[0-9]{4}[/][0-9]{2}[/][0-9]{2}[ ][0-9]{2}[:][0-9]{2}[:][0-9]{2}"
purple$UTCDateTime <- regmatches(purple$UTCDateTime, regexpr(pattern, purple$UTCDateTime))
#convert column to date data type, note time zone is UTC
purple$UTCDateTime <- as.POSIXct(purple$UTCDateTime, format = "%Y/%m/%d %H:%M:%OS", tz = "UTC")
#For these purposes, I am only interested in values after March 2020 by subsetting the data frame using indexing
purple <- purple[purple$UTCDateTime >= as.POSIXct("2020-03-01", tz = "UTC"),]
Next, I made some preliminary plots of the data to inspect for outliars. We will discuss plotting later in this tutorial, but here is the code that I used to remove outliars.
We then can export the data for later use.
#I preliminarily plotted the date vs pm2.5 data to visulaize outliars
#remove rows 2020-06-23 22:09:31, 2020-09-04 00:02:00 as outliars. Determined by inspection of values >400 that did not correlate with adjacent values
#It may also be worth performing premlinary checks on temperature and RH outliars
purple <- purple[purple$UTCDateTime != as.POSIXct("2020-06-23 22:09:31", tz = "UTC"),]
purple <- purple[purple$UTCDateTime != as.POSIXct("2020-09-04 00:02:00", tz = "UTC"),]
#export to a single csv for later use
write.csv(purple, "C:/Users/cmcfa/OneDrive/Desktop/2020 Research/Accra/Purpleair_clean_02_15_22.csv", row.names = FALSE)
To build these models I need both the low cost sensor data and the reference grade data. To begin, I will import the reference grade data. It is also important to pay attention to time zones. The Embassy data is in local time, which just so happens to be GMT/UTC, which is the same as the Purple Air data,
Note: The reference grade data is already on an hourly level, where as the Purple Air Data is taken in approximately 90 second intervals.
I will also reload the clean Purple Air Data. Although this data frame was created above, I am sourcing froma previously existing rscript, which is why this step may appear to be redundant.
#set the working directory to the location of the Embassy data
setwd("C:/Users/cmcfa/OneDrive/Desktop/2020 Research/Accra/")
#load in data
purpledata <- read.csv("Purpleair_clean_11_18_20.csv")
embassydata <- read.csv("Accra_PM2.5_2020_YTD.csv")
#make sure dates are in a date data type
purpledata$date <- as.POSIXct(purpledata$date, format = "%Y-%m-%d %H:%M:%OS", tz = "UTC")
embassydata$date <- as.POSIXct(embassydata$Date..LT., format = "%Y-%m-%d %I:%M %p", tz = "UTC")
# I only want embassy dates that align with Purple Air data-- past march 2020 and before september 4th 2020
embassydata <- filter(embassydata,
between(embassydata$date,
as.POSIXct("2020-03-02",tz = "UTC"),
as.POSIXct("2020-09-04", tz = "UTC")))
#only keep columns relevant to MLR model
embassydata <- embassydata[,c("date", "Raw.Conc.")]
embassydata <- embassydata[embassydata$Raw.Conc. > 0 ,]
For the purposes of this tutorial, we will be creating a daily averaged MLR model. Yoyu could easily create and hourly model by using the hourly Embassy data and hour averaging the Purple Air data.
Here we will format the data into a format suitable for analysis and then day average the data.
Note: In the case of day averaging I round to the “floor” meaning, any time past midnight is associated with the day it is stamped with. If I wanted to hour average the data I would need to be careful. The hourly Embassy data begins at 1:00 AM. This means the values are averaged to the ceiling, so I would want to ceiling average the hourly Purple Air values. This could be done by replacing the term floor with ceiling in the code below.
#Average the two Purple Air Sensors
purpledata$pm2_5avg <- ave(purpledata$pm2_5_atm, purpledata$pm2_5_atm_b)
#put temperature in celsius
purpledata$temp_C <- (purpledata$current_temp_f - 32)*(5/9)
purpledata$dewpoint_C <- (purpledata$current_dewpoint_f - 32)*(5/9)
#day average data
purpleday <-
purpledata %>% #%>% is called the "pipe" and acts as a way to compose functions within each other
group_by(date = floor_date(date, "day")) %>% #if I wanted to do hourly, I would use ceiling_date(date, "hour"). This line of code creates a column date by grouping all date values who have the same floor_date values
dplyr::summarise( #summarise creates a new data frame with the information listed in its parentheses
purpleair_pm2.5 = mean(pm2_5avg, na.rm = TRUE), #average these values within these groups, the na.rm term removes any NA values from being considered in the average. If an NA term is included, the mean will return as an NA value
temp = mean(temp_C, na.rm = TRUE),
RH = mean(current_humidity, na.rm = TRUE),
dewpoint = mean(dewpoint_C, na.rm = TRUE))
embassyday <-
embassydata %>%
group_by(date = floor_date(date, "day")) %>%
dplyr::summarise(
embassy_pm2.5 = mean(Raw.Conc., na.rm = TRUE))
#merge the two day averaged data frames into one based on the dates that they have in common
purple_embassy_day <- merge(purpleday, embassyday, by = "date")
#For some reason the Embassy values reports negative PM2.5 values, these are assumed to be outliars so we will remove them
purple_embassy_day <- purple_embassy_day[purple_embassy_day$embassy_pm2.5 > 0 ,]
Now we will actually build the models. We will begin by splitting the data into test and training data sets. This is to prevent over-fitting of the model. By having a separate testing dataset, we can get more accurate model summary statistics. We will use a 75/25 split, which is industry standard. As splitting the data is a random process, we must set a “seed”, which is a basis for the randomization of the process which will allow the results to be split the same each time and for the model to be reproducible.
#set the seed so the results are reproducible
set.seed(70)
#split the sample into test and training sets
sample = sample.split(purple_embassy_day$purpleair_pm2.5, SplitRatio = .75)
train = subset(purple_embassy_day, sample == TRUE)
test = subset(purple_embassy_day, sample == FALSE)
#We will create a MLR model excluding dew point for the purpose of this tutorial. The lm function takes in the formula for the regression as y ~ dependent variables in addition to the dataframe to pull from
model <- lm(embassy_pm2.5 ~ purpleair_pm2.5 + temp + RH, data = train)
model_summary <- summary(model)
#view model summary-- note the statistics here are applied to the training set only
model_summary
##
## Call:
## lm(formula = embassy_pm2.5 ~ purpleair_pm2.5 + temp + RH, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.156 -1.692 -0.306 1.148 18.519
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -47.10795 15.07404 -3.125 0.00217 **
## purpleair_pm2.5 0.54421 0.03246 16.765 < 2e-16 ***
## temp 1.52128 0.25218 6.032 1.44e-08 ***
## RH 0.12442 0.13125 0.948 0.34482
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.099 on 136 degrees of freedom
## Multiple R-squared: 0.6747, Adjusted R-squared: 0.6675
## F-statistic: 94.02 on 3 and 136 DF, p-value: < 2.2e-16
#we can also view the coefficients alone with the expression
coefficients(model)
## (Intercept) purpleair_pm2.5 temp RH
## -47.1079497 0.5442138 1.5212789 0.1244192
We can then apply the model to the test data frame and calculate statistics like R, \(R^2\) and MAE. It is also good to calculate statistics on the raw data in order to establish a baseline for comparision.
#apply the model to the test dataset (good for reporting statistics)
test$pm2.5_c <- predict(model, test)
#apply to the entire dataset (good for plotting since the test dataframe is so small )
purple_embassy_day$pm2.5_c<- predict(model, purple_embassy_day)
#ca
model_stats <- data.frame(
pearsoncorrtest = cor(test$pm2.5_c, test$embassy_pm2.5, use = "everything", method = "pearson"), #R value for test data
pearsoncorrall = cor(purple_embassy_day$pm2.5_c, purple_embassy_day$embassy_pm2.5, use = "everything", method = "pearson"), #r value for the entire data frame
pearsoncorrraw = cor(purple_embassy_day$purpleair_pm2.5 , purple_embassy_day$embassy_pm2.5, use = "everything", method = "pearson"), #r value for the entire raw data
R2_test = (cor(test$pm2.5_c, test$embassy_pm2.5, use = "everything", method = "pearson"))^2, #R^2 for test data
R2_all = (cor(purple_embassy_day$pm2.5_c, purple_embassy_day$embassy_pm2.5, use = "everything", method = "pearson"))^2, #R^2 value for the entire data frame
R2_raw = (cor(purple_embassy_day$purpleair_pm2.5, purple_embassy_day$embassy_pm2.5, use = "everything", method = "pearson"))^2, #R^2 value for the raw data
MAE_test = MAE(test$pm2.5_c, test$embassy_pm2.5), #MAE for test data
MAE_all = MAE(purple_embassy_day$pm2.5_c, purple_embassy_day$embassy_pm2.5),
MAE_raw = MAE(purple_embassy_day$purpleair_pm2.5, purple_embassy_day$embassy_pm2.5)
)
#print the statistics
model_stats
## pearsoncorrtest pearsoncorrall pearsoncorrraw R2_test R2_all R2_raw
## 1 0.8424771 0.8264668 0.6891771 0.7097676 0.6830473 0.4749651
## MAE_test MAE_all MAE_raw
## 1 2.514547 2.237679 6.498295
You can also further evaluate dta by plotting residuals of different variables. Residuals of models should be randomly distributed. If they are not, it implies there may be another relationship present that is not caputred by the model. This suggests a potential need to explore other models.
We will create a general plot of corrected data versus raw data. Note the theme we are using is one that I have custom made. R has many premade themes available which may be easiest for beginners. These themes can be found here: https://ggplot2.tidyverse.org/reference/ggtheme.html
To make plots I use ggplot2. In order to access the fonts I like, I load additional libraries. This step is optional.
Ggplot works by creating a base ggplot object and then adding on different plot features
#libraries for extra fonts
library(ggpubr)
library(extrafont)
extrafont::loadfonts(device="win")
font = "Calibri"
#General theme I use for my plots. Information on how I developed this theme can be found here: https://ggplot2.tidyverse.org/reference/theme.html
#I would also encourage to read about the role of "aesthetics" in ggplot: https://ggplot2.tidyverse.org/articles/ggplot2-specs.html
theme_ <- theme(legend.key = element_rect(fill = "white"),
legend.text = element_text(family = font, size = 12),
legend.text.align = 0,
plot.title = element_text(hjust = 0.5),
legend.title = element_blank(),
legend.position = "top",
legend.direction = "horizontal",
axis.text.x = element_text(family = font, size = 12),
axis.text.y = element_text(family = font, size = 12),
axis.title.x = element_text(family = font, face = "bold", size = 13),
axis.title.y = element_text(family = font, face = "bold",size = 13),
axis.line = element_blank(),
axis.ticks = element_line(color = "gray90"),
panel.background = element_blank(),
panel.grid.major = element_line(color = "gray90"),
panel.grid.minor = element_blank(),
strip.background = element_blank())
plot <- ggplot(data = purple_embassy_day) + #set dataframe to use in plot, note you could leave this blank and include a data = "" in each geom object you add
geom_line(aes(x = date, y = purpleair_pm2.5, color = "Purple Air")) + #create a line fo the raw purple air data. By including the name in the aesthetics, you create a dummy variable that you can assign a color to in the legend
geom_line(aes(x = date, y = embassy_pm2.5, color = "Embassy")) + #embassy data
geom_line(aes(x = date, y = pm2.5_c, color = "Corrected Purple Air ")) + #corrected data
labs(x = "Date (2020)" , y = expression(bold(paste(PM[2.5]~ "(", mu, "g/m"^"3",")")))) + #set x and y labels. In order to get subscripts and greek letters you must use mathematical expressions. Information on this can be found here:
#https://astrostatistics.psu.edu/su07/R/html/grDevices/html/plotmath.html
scale_colour_manual(values = c("orange", "mediumturquoise", "purple4")) + #this assigns colors to the aesthetics you set, it uses alphabetical order
ggtitle(label = expression(bold(paste("Calibrated Purple Air ", PM[2.5])))) +
scale_x_datetime(breaks = seq(as.POSIXct("2020-03-01"), as.POSIXct("2020-10-01"), by = "1 month"), date_labels = "%b", expand = expansion(mult = c(0.005,0.01))) + #set x axis to display dates by one month at a time. %b denotes letter type abbreviation of dates, the expand term formats how the x axis positions itself next to the y-axis (refer to ggplot documentation)
theme_
show(plot)
#saves the last plot shown to working directory
ggsave("pm2.5FATM_pa_emb_dailyMLR_02_15.png", width = 8, height = 7)