This document will serve as an introduction to building multiple linear regression models between reference-grade data and low-cost sensor data.
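
To preview where we are headed: once the sensor and reference data have been cleaned and merged, the calibration model will be fit with R's built-in lm() function. The sketch below is purely illustrative; the data frame and the reference column name (combined_data and reference_pm25) are hypothetical placeholders, while the predictor names match the PurpleAir columns loaded later in this tutorial.

# Illustrative sketch only: regress reference PM2.5 on sensor PM2.5, temperature and humidity
# (combined_data and reference_pm25 are hypothetical placeholder names)
fit <- lm(reference_pm25 ~ pm2_5_atm + current_temp_f + current_humidity, data = combined_data)
summary(fit)  # inspect coefficients, R-squared, etc.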

For the purpose of this tutorial, we will need the packages lubridate, tidyverse (which includes the packages dplyr, stringr, readr, purrr, tibble and ggplot2), caTools and SimDesign. You can install a package by typing install.packages("package") in the R console.
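
For example, all of the required packages can be installed in one call (this only needs to be done once per machine):

# Install the required packages; tidyverse brings in dplyr, stringr, readr, purrr, tibble and ggplot2
install.packages(c("tidyverse", "lubridate", "SimDesign", "caTools"))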

Loading required libraries

library(tidyverse)
library(lubridate)
library(SimDesign)
library(caTools)

Loading and Cleaning Data

We will begin with a folder of multiple .csv files containing the PurpleAir data. We will first set our working directory to this folder in order to load the files.

Load in Data

As we have many files to load, and they are all in one folder that also contains files of other types, we must be careful about how we load the csv files. To do this we will load all of the csv files into a list, select the columns of interest, use string scraping methods to correctly format the data, and then combine all of the files into one dataset for analysis.

Here we gather the names of all of the csv files and also create a vector of strings listing the columns that I am interested in for analysis. PurpleAir monitors have two sensors, “a” and “b”. I am interested in \(PM_{2.5}\) data, so I will load data from both sensors. For this data, I am interested in the values with the “atm” correction factor. I will also load the columns that will be used in my regression analysis.

Note: There are many files, so it may take a while to load all of them.

# Set the Working Directory
setwd("C:/Users/cmcfa/OneDrive/Desktop/2020 Research/Accra/aqm")

# Get the names of all csv files in the working directory (pattern is a regular expression)
purplefile <- list.files(pattern = "\\.csv$")

#list columns of interest
impcolum <- c("UTCDateTime","current_temp_f", "current_humidity","current_dewpoint_f", "pressure","pm2_5_atm","pm2_5_atm_b")
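
Before looping over every file, it can be worth a quick sanity check that these columns actually exist in the data. The snippet below is an optional sketch, not part of the original workflow:

# Optional check: read the first file and confirm all columns of interest are present
first_file <- read_csv(purplefile[1], col_names = TRUE, quote = "")
all(impcolum %in% names(first_file))  # should return TRUE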

Next I will begin to build a data frame to combine all of these files into one data frame object. I will do this by creating an empty matrix with the number of columns that I want in my final data frame. I will convert this matrix to a data frame and then name the columns appropriately.

Note: I am making this document in R Markdown, which resets the working directory at the end of each code chunk, so it must be set again here. In an actual R script, you would not need to set the working directory repeatedly, unless you intend to change it.
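
If you would rather not repeat setwd() in every chunk, knitr also lets you set a root directory once in a setup chunk. A minimal sketch, using the same folder as below:

# In a setup chunk of the .Rmd, set the root directory used by all subsequent chunks
knitr::opts_knit$set(root.dir = "C:/Users/cmcfa/OneDrive/Desktop/2020 Research/Accra/aqm")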

# Set the Working Directory
setwd("C:/Users/cmcfa/OneDrive/Desktop/2020 Research/Accra/aqm")
# Make an empty (zero-row, seven-column) data frame to store the values we are interested in
purple <- data.frame(matrix(ncol = 7, nrow = 0))
colnames(purple) <- impcolum

# Loop through the list of csv files
for(i in purplefile){
  # Read in each file, including column names. In these files, character strings are not
  # delimited by any special character, so quote is set to an empty string "".
  file <- read_csv(i, col_names = TRUE, quote = "")
  file2 <- subset(file, select = impcolum) # select columns of interest from each file
  purple <- rbind(purple, file2)           # append the subsetted data frame to the one initialized beforehand
}

# Remove rows with missing (NA) values from the data frame
purple <- na.omit(purple)
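
As an aside, the same result can be obtained more concisely (and usually faster) with purrr, which is loaded as part of the tidyverse. This is just an alternative sketch, not the approach used in the rest of the tutorial:

# Alternative: read every csv, keep the columns of interest, and row-bind the results in one step
purple_alt <- map_dfr(purplefile, function(f) {
  read_csv(f, col_names = TRUE, quote = "") %>%
    select(all_of(impcolum))
})
purple_alt <- na.omit(purple_alt)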

Clean Data

Now that we have a single data frame with all of the data, we must make sure it is in a format suitable for analysis. For this we will use text scraping methods. The code shown below works in this particular instance, but these steps depend heavily on how your data is formatted. The goal is to find and exploit patterns in the text in your data. I find the following link very helpful: https://stringr.tidyverse.org/
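
To give a flavor of what this kind of cleaning looks like, the snippet below is a hypothetical example, assuming the UTCDateTime strings have a form like "2020/07/20T14:00:00z"; the exact patterns you need will depend on your own files:

# Hypothetical cleaning example: strip the trailing "z", replace the "T" separator,
# then parse the result into a datetime with lubridate
purple$UTCDateTime <- str_remove_all(purple$UTCDateTime, "z")
purple$UTCDateTime <- str_replace(purple$UTCDateTime, "T", " ")
purple$datetime    <- ymd_hms(purple$UTCDateTime)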

This “cheat sheet” is also very helpful: