Environmental Data Analysis BC ENV 3017


Statistics 4: Correlations and Linear Regressions

Linear regression estimates a linear equation that describes the relationship between two variables; correlation measures the strength of that relationship.

Simple linear regression


graph

Pearson correlation coefficient

The (Pearson) correlation coefficient r is a measure of the strength of a linear relationship. It ranges from -1 to +1. Some examples:

Negative Linear Relationship

graph

Positive linear relationship

graph

No correlation

graph

The square of r, written r² or R², is called the coefficient of determination. It measures the proportion of the variation in the values of the dependent variable (in this case the mortality index) that can be explained by the independent variable.
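As a minimal sketch of how r and r² are obtained from paired data, the following Python snippet computes both from their definitions. The function name `pearson_r` and the five data points are made up for illustration; they are not the course data set.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of products and squares of the deviations from the means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Made-up illustrative values, not the course data set.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r = pearson_r(x, y)
r_squared = r ** 2  # coefficient of determination
print(f"r = {r:.3f}, r^2 = {r_squared:.3f}")
```

Reversing the sign of every y-value flips the sign of r but leaves r² unchanged, which is why r² alone does not tell you the direction of the relationship.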

Here are the equations for calculating the coefficients for the linear regression:
regression equations
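In their standard least-squares form, for a fitted line y = b_0 + b_1 x through n data points (x_i, y_i), the slope and intercept can be written as follows. The notation is an assumption and may differ from the one used in the course figure.

```latex
\begin{aligned}
b_1 &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
            {\sum_{i=1}^{n} (x_i - \bar{x})^2}, \\[4pt]
b_0 &= \bar{y} - b_1 \bar{x},
\end{aligned}
\qquad \text{where } \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\quad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i .
```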
Words of warning about Pearson correlation coefficients:

a) Data may be strongly related but still have a linear correlation coefficient of zero, for example when the relationship is nonlinear.

graph

b) Outliers in the data set can have a large effect on the value of r and lead to erroneous conclusions. This does not mean that you should ignore outliers, but you should be aware of their effect on the correlation coefficient.

graph
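Both warnings can be reproduced with a few lines of Python. This is only a sketch on made-up numbers: case (a) uses points that lie exactly on a parabola, and case (b) adds a single extreme point to otherwise weakly related data. The helper `pearson_r` and all variable names are illustrative.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# (a) y depends exactly on x (y = x^2), yet the linear correlation vanishes
#     because the x-values are symmetric about zero.
x_curve = [-3, -2, -1, 0, 1, 2, 3]
y_curve = [v ** 2 for v in x_curve]
r_curve = pearson_r(x_curve, y_curve)

# (b) Weakly related made-up data, with and without one extreme point (20, 20).
x_base = [1, 2, 3, 4, 5, 6]
y_base = [3, 1, 4, 1, 5, 2]
r_without = pearson_r(x_base, y_base)
r_with = pearson_r(x_base + [20], y_base + [20])

print(f"(a) parabola:        r = {r_curve:.3f}")
print(f"(b) without outlier: r = {r_without:.3f}")
print(f"(b) with outlier:    r = {r_with:.3f}")
```

In case (b) a single point is enough to turn a weak correlation into an apparently strong one, so it is worth plotting the data before trusting r.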

Note: An alternative to the Pearson correlation coefficient is the Spearman rank correlation coefficient (SPEARMAN( ref1,  ref2, )), which is a non-parametric approach and is often used when no linear relationship can be found in the data.
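The Spearman coefficient is the Pearson coefficient computed on the ranks of the data rather than on the values themselves. The sketch below illustrates this in Python instead of with the spreadsheet function; `average_ranks`, `spearman_rho`, and the exponential test data are illustrative choices, not part of the course material.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to the end of the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i..j, made 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson r applied to the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Made-up monotonic but strongly nonlinear data: y = exp(x).
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [math.exp(v) for v in x]

print(f"Pearson  r   = {pearson_r(x, y):.3f}")
print(f"Spearman rho = {spearman_rho(x, y):.3f}")
```

Because y increases whenever x increases, the ranks agree perfectly and the Spearman coefficient reaches its maximum, while the Pearson coefficient stays below 1 because the relationship is not a straight line.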

Significance of regression

Assumptions of the linear regression model

Regression analysis makes the following assumptions:
A significant regression is no proof that these assumptions haven't been violated! We need to check them by plotting the residuals.
graph

 
Residual analysis is a good way to decide whether a linear fit was a good choice. For each data point, the residual is the difference between the observed y-value and the y-value predicted by the best-fit line; the residuals are then plotted against the x-values of the data points. If the linear fit was a good choice, the residuals should scatter about equally above and below the zero line. If the plot shows a bias (for example, the residuals follow a specific pattern as x increases, as in the figure above), another curve might be a better choice.
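A minimal Python sketch of this check, assuming a least-squares line fitted with the usual closed-form expressions: the data are made up (points on a parabola) so that the residual pattern is easy to see, and `fit_line` is an illustrative helper name.

```python
def fit_line(x, y):
    """Least-squares intercept b0 and slope b1 for y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Made-up curved data (y = x^2) to which a straight line is fitted anyway.
x = [0, 1, 2, 3, 4, 5, 6]
y = [v ** 2 for v in x]

b0, b1 = fit_line(x, y)

# Residual = observed y minus the y predicted by the fitted line.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

for xi, res in zip(x, residuals):
    print(f"x = {xi}: residual = {res:+.2f}")
```

The residuals are positive at both ends and negative in the middle. A systematic pattern like this, rather than random scatter around zero, suggests that a straight line is not an adequate model for these data.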
Example: Breast cancer as a function of mean annual temperature (Breast Cancer.xlsx, Ch 8)

Correlation matrix and scatter plot matrix

StatPlus provides convenient tools for examining the correlations among many parameters at once.
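To show what a correlation matrix contains, here is a small Python sketch that computes the pairwise Pearson coefficients for three variables. It is not the StatPlus tool; the column names and values are hypothetical and were invented for this example.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical columns with made-up values, one entry per observation.
columns = {
    "temperature": [12.0, 15.0, 19.0, 23.0, 27.0, 30.0],
    "ozone": [20.0, 26.0, 31.0, 45.0, 52.0, 61.0],
    "wind": [9.0, 7.5, 8.0, 5.0, 4.5, 3.0],
}
names = list(columns)

# corr[a][b] holds the correlation between column a and column b.
corr = {a: {b: pearson_r(columns[a], columns[b]) for b in names} for a in names}

# Print the matrix with row and column labels.
print(" " * 12 + "".join(f"{name:>12}" for name in names))
for a in names:
    print(f"{a:<12}" + "".join(f"{corr[a][b]:>12.3f}" for b in names))
```

The matrix is symmetric with ones on the diagonal, so only the entries above (or below) the diagonal carry information.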

Multiple regression

We can extend our regression analysis to several independent variables. In that case we fit an equation of the following form to our data:

y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + ...
Again, we would use Excel's regression function to determine the coefficients.
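For readers who want to see what such a fit does, the sketch below estimates the coefficients in Python by solving the least-squares normal equations. This is a swapped-in illustration, not Excel's implementation; the function names and the two made-up predictors are assumptions, and the y-values are generated from known coefficients so the result can be checked.

```python
def solve_linear_system(a, b):
    """Solve a * coef = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Swap in the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    coef = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (m[r][n] - tail) / m[r][r]
    return coef


def fit_multiple_regression(predictors, y):
    """Least-squares coefficients [b0, b1, b2, ...] via the normal equations."""
    # Each row is [1, x1, x2, ...]; the leading 1 gives the intercept b0.
    rows = [[1.0] + list(values) for values in zip(*predictors)]
    k = len(rows[0])
    xtx = [[sum(row[i] * row[j] for row in rows) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Made-up predictors; y is built from known coefficients (1, 2, -0.5).
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y = [1.0 + 2.0 * a - 0.5 * b for a, b in zip(x1, x2)]

b0, b1, b2 = fit_multiple_regression([x1, x2], y)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, b2 = {b2:.3f}")
```

Because the made-up y-values contain no noise, the fit recovers the generating coefficients up to rounding error. With real data the coefficients are estimates, and their significance has to be judged from the regression output.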

The ANOVA table helps to choose between two hypotheses:
The task is then to find out which variables are good predictors, which can be done in the following way:

Example: daily max values for the ozone data set


Resources: