Environmental Data Analysis BC
ENV 3017
Statistics 4: Correlations and Linear Regressions
Linear regression estimates a linear equation that describes the relationship between two variables; correlation measures the strength of that relationship.
Simple linear regression
- regression is used to analyze the predictive relationship between a set of variables, called the predictor variables, and another variable, called the dependent variable
- in simple regression we have only two variables, usually designated X (the predictor variable) and Y (the dependent variable)
- one type of regression is linear regression, in which the value of the dependent variable is estimated by a linear combination of values of the predictor (or independent) variables
- in simple linear regression we predict the value of Y using the equation: y = a + bx
- y is the dependent variable
- x is the independent, or predictor, variable
- a and b are the coefficients: a is the intercept and b is the slope (= rise/run)
- residuals are the gaps, in the vertical direction, between the predicted line and the data points
- the idea of linear least-squares regression is to find the coefficients by minimizing the sum of the squared residuals
- sum of squared residuals = Σ (observed Yi − predicted Yi)² = Σ (Yi − (a + bXi))²
- Regression is a tool in Excel's Analysis ToolPak that calculates all the coefficients (as well as the statistics of the correlation); a Python cross-check follows below
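As a cross-check outside Excel, here is a minimal Python sketch of the same simple linear fit. It uses NumPy on a small made-up data set (all values are hypothetical) and illustrates that the least-squares line has the smallest sum of squared residuals; one perturbed slope is shown for comparison:

```python
import numpy as np

# Small made-up example data set (hypothetical values for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])

# np.polyfit(x, y, 1) returns [slope, intercept] for a degree-1 fit,
# i.e., the least-squares solution that minimizes the sum of squared residuals
b, a = np.polyfit(x, y, 1)

predicted = a + b * x
residuals = y - predicted
ssr = np.sum(residuals**2)  # sum of squared residuals

print(f"intercept a = {a:.3f}, slope b = {b:.3f}, SSR = {ssr:.3f}")

# Any other line has a larger SSR; e.g., perturbing the slope slightly:
ssr_perturbed = np.sum((y - (a + (b + 0.1) * x))**2)
print(f"SSR with perturbed slope: {ssr_perturbed:.3f} (> {ssr:.3f})")
```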
Pearson correlation coefficient
The (Pearson) correlation coefficient r is a measure of the strength of a linear relationship. It ranges from -1 to +1. Some examples:
[Figures: negative linear relationship, positive linear relationship, no correlation]
The square of r, written r² or R², is called the coefficient of determination. It measures the percentage of variation in the values of the dependent variable (in this case the mortality index) that can be explained by the independent variable.
Here are the equations for calculating the coefficients for the linear regression (with X̄ and Ȳ denoting the means of X and Y):
b = Σ (Xi − X̄)(Yi − Ȳ) / Σ (Xi − X̄)²
a = Ȳ − b·X̄
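As a quick numerical check of these formulas, here is a Python sketch (again with hypothetical data) that computes a, b, and r directly from the sums and compares r with NumPy's built-in np.corrcoef:

```python
import numpy as np

# Hypothetical data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])

xbar, ybar = x.mean(), y.mean()

# Closed-form least-squares coefficients
b = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar)**2)
a = ybar - b * xbar

# Pearson correlation coefficient r and coefficient of determination r^2
r = np.sum((x - xbar) * (y - ybar)) / np.sqrt(
    np.sum((x - xbar)**2) * np.sum((y - ybar)**2))

print(f"a = {a:.3f}, b = {b:.3f}, r = {r:.3f}, r^2 = {r**2:.3f}")
print("check against np.corrcoef:", np.corrcoef(x, y)[0, 1])
```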
Words of warning about Pearson correlation coefficients:
a) Data may be strongly related (for example, nonlinearly) yet have a linear correlation coefficient close to zero.
b) Outliers in the data set can have a large effect on the value of r and lead to erroneous conclusions. This does not mean that you should ignore outliers, but you should be aware of their effect on the correlation coefficient.
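Both warnings are easy to demonstrate numerically. In the Python sketch below (all data made up), part (a) builds a perfect quadratic relationship whose linear r is essentially zero, and part (b) shows a single outlier inflating r in otherwise unrelated noise:

```python
import numpy as np

# (a) A perfect quadratic relationship, yet the linear r is ~0
x = np.linspace(-3, 3, 25)
y = x**2
r_quadratic = np.corrcoef(x, y)[0, 1]
print(f"r for y = x^2 on symmetric x: {r_quadratic:.3f}")  # close to 0

# (b) One outlier can dominate r
rng = np.random.default_rng(0)
x2 = rng.uniform(0, 1, 20)
y2 = rng.uniform(0, 1, 20)          # unrelated noise: r should be near 0
r_before = np.corrcoef(x2, y2)[0, 1]
x2_out = np.append(x2, 10.0)        # add a single extreme point
y2_out = np.append(y2, 10.0)
r_after = np.corrcoef(x2_out, y2_out)[0, 1]
print(f"r without outlier: {r_before:.3f}, with outlier: {r_after:.3f}")
```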
Note: An alternative to the Pearson correlation coefficient is the Spearman rank correlation coefficient (SPEARMAN(ref1, ref2)), a non-parametric approach that is often used when no linear relationship can be found in the data.
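In Python, an analogous comparison could use scipy.stats (a sketch with hypothetical monotonic but nonlinear data; spearmanr works on ranks):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical monotonic but strongly nonlinear data
x = np.arange(1.0, 11.0)
y = np.exp(x / 2)

r_p, p_p = pearsonr(x, y)
r_s, p_s = spearmanr(x, y)
print(f"Pearson r = {r_p:.3f}, Spearman rho = {r_s:.3f}")
# Spearman works on ranks, so it reports a perfect monotonic
# relationship (rho = 1) even though the relationship is not linear.
```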
Significance of regression
- could the correlation have occurred by chance alone? The correlation coefficient r is not a good measure of this. If there is no correlation between two parameters, then the regression line should have a slope close to 0
- in the framework of the confidence interval, we could determine the 95% confidence interval of the slope and check whether 0 is in it or not
- alternatively, we can determine whether a significant linear relationship between X and Y exists by testing whether the slope could be equal to zero, using the hypothesis framework:
- Ho: The coefficient (slope) of the predictor variable = 0
- Ha: The coefficient is different from 0
- If this hypothesis is rejected, we can conclude that there is evidence of a simple linear relationship. We can use the t-statistic to determine whether the slope is significant or not
- t-statistic = slope / SE of the slope
- degrees of freedom: n-2
- Excel's Regression tool gives us the answers in both frameworks; a Python sketch follows below
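Here is a Python sketch of both frameworks using scipy.stats.linregress, which reports the slope, its standard error, and a two-sided p-value (the data are made up for illustration):

```python
import numpy as np
from scipy import stats

# Hypothetical data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.3, 2.8, 4.1, 4.4, 5.2, 6.3, 6.5, 7.9])

res = stats.linregress(x, y)
n = len(x)
df = n - 2  # degrees of freedom for simple linear regression

# Hypothesis-test framework: t = slope / SE(slope), compared to t(n-2)
t_stat = res.slope / res.stderr
print(f"t = {t_stat:.3f}, p = {res.pvalue:.4f}")  # reject Ho if p < 0.05

# Confidence-interval framework: 95% CI of the slope; is 0 inside it?
t_crit = stats.t.ppf(0.975, df)
ci = (res.slope - t_crit * res.stderr, res.slope + t_crit * res.stderr)
print(f"slope = {res.slope:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```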
Assumptions of the linear regression model
Regression analysis makes the following assumptions:
- there must be a linear relationship between the variables
- the residuals are normally distributed with a mean of 0
- the residuals have constant variance (the square of the standard deviation)
- the residuals are independent of each other
A significant regression is no proof that these assumptions haven't
been violated! We need to check them by plotting the residuals.
Residual analysis is a good way to decide whether a linear fit was a good choice. In a residual analysis, the differences between the true y-value and the predicted y-value from the best-fit line are plotted against the x-values of the data points. If the linear fit was a good choice, then the scatter above and below the zero line should be about the same. If this analysis shows a bias (for example, the residuals appear to follow a specific pattern as x increases), another curve might be a better choice.
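A minimal residual plot in Python (matplotlib; made-up data with slight curvature, so the bias pattern described above is visible):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data with curvature, fitted by a straight line anyway
x = np.linspace(0, 10, 30)
y = 0.3 * x**2 + np.random.default_rng(1).normal(0, 1, x.size)

b, a = np.polyfit(x, y, 1)
residuals = y - (a + b * x)

# If the linear fit were adequate, points would scatter evenly around 0;
# here the curvature shows up as a U-shaped pattern in the residuals.
plt.scatter(x, residuals)
plt.axhline(0, color="gray")
plt.xlabel("x")
plt.ylabel("residual (observed y - predicted y)")
plt.title("Residual plot")
plt.show()
```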
Example: Breast cancer as a function of mean annual temperature (Breast Cancer.xlsx, Ch 8)
Correlation matrix and scatter plot matrix
StatPlus provides tools to look at correlations of many parameters in a convenient way.
- the correlation matrix gives Pearson correlation coefficients r and their P-values
- the scatter plot matrix combines scatter plots of all variables plotted against each other in a compact way
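Outside StatPlus, pandas offers the same two views. A sketch with a made-up data frame (note that df.corr() returns only the r values; P-values would need a separate pair-by-pair test such as scipy.stats.pearsonr):

```python
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt

# Made-up data frame with a few related columns
rng = np.random.default_rng(2)
df = pd.DataFrame({"temp": rng.normal(15, 5, 50)})
df["mortality"] = 2.0 * df["temp"] + rng.normal(0, 5, 50)
df["noise"] = rng.normal(0, 1, 50)

# Correlation matrix of Pearson r values
print(df.corr())

# Scatter plot matrix: every variable plotted against every other
scatter_matrix(df, figsize=(6, 6))
plt.show()
```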
Multiple regression
We can extend our regression analysis to multiple parameters; we would fit a line with the following equation to our data:
y = b0 + b1x1 + b2x2 + b3x3 + ...
Again, we would use Excel's Regression tool to determine the coefficients.
The ANOVA table helps to choose between two hypotheses:
- Ho: The coefficients of all predictor variables = 0
- Ha: At least one of the coefficients is different from 0
The task is then to find out which variables are good predictors, which can be done in the following way (see the sketch after this list):
- eliminate the least significant predictor if it is not significant
- refit the model
- repeat the above steps until all predictors are significant
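Here is a sketch of this backward-elimination loop in Python using statsmodels' ordinary least squares (the data are made up: y depends on x1 and x2, while x3 is pure noise and should be eliminated). Each pass refits the model, since dropping a predictor changes the remaining p-values:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Made-up data: y depends on x1 and x2; x3 is an irrelevant predictor
rng = np.random.default_rng(3)
n = 100
X = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "x3": rng.normal(size=n),
})
y = 1.0 + 2.0 * X["x1"] - 1.5 * X["x2"] + rng.normal(size=n)

predictors = list(X.columns)
while predictors:
    model = sm.OLS(y, sm.add_constant(X[predictors])).fit()
    pvals = model.pvalues.drop("const")   # p-values of the predictors
    if pvals.max() <= 0.05:               # all predictors significant: stop
        break
    predictors.remove(pvals.idxmax())     # drop the least significant one

print("kept predictors:", predictors)
print(model.summary())
```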
Example: daily max values for the ozone data set
Resources:
- See chapter 8 in Berk & Carey
- Freedman, D., Pisani, R., Purves, R., and Adhikari, A. (1991) Statistics. 2nd ed., WW Norton & Company, New York, 514 pp.
- Fisher, F.E. (1973) Fundamental Statistics Concepts. Canfield Press, San Francisco, 371 pp.
- Berenson, M.L., Levine, D.M., and Rindskopf, D. (1988) Applied Statistics: A First Course. Prentice Hall, Englewood Cliffs, NJ, 557 pp.
- Prothero, W.A. (2000) Introduction to Geological Data Analysis, pp. 3-1 to 3-13.