Environmental Data Analysis BC
ENV 3017
Statistics 4: Correlations and Linear Regressions
Linear regression estimates a linear equation that describes the relationship between two variables; correlation measures the strength of that relationship.
Simple linear regression
- regression is used to analyze the predictive relationship between a set of variables, called the predictor variables, and another variable, called the dependent variable
- in simple regression we have only two variables, usually designated X (the predictor variable) and Y (the dependent variable)
- one type of regression is linear regression, in which the value of the dependent variable is estimated by a linear combination of values of the predictor (or independent) variables
- in simple linear regression we predict the value of Y using the equation: y = a + bx
- y is the dependent variable
- x is the independent, or predictor, variable
- a and b are the coefficients: a is the intercept and b is the slope (= rise/run)
- residuals are the gaps, in the vertical direction, between the predicted line and the data points
- the idea of linear least-squares regression is to find the coefficients by minimizing the sum of the squared residuals
- sum of squared residuals = Σ (observed Yi − predicted Yi)² = Σ (Yi − (a + bXi))²
- Regression is a tool in Excel's Analysis ToolPak that calculates all the coefficients (as well as the statistics of the correlation); a Python cross-check follows below
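As a cross-check outside Excel, here is a minimal Python sketch of the same simple linear fit. It uses NumPy on a small made-up data set (all values are hypothetical) and illustrates that the least-squares line has the smallest sum of squared residuals; one perturbed slope is shown for comparison:

```python
import numpy as np

# Small made-up example data set (hypothetical values for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])

# np.polyfit(x, y, 1) returns [slope, intercept] for a degree-1 fit,
# i.e., the least-squares solution that minimizes the sum of squared residuals
b, a = np.polyfit(x, y, 1)

predicted = a + b * x
residuals = y - predicted
ssr = np.sum(residuals**2)  # sum of squared residuals

print(f"intercept a = {a:.3f}, slope b = {b:.3f}, SSR = {ssr:.3f}")

# Any other line has a larger SSR; e.g., perturbing the slope slightly:
ssr_perturbed = np.sum((y - (a + (b + 0.1) * x))**2)
print(f"SSR with perturbed slope: {ssr_perturbed:.3f} (> {ssr:.3f})")
```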
Pearson correlation coefficient
The (Pearson) correlation coefficient r is a measure of the strength of a linear relationship. It ranges from -1 to +1. Some examples:
[Figures: negative linear relationship, positive linear relationship, no correlation]
The square of r, written r² or R², is called the coefficient of determination. It measures the percentage of variation in the values of the dependent variable (in this case the mortality index) that can be explained by the independent variable.
Here are the equations for calculating the coefficients for the linear regression (with X̄ and Ȳ denoting the means of X and Y):
b = Σ (Xi − X̄)(Yi − Ȳ) / Σ (Xi − X̄)²
a = Ȳ − b·X̄
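As a quick numerical check of these formulas, here is a Python sketch (again with hypothetical data) that computes a, b, and r directly from the sums and compares r with NumPy's built-in np.corrcoef:

```python
import numpy as np

# Hypothetical data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])

xbar, ybar = x.mean(), y.mean()

# Closed-form least-squares coefficients
b = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar)**2)
a = ybar - b * xbar

# Pearson correlation coefficient r and coefficient of determination r^2
r = np.sum((x - xbar) * (y - ybar)) / np.sqrt(
    np.sum((x - xbar)**2) * np.sum((y - ybar)**2))

print(f"a = {a:.3f}, b = {b:.3f}, r = {r:.3f}, r^2 = {r**2:.3f}")
print("check against np.corrcoef:", np.corrcoef(x, y)[0, 1])
```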
Words of warning about Pearson correlation coefficients:
a) Data may be strongly related (for example, nonlinearly) yet have a linear correlation coefficient close to zero.
b) Outliers in the data set can have a large effect on the value of r and lead to erroneous conclusions. This does not mean that you should ignore outliers, but you should be aware of their effect on the correlation coefficient.
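Both warnings are easy to demonstrate numerically. In the Python sketch below (all data made up), part (a) builds a perfect quadratic relationship whose linear r is essentially zero, and part (b) shows a single outlier inflating r in otherwise unrelated noise:

```python
import numpy as np

# (a) A perfect quadratic relationship, yet the linear r is ~0
x = np.linspace(-3, 3, 25)
y = x**2
r_quadratic = np.corrcoef(x, y)[0, 1]
print(f"r for y = x^2 on symmetric x: {r_quadratic:.3f}")  # close to 0

# (b) One outlier can dominate r
rng = np.random.default_rng(0)
x2 = rng.uniform(0, 1, 20)
y2 = rng.uniform(0, 1, 20)          # unrelated noise: r should be near 0
r_before = np.corrcoef(x2, y2)[0, 1]
x2_out = np.append(x2, 10.0)        # add a single extreme point
y2_out = np.append(y2, 10.0)
r_after = np.corrcoef(x2_out, y2_out)[0, 1]
print(f"r without outlier: {r_before:.3f}, with outlier: {r_after:.3f}")
```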
Note: An alternative to the Pearson correlation coefficient is the Spearman rank correlation coefficient (SPEARMAN(ref1, ref2)), a non-parametric approach that is often used when no linear relationship can be found in the data.
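In Python, an analogous comparison could use scipy.stats (a sketch with hypothetical monotonic but nonlinear data; spearmanr works on ranks):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical monotonic but strongly nonlinear data
x = np.arange(1.0, 11.0)
y = np.exp(x / 2)

r_p, p_p = pearsonr(x, y)
r_s, p_s = spearmanr(x, y)
print(f"Pearson r = {r_p:.3f}, Spearman rho = {r_s:.3f}")
# Spearman works on ranks, so it reports a perfect monotonic
# relationship (rho = 1) even though the relationship is not linear.
```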
Significance of regression
- could the correlation have occurred by chance alone? The correlation coefficient r is not a good measure of this. If there is no correlation between two parameters, then the regression line should have a slope close to 0
- in the framework of the confidence interval, we could determine the 95% confidence interval of the slope and check whether 0 is in it or not
- alternatively, we can determine whether a significant linear relationship between X and Y exists by testing whether the slope could be equal to zero, using the hypothesis framework:
- Ho: The coefficient (slope) of the predictor variable = 0
- Ha: The coefficient is different from 0
- If this hypothesis is rejected, we can conclude that there is evidence of a simple linear relationship. We can use the t-statistic to determine whether the slope is significant or not
- t-statistic = slope / SE of the slope
- degrees of freedom: n-2
- Excel's Regression tool gives us the answers in both frameworks; a Python sketch follows below
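Here is a Python sketch of both frameworks using scipy.stats.linregress, which reports the slope, its standard error, and a two-sided p-value (the data are made up for illustration):

```python
import numpy as np
from scipy import stats

# Hypothetical data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.3, 2.8, 4.1, 4.4, 5.2, 6.3, 6.5, 7.9])

res = stats.linregress(x, y)
n = len(x)
df = n - 2  # degrees of freedom for simple linear regression

# Hypothesis-test framework: t = slope / SE(slope), compared to t(n-2)
t_stat = res.slope / res.stderr
print(f"t = {t_stat:.3f}, p = {res.pvalue:.4f}")  # reject Ho if p < 0.05

# Confidence-interval framework: 95% CI of the slope; is 0 inside it?
t_crit = stats.t.ppf(0.975, df)
ci = (res.slope - t_crit * res.stderr, res.slope + t_crit * res.stderr)
print(f"slope = {res.slope:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```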
Assumptions of the linear regression model
Regression analysis makes the following assumptions:
- there must be a linear relationship between the variables
- the residuals are normally distributed with a mean of 0
- the residuals have constant variance (the square of the standard deviation)
- the residuals are independent of each other
A significant regression is no proof that these assumptions haven't
been violated! We need to check them by plotting the residuals.
Residual analysis is a good way to decide whether a linear fit was a good choice. In a residual analysis, the differences between the true y-value and the predicted y-value from the best-fit line are plotted against the x-values of the data points. If the linear fit was a good choice, then the scatter above and below the zero line should be about the same. If this analysis shows a bias (for example, the residuals appear to follow a specific pattern as x increases), another curve might be a better choice.
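A minimal residual plot in Python (matplotlib; made-up data with slight curvature, so the bias pattern described above is visible):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data with curvature, fitted by a straight line anyway
x = np.linspace(0, 10, 30)
y = 0.3 * x**2 + np.random.default_rng(1).normal(0, 1, x.size)

b, a = np.polyfit(x, y, 1)
residuals = y - (a + b * x)

# If the linear fit were adequate, points would scatter evenly around 0;
# here the curvature shows up as a U-shaped pattern in the residuals.
plt.scatter(x, residuals)
plt.axhline(0, color="gray")
plt.xlabel("x")
plt.ylabel("residual (observed y - predicted y)")
plt.title("Residual plot")
plt.show()
```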
Example: Breast cancer as a function of mean annual temperature (Breast Cancer.xlsx, Ch 8)
Correlation matrix and scatter plot matrix
StatPlus provides tools to look at correlations of many parameters in a convenient way.
- the correlation matrix gives Pearson correlation coefficients r and their P-values
- the scatter plot matrix combines scatter plots of all variables plotted against each other in a compact way
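Outside StatPlus, pandas offers the same two views. A sketch with a made-up data frame (note that df.corr() returns only the r values; P-values would need a separate pair-by-pair test such as scipy.stats.pearsonr):

```python
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt

# Made-up data frame with a few related columns
rng = np.random.default_rng(2)
df = pd.DataFrame({"temp": rng.normal(15, 5, 50)})
df["mortality"] = 2.0 * df["temp"] + rng.normal(0, 5, 50)
df["noise"] = rng.normal(0, 1, 50)

# Correlation matrix of Pearson r values
print(df.corr())

# Scatter plot matrix: every variable plotted against every other
scatter_matrix(df, figsize=(6, 6))
plt.show()
```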
Multiple regression
We can extend our regression analysis to multiple parameters; we would fit a line with the following equation to our data:
y = b0 + b1x1 + b2x2 + b3x3 + ...
Again, we would use Excel's Regression tool to determine the coefficients.
The ANOVA table helps to choose between two hypotheses:
- Ho: The coefficients of all predictor variables = 0
- Ha: At least one of the coefficients is different from 0
The task is then to find out which variables are good predictors, which can be done in the following way (see the sketch after this list):
- eliminate the least significant predictor if it is not significant
- refit the model
- repeat the above steps until all predictors are significant
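Here is a sketch of this backward-elimination loop in Python using statsmodels' ordinary least squares (the data are made up: y depends on x1 and x2, while x3 is pure noise and should be eliminated). Each pass refits the model, since dropping a predictor changes the remaining p-values:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Made-up data: y depends on x1 and x2; x3 is an irrelevant predictor
rng = np.random.default_rng(3)
n = 100
X = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "x3": rng.normal(size=n),
})
y = 1.0 + 2.0 * X["x1"] - 1.5 * X["x2"] + rng.normal(size=n)

predictors = list(X.columns)
while predictors:
    model = sm.OLS(y, sm.add_constant(X[predictors])).fit()
    pvals = model.pvalues.drop("const")   # p-values of the predictors
    if pvals.max() <= 0.05:               # all predictors significant: stop
        break
    predictors.remove(pvals.idxmax())     # drop the least significant one

print("kept predictors:", predictors)
print(model.summary())
```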
Example: daily max values for the ozone data set
Resources:
- See chapter 8 in Berk & Carey
- Freedman, D., Pisani, R., Purves, R., and Adhikari, A. (1991) Statistics. 2nd ed., WW Norton & Company, New York, 514 pp.
- Fisher, F.E. (1973) Fundamental Statistics Concepts. Canfield Press, San Francisco, 371 pp.
- Berenson, M.L., Levine, D.M., and Rindskopf, D. (1988) Applied Statistics: A First Course. Prentice Hall, Englewood Cliffs, NJ, 557 pp.
- Prothero, W.A. (2000) Introduction to Geological Data Analysis, pp. 3-1 to 3-13.