Linear regression estimates a linear equation that describes the relationship between two variables; correlation measures the strength of that relationship.

- regression tutorial

- regression is used to analyze the predictive relationship between a set of variables, called predictor variables, and another variable, called the dependent variable.

- in simple regression, we have only two variables, usually designated as X (the predictor variable) and Y (the dependent variable).

- one type of regression is linear regression, in which the value of the dependent variable is estimated by a linear combination of values of the predictor (or independent) variables.

- in simple linear regression we predict the value of Y using the equation: Y = a + bX

- Y is the dependent variable
- X is the independent, or predictor, variable
- a and b are the coefficients: a is the intercept and b is the slope (rise/run)

- residuals are the vertical gaps between the fitted line and the data points (see figure)

- the idea of linear least-squares regression is to find the coefficients by minimizing the sum of the squared residuals
- sum of squared residuals = sum (observed Y_{i} - predicted Y_{i})^{2} = sum (Y_{i} - (a + bX_{i}))^{2}
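The least-squares fit above has a well-known closed-form solution; as a sketch in Python (with made-up numbers, not data from the course workbook), we can compute the slope, intercept, and the sum of squared residuals directly:

```python
import numpy as np

# Hypothetical data: X (predictor) and Y (dependent variable)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Closed-form least-squares estimates:
#   b (slope)     = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
#   a (intercept) = y_bar - b * x_bar
x_bar, y_bar = x.mean(), y.mean()
b = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
a = y_bar - b * x_bar

# Sum of squared residuals for the fitted line a + b*x
ssr = np.sum((y - (a + b * x)) ** 2)
print(a, b, ssr)
```

These formulas minimize the sum of squared residuals exactly; the same coefficients are what Excel's REGRESSION tool (or `np.polyfit(x, y, 1)`) would return.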

- REGRESSION is an Excel function in the Data Analysis ToolPak that calculates all the coefficients (as well as the statistics of the correlation)

(Figure: scatter plots illustrating a negative linear relationship, a positive linear relationship, and no correlation.)

The square of r, r^{2} or R^{2}, is called the coefficient of determination. It measures the percentage of variation in the values of the dependent variable (in this case the mortality index) that can be explained by the independent variable.
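A quick numpy sketch (with hypothetical numbers, loosely in the spirit of the temperature/mortality example) shows that in simple linear regression the coefficient of determination equals the squared Pearson r:

```python
import numpy as np

# Hypothetical data (e.g. temperature vs. mortality index)
x = np.array([31.8, 34.0, 40.2, 42.1, 44.2, 49.9])
y = np.array([67.3, 52.5, 68.1, 81.3, 78.9, 96.3])

# Pearson correlation coefficient r
r = np.corrcoef(x, y)[0, 1]

# Coefficient of determination: fraction of the variance in y
# explained by the least-squares line
b, a = np.polyfit(x, y, 1)          # slope, intercept
ss_res = np.sum((y - (a + b * x)) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(r ** 2, r_squared)  # the two values agree
```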

a) Data may be strongly related (for example, nonlinearly) but still have a zero linear correlation coefficient.

b) An outlier in the data set can have a large effect on the value of r and lead to erroneous conclusions. This does not mean that you should ignore outliers, but you should be aware of their effect on the correlation coefficient.
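Caveat b) is easy to demonstrate numerically; with made-up data, a single outlier can even flip the sign of r:

```python
import numpy as np

# Hypothetical, perfectly linear data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
r_clean = np.corrcoef(x, y)[0, 1]   # perfect correlation, r = 1

# Add a single outlier far below the trend
x_out = np.append(x, 6.0)
y_out = np.append(y, -10.0)
r_outlier = np.corrcoef(x_out, y_out)[0, 1]

print(r_clean, r_outlier)  # one point drags r from +1 to a negative value
```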

- could the correlation have occurred by chance alone? The correlation coefficient r is not a good measure of this. If there is no correlation between the two parameters, the regression line should have a slope close to 0.
- in the framework of the confidence interval, we could determine the 95% confidence interval of the slope and check whether 0 lies inside it or not

- alternatively, we can determine whether a significant linear relationship between X and Y exists by testing whether the slope could be equal to zero using the hypothesis framework:
  - H_{o}: The coefficient (slope) of the predictor variable = 0
  - H_{a}: The coefficient is different from 0

- If this hypothesis is rejected, we can conclude that there is evidence of a simple linear relationship. We can use the t-statistic to determine whether the slope is significant or not
- t - statistic = slope / SE of the slope
- degrees of freedom: n-2
- Excel's REGRESSION function gives us the answers in both frameworks
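Both frameworks can be sketched in Python with made-up numbers (the t critical value 3.182 for df = 3 at the 5% level is taken from a t table):

```python
import numpy as np

# Hypothetical data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

b, a = np.polyfit(x, y, 1)          # slope, intercept
resid = y - (a + b * x)

# Standard error of the slope
s = np.sqrt(np.sum(resid ** 2) / (n - 2))        # residual std. error
se_b = s / np.sqrt(np.sum((x - x.mean()) ** 2))

# Framework 1: 95% confidence interval for the slope
t_crit = 3.182                                   # df = n - 2 = 3
lo, hi = b - t_crit * se_b, b + t_crit * se_b

# Framework 2: t-test of H_o: slope = 0
t_stat = b / se_b

print((lo, hi))   # slope is significant if 0 lies outside this interval
print(t_stat)     # ...or, equivalently, if |t| > t_crit
```

The two checks always agree: 0 falls outside the 95% interval exactly when |t| exceeds the 5% critical value.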

Assumptions of linear regression:

- there must be a linear relationship between the variables

- the residuals are normally distributed with a mean of 0

- the residuals have constant variance (the square of the standard deviation)

- the residuals are independent of each other

Example: breast cancer as a function of mean annual temperature (Breast Cancer.xlsx, Ch 8)

Residual analysis is a good way to decide whether a linear fit was a good choice. In a residual analysis, the difference for each data point between the true y-value and the y-value predicted from the best-fit line is plotted against the x-value of that data point. If the linear fit was a good choice, the scatter above and below the zero line should be about the same. If this analysis shows a bias (for example, the residuals appear to follow a specific pattern as x increases, as in the figure above), another curve might be a better choice.
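The residual-analysis idea can be sketched in numpy with made-up, slightly curved data; fitting a line to curved data leaves a tell-tale pattern in the residuals:

```python
import numpy as np

# Hypothetical data with a curved (roughly quadratic) trend
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.0, 4.1, 8.9, 16.2, 24.8, 36.1])

b, a = np.polyfit(x, y, 1)   # best-fit line
resid = y - (a + b * x)

# With a good linear fit, residuals scatter evenly around zero.
# Here they form a U-shaped pattern (positive at the ends, negative
# in the middle), hinting that a curve would fit better than a line.
print(np.round(resid, 2))
```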

- the correlation matrix gives Pearson correlation coefficients r and their P-values
- the scatter plot matrix combines scatter plots of all variables plotted against each other in a compact way
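A correlation matrix is straightforward to compute with numpy (the variable names below are hypothetical; note that `np.corrcoef` returns only the r values, not their P-values):

```python
import numpy as np

# Three hypothetical variables, one array per variable
temp  = np.array([10.0, 12.0, 15.0, 19.0, 22.0])
ozone = np.array([30.0, 33.0, 40.0, 48.0, 55.0])
wind  = np.array([9.0, 8.5, 7.0, 6.0, 5.5])

# Pearson correlation matrix: entry [i, j] is r between variables i and j
r_matrix = np.corrcoef([temp, ozone, wind])
print(np.round(r_matrix, 3))
```

The diagonal is always 1 (each variable correlates perfectly with itself), and the matrix is symmetric.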

y = b_{0} + b_{1}x_{1} + b_{2}x_{2} + ... + b_{k}x_{k}

Again, we would use Excel's REGRESSION function to determine the coefficients
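Outside Excel, the same multiple-regression coefficients come from a least-squares solve; a minimal numpy sketch with made-up data (where the true coefficients are known, so we can check that they are recovered):

```python
import numpy as np

# Hypothetical data: predict y from two predictors x1 and x2
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y  = 1.0 + 2.0 * x1 + 0.5 * x2   # exact relationship, for illustration

# Design matrix with a column of ones for the intercept b0
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least-squares solution of y = X @ b
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b)  # recovers [1.0, 2.0, 0.5]
```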


The ANOVA table helps to choose between two hypotheses:

- H_{o}: The coefficients of all predictor variables = 0
- H_{a}: At least one of the coefficients is different from 0

- eliminate the least significant predictor if it is not significant
- refit the model
- repeat the above steps until all predictors are significant
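The elimination loop above can be sketched in numpy. This is a simplified backward-elimination sketch with simulated data (one real predictor, one pure-noise predictor) and a fixed t critical value instead of P-values:

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit; returns coefficients and their t-statistics."""
    n, k = X.shape
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    s2 = resid @ resid / (n - k)           # residual variance
    cov = s2 * np.linalg.inv(X.T @ X)      # covariance of the estimates
    return b, b / np.sqrt(np.diag(cov))

rng = np.random.default_rng(0)
n = 40
x1 = rng.normal(size=n)                    # truly related to y
x2 = rng.normal(size=n)                    # pure noise predictor
y = 3.0 * x1 + rng.normal(scale=0.5, size=n)

names = ["intercept", "x1", "x2"]
X = np.column_stack([np.ones(n), x1, x2])

T_CRIT = 2.03   # approx. 5% two-sided critical value for ~35 df
while X.shape[1] > 1:
    b, t = fit_ols(X, y)
    # least significant predictor (skip the intercept, column 0)
    j = 1 + np.argmin(np.abs(t[1:]))
    if abs(t[j]) >= T_CRIT:
        break                              # all remaining predictors significant
    names.pop(j)
    X = np.delete(X, j, axis=1)            # eliminate and refit

print(names)  # the noise predictor x2 is typically eliminated
```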

Example: daily max values for the ozone data set

Resources:

- See chapter 8 in Berk & Carey
- Freedman, D., Pisani, R., Purves, R., and Adhikari, A. (1991) Statistics. WW Norton & Company, New York, 2nd ed. 514pp.
- Fisher, F.E. (1973) Fundamental Statistics Concepts. Canfield Press, San Francisco, 371 pp.
- Berenson, M.L., Levine, D.M., and Rindskopf, D. (1988) Applied statistics - A first course. Prentice Hall, Englewood Cliffs, NJ, 557pp.
- Prothero, W. A. (2000) Introduction to Geological Data Analysis, pp. 3-1 to 3-13.