Environmental Data Analysis BC
ENV 3017
Statistics 2b
Confidence Intervals
- Review of the Central Limit Theorem as discussed last time
- We learned that a sample average will approximately follow a normal distribution with mean µ and standard deviation σ/√n.
- We also know that in a normal distribution, about 95% of the time values fall within about 2 standard deviations of the mean.
- That means that we can construct a confidence interval ranging from −2·SE to +2·SE around the sample average. The term '95% confident' means that we are confident that our procedure will 'capture' the value of µ in 95% of the times that it is used.
- The factor of 2 is just approximate and only valid if our sample size is large (n > 25).
- For large samples, we can calculate this factor (called the z-value) more precisely by using the Excel NORMINV function; it is 1.96. This factor will be different when we look at small samples.
- Demonstration of the Excel help feature with new functions.
- The z-value is the point on a normal curve such that the probability of a value being < z is equal to p.
- After defining alpha as the probability of being outside the interval, we can define the exact confidence interval as:
  sample average ± z-value · SE, where z-value = NORMINV(1 − alpha/2, 0, 1)
- Explore the confidence interval tutorial
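The exact z-based interval can be sketched in Python instead of Excel; this is a minimal sketch, and the sample average, σ, and n in the example are made up:

```python
import math
from statistics import NormalDist

def z_confidence_interval(xbar, sigma, n, alpha=0.05):
    """Exact confidence interval: xbar ± z * sigma/sqrt(n),
    where z is the (1 - alpha/2) quantile of the standard normal."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    se = sigma / math.sqrt(n)                # standard error of the mean
    return (xbar - z * se, xbar + z * se)

# Hypothetical example: sample average 125, known sigma = 10, n = 100
low, high = z_confidence_interval(125, 10, 100)
print(round(low, 2), round(high, 2))  # 123.04 126.96
```

`NormalDist().inv_cdf` plays the role of Excel's NORMINV here; with alpha = 0.05 it returns the familiar 1.96.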
t-distribution and confidence intervals
So far we have assumed that we know the standard deviation σ and that we are dealing with large n (n > 25). When we substitute the observed standard deviation SD for σ, we cannot use the normal distribution anymore, but need to use the Student t-distribution. The t-distribution has one more parameter, the degrees of freedom (sample size n − 1).
- Explore the t-distribution in the distribution tutorial
The factor of '2' we have used so far has to be replaced by a 't-value' derived from the t-distribution. We can then calculate the t confidence interval as:
confidence interval: sample average ± t-value * SE
The t-value can be calculated using the Excel function TINV(alpha, n-1). For a 95% confidence interval and n = 5 observations in a sample, for example, it would be TINV(0.05, 4) = 2.776. Again, we would then state that with 95% confidence we think that our procedure captures the true mean in the confidence interval.
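The t interval can be sketched the same way; to stay self-contained, the sketch reuses the t-value 2.776 from TINV(0.05, 4) quoted above, and the five sample values are made up:

```python
import math
from statistics import mean, stdev

def t_confidence_interval(sample, t_value):
    """t confidence interval: sample average ± t_value * SD/sqrt(n)."""
    n = len(sample)
    se = stdev(sample) / math.sqrt(n)  # SE from the observed SD
    m = mean(sample)
    return (m - t_value * se, m + t_value * se)

# Hypothetical n = 5 sample; t-value = TINV(0.05, 4) = 2.776
sample = [120, 128, 124, 131, 122]
low, high = t_confidence_interval(sample, 2.776)
print(round(low, 3), round(high, 3))  # 119.448 130.552
```

Note that the interval is noticeably wider than the z-based one with the same SE, because t = 2.776 rather than 1.96 for so few degrees of freedom.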
Experimental errors
All experiments are characterized by an experimental error. There are two kinds of errors:
- random errors or statistical errors
  - replicate measurements typically yield slightly different results, caused by variability within the natural process and the instrument
  - results in a variation of the measurements around the 'true' (or 'exact') value
  - the precision of an experiment refers to this kind of error
  - this is what we have been talking about so far
- systematic errors
  - caused by a not properly calibrated instrument, drifts, etc.
  - results in a systematic difference between measured and 'true' value
  - the accuracy of an experiment refers to this kind of error
In summary:
individual measurement = exact value + bias + chance error
Each measurement result should be given with its error. However, it is often very difficult to quantify the systematic error, and in most cases the given error is the statistical error only. This error only states how precise an experiment was, and not how accurate it was.
Error reporting
Errors are reported as absolute or relative errors, for example:
ozone concentration at West Point, 8/3/1993, 14:00:
(125 ± 5) ppm or 125 ppm ± 4%
The error can be a standard deviation, a SE (or 2*SE), or, e.g., a 95% confidence interval. You'll need to state what your error bars reflect!
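Converting between the two forms is simple arithmetic; a minimal sketch using the West Point numbers above:

```python
# Absolute error (125 ± 5) ppm, converted to a relative error
value, abs_err = 125, 5
rel_err = abs_err / value            # relative error as a fraction
print(f"{value} ppm ± {rel_err:.0%}")  # 125 ppm ± 4%
```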
Error Propagation
In many cases you need to calculate a value based on your measurement results using a formula. What is the SE of the derived number? For formulas that include only simple mathematical operations, the propagation of errors is relatively simple. The following rules approximate the error of the derived number:
addition, subtraction => absolute error = square root of the sum of the (absolute errors)²
multiplication, division => relative error = square root of the sum of the (relative errors)²
You can determine the error (Δf) of a more complex function f(x, y, z, ...) by using the partial derivatives of the function and the errors of the individual variables (Δx, Δy, Δz, ...):
Δf = sqrt( (∂f/∂x · Δx)² + (∂f/∂y · Δy)² + (∂f/∂z · Δz)² + ... )
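The two quadrature rules above can be sketched in Python; the error values in the example are made up:

```python
import math

def propagate_add(abs_errors):
    """Addition/subtraction: combine absolute errors in quadrature."""
    return math.sqrt(sum(e ** 2 for e in abs_errors))

def propagate_mul(rel_errors):
    """Multiplication/division: combine relative errors in quadrature."""
    return math.sqrt(sum(r ** 2 for r in rel_errors))

# Hypothetical example: f = x + y with absolute errors 3 and 4
df = propagate_add([3, 4])
print(df)  # 5.0
# f = x * y with relative errors 2% and 3%
rf = propagate_mul([0.02, 0.03])
print(round(rf, 4))  # 0.0361, i.e. about 3.6%
```

Note that the combined error is always less than the simple sum of the individual errors, because the errors partially cancel.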

Resources:
- Freedman, D., Pisani, R., Purves, R., and Adhikari, A. (1991) Statistics. WW Norton & Company, New York, 2nd ed., 514pp.
- Fisher, F.E. (1973) Fundamental Statistics Concepts. Canfield Press, San Francisco, 371pp.
- Berenson, M.L., Levine, D.M., and Rindskopf, D. (1988) Applied statistics - A first course. Prentice Hall, Englewood Cliffs, NJ, 557pp.
- Lyons, L. (1991) A practical guide to data analysis for physical science students. Cambridge University Press, Cambridge, UK, 95p. (Barnard QC 33.L9 1991 c.1)
- Hartwig, F., and Dearing, B.E. (1979) Exploratory data analysis. Sage University Paper series on Quantitative Applications in the Social Sciences, 16, Sage Publications, Beverly Hills and London, 83p. (Barnard HA 29.H257 c.1)
- Jaffe, A.J., and Spirer, H.F. (1987) Misused statistics. Popular statistics, 5. Marcel Dekker, Inc., New York, 236p. (Barnard HA 29.J29 1987)
- Knoke, D., and Bohrnstedt, G.W. (1991) Basic social statistics. F.E. Peacock Publishers, Inc., Itasca, Illinois, 363p. (Barnard HA 29.K735 1991 c.2)
- Levin, J., and Fox, A.J. (1994) Elementary statistics in social research. Harper Collins College Publishers, New York, 508p. (Barnard HA 29.L388 1994 c.1)
- Welkowitz, J., Ewen, R.B., and Cohen, J. (1991) Introductory statistics in the behavioral sciences. Harcourt Brace & Company, Orlando, Florida, 391p.