STATISTICS

STATISTICS 1

The histogram

The histogram is a graphical presentation of a list of data. To make the general shape of a histogram independent of how the bin sizes are selected, the Y-axis should be normalized to reflect ‘% per bin size’. With this ‘density scale’ on the vertical axis, the areas of the blocks come out in %, because the units on the horizontal axis cancel. The area under the histogram over an interval equals the percentage of cases in that interval. The total area under the histogram is 100%.
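The density scale can be sketched in a few lines of Python. This is only an illustration: the income values and bin edges are made up, and the helper name density_histogram is hypothetical.

```python
# Minimal sketch of a density-scale histogram.
# The incomes and bin edges below are made-up illustration values.

def density_histogram(data, edges):
    """Return block heights in '% per unit of x' for bins [edges[i], edges[i+1])."""
    n = len(data)
    heights = []
    for left, right in zip(edges[:-1], edges[1:]):
        count = sum(1 for x in data if left <= x < right)
        percent = 100.0 * count / n               # % of cases in this bin
        heights.append(percent / (right - left))  # divide by the bin width
    return heights

incomes = [5, 12, 18, 22, 27, 31, 36, 44, 58, 90]  # hypothetical, in $1000
edges = [0, 20, 40, 100]                           # bins of unequal width

heights = density_histogram(incomes, edges)

# Area of each block = height x width; the areas are percentages again.
areas = [h * (right - left) for h, left, right in zip(heights, edges[:-1], edges[1:])]
total_area = sum(areas)

print("heights (% per $1000):", heights)
print("areas (%):", areas, "total:", total_area)
```

Because each height is divided by its bin width, the wide last bin gets a low block even though it holds as many cases as the first bin, and the block areas still add up to 100%.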

See the example with the distribution of family income in the United States (fig).

Average, Median and Standard Deviation

A typical list of numbers can be summarized by its average and standard deviation (SD).

Average of a list = sum of entries/number of entries.

The average locates the center of a histogram, in the sense that the histogram balances when supported at the average (fig).

Half the area under a histogram lies to the left of the median, and half to the right. The median is another way to locate the center of a histogram (fig).

The SD measures distance from the average. Each number on a list is off the average by some amount. The SD is a sort of average size for these amounts off. It is a measure of the variability of the data in the list.

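SD = square root of (sum of (entry - average)² / n), where n is the number of entries.

For small lists (fewer than 25 entries or so), n has to be replaced by n-1; that is the formula usually used in calculators and in EXCEL. The square of the SD (SD²) is called the variance.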

Roughly 68% of the entries on a list of numbers are within one SD of the average, and about 95% are within two SDs of the average. This is so for many lists, but not all. It holds strictly only for data that are normally distributed (see below).
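As a rough illustration, the summary numbers above can be computed by hand in Python. The list is made up, and the SD is taken as the root-mean-square deviation from the average, with n or n-1 in the denominator.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up list for illustration

n = len(data)
average = sum(data) / n
median = statistics.median(data)

squared_deviations = sum((x - average) ** 2 for x in data)
sd_n = math.sqrt(squared_deviations / n)               # n in the denominator
sd_n_minus_1 = math.sqrt(squared_deviations / (n - 1)) # n-1, for small lists

def fraction_within(values, center, sd, k):
    """Fraction of entries within k SDs of the center."""
    return sum(1 for x in values if abs(x - center) <= k * sd) / len(values)

within_one = fraction_within(data, average, sd_n, 1)
within_two = fraction_within(data, average, sd_n, 2)

print(average, median, sd_n, sd_n_minus_1)
print(within_one, within_two)
```

For this short list 75% of the entries are within one SD and all of them within two SDs, which shows that the 68% and 95% figures are only rough guides for data that do not follow the normal curve.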

Relevant EXCEL functions:

HISTOGRAM, AVERAGE, MEDIAN, STDEV

STATISTICS 2

Errors

All experiments are characterized by an experimental error.

There are two kinds of errors:

random errors or statistical errors

replicate measurements typically yield slightly different results, caused by variability within the natural process and the instrument
results in a variation of the measurements around the ‘true’ (or ‘exact’) value
precision of an experiment refers to this kind of error

systematic errors

caused by an improperly calibrated instrument, drifts, etc.
results in a systematic difference between measured and ‘true’ value
accuracy of an experiment refers to this kind of error

In summary:

individual measurement = exact value + bias + chance error

Each measurement result should be given with its error. However, it is often very difficult to quantify the systematic error, and in most cases the given error is the statistical error only. This error only states how precise an experiment was and not how accurate it was.
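The decomposition of a measurement into exact value, bias and chance error can be illustrated with a small simulation. All numbers here are hypothetical, and the chance error is assumed to follow a normal curve.

```python
import random
import statistics

random.seed(1)       # fixed seed so the sketch is repeatable

exact_value = 70.0   # hypothetical 'true' value
bias = 3.0           # hypothetical systematic error (e.g. poor calibration)
chance_sd = 2.0      # hypothetical spread of the random error

# individual measurement = exact value + bias + chance error
measurements = [exact_value + bias + random.gauss(0, chance_sd)
                for _ in range(10000)]

offset = statistics.mean(measurements) - exact_value  # estimates the bias
spread = statistics.stdev(measurements)               # estimates the chance error

print("offset of the average from the exact value:", round(offset, 2))
print("spread of the measurements:", round(spread, 2))
```

Averaging many measurements shrinks the chance error, but the offset stays close to the bias. Repetition improves precision, not accuracy.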

Standard Error

A repetition of an experiment will give variable results. Imagine, for example, repeating an ozone concentration measurement many times. What is the error of your average ozone concentration? The more measurements you make, the more precise the average should be.

The standard error of the average (SE) is:

SE = SD / √n, where n is the number of measurements.

The SE is the error reported for the precision of an experiment. Errors are reported as absolute or relative errors, for example:

ozone concentration at West Point, 8/3/1993, 14:00:

(125 ± 5) ppm or (125 ± 4%) ppm
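A minimal sketch of the calculation, assuming the usual definition SE = SD / √n with the n-1 form of the SD. The readings are the five ozone values from the calibration example in the t-statistics part of these notes.

```python
import math
import statistics

readings = [78, 83, 68, 72, 88]   # ozone readings from the calibration example

n = len(readings)
average = statistics.mean(readings)
sd = statistics.stdev(readings)    # n-1 in the denominator (small list)
se = sd / math.sqrt(n)             # standard error of the average
relative_se = 100 * se / average   # relative error in percent

print(f"absolute: ({average:.1f} ± {se:.1f})")
print(f"relative: ({average:.1f} ± {relative_se:.1f}%)")
```

The absolute error carries the units of the measurement. The relative error is the same quantity expressed as a percentage of the average.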

Error Propagation

In many cases you need to calculate a value based on your measurement results using a formula. What is the SE of the derived number? For formulas that include only simple mathematical operations, the propagation of errors is relatively simple. The following rule approximates the error of the derived number:

addition, subtraction -> add the absolute errors

multiplication, division -> add the relative errors
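The two rules can be written as small helper functions. This is a sketch only: the function names and the example values are hypothetical.

```python
def add_or_subtract(a, a_err, b, b_err, subtract=False):
    """Sum or difference of two values; the absolute errors are added."""
    value = a - b if subtract else a + b
    return value, a_err + b_err

def multiply_or_divide(a, a_err, b, b_err, divide=False):
    """Product or quotient of two values; the relative errors are added.

    Errors are passed in and returned as absolute errors.
    """
    value = a / b if divide else a * b
    relative_error = a_err / abs(a) + b_err / abs(b)
    return value, abs(value) * relative_error

# Hypothetical measurements: (10.0 ± 0.5) and (20.0 ± 1.0), i.e. 5% each
total, total_err = add_or_subtract(10.0, 0.5, 20.0, 1.0)
product, product_err = multiply_or_divide(10.0, 0.5, 20.0, 1.0)

print(total, "±", total_err)
print(product, "±", product_err)
```

With these inputs the sum is 30 ± 1.5, and the product is 200 with a relative error of 10%, that is 200 ± 20.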

In case of more complicated formulas things get a bit more difficult, because you need to calculate derivatives. This is beyond the scope of this class.

The Normal Approximation for Data

The normal curve is described by the following formula:

y = 1/(SD × √(2π)) × exp(-(x - average)² / (2 × SD²))

The normal curve is symmetric about the average, and the total area under it is 100% (fig).

Standard units say how many SDs a value is above (+) or below (-) the average.

Many histograms have roughly the same shape as the normal curve.

If a list of numbers follows the normal curve, the percentage of entries falling in a given interval can be estimated by converting the interval to standard units and then finding the corresponding area under the normal curve. As already mentioned above, about 68% of the entries on a list of numbers are within one SD of the average, and about 95% are within two SDs of the average.
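The procedure (convert to standard units, then read off the area) can be sketched with Python's standard library. The helper name percent_between and the example list (average 63, SD 3) are made up for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def percent_between(low, high, average, sd):
    """Estimated % of entries in [low, high] for a list following the normal curve."""
    z_low = (low - average) / sd     # interval ends in standard units
    z_high = (high - average) / sd
    return 100 * (standard_normal.cdf(z_high) - standard_normal.cdf(z_low))

within_one_sd = percent_between(-1, 1, average=0, sd=1)
within_two_sd = percent_between(-2, 2, average=0, sd=1)

# Hypothetical list with average 63 and SD 3: what % lies between 60 and 69?
example = percent_between(60, 69, average=63, sd=3)

print(round(within_one_sd, 1), round(within_two_sd, 1), round(example, 1))
```

In the last call the interval runs from 1 SD below to 2 SDs above the average, so the answer is the area under the normal curve between -1 and +2 in standard units.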

Skewness
Skewness characterizes the degree of asymmetry of a distribution around its mean. Positive skewness indicates a distribution with an asymmetric tail extending towards more positive values. Negative skewness indicates a distribution with an asymmetric tail extending towards more negative values (fig).

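The skewness of a symmetric distribution is 0. For many data sets it lies between -1 and +1, although larger values are possible for strongly asymmetric data.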
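A sketch of one common sample skewness formula follows. To my understanding it is the form that EXCEL's SKEW function uses, but treat that as an assumption. The two lists are made up.

```python
import math

def skew(values):
    """Sample skewness: n/((n-1)(n-2)) * sum of cubed deviations in SD units.

    The SD here uses n-1 in the denominator. Needs at least 3 values.
    """
    n = len(values)
    avg = sum(values) / n
    s = math.sqrt(sum((v - avg) ** 2 for v in values) / (n - 1))
    return n / ((n - 1) * (n - 2)) * sum(((v - avg) / s) ** 3 for v in values)

symmetric = [1, 2, 3, 4, 5]       # made-up symmetric list
right_tailed = [1, 2, 2, 3, 10]   # made-up list with a long tail to the right
left_tailed = [-v for v in right_tailed]

print(skew(symmetric), skew(right_tailed), skew(left_tailed))
```

The symmetric list gives a skewness of zero (up to rounding), the list with the long right tail gives a positive value, and its mirror image gives the same value with a negative sign.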

Relevant EXCEL functions:

NORMDIST, NORMINV, SKEW

STATISTICS 3

t-statistics

Assume you want to find out if a list of data is statistically different from a certain value. In this case you need to perform a significance test.

Let us assume you want to calibrate the ozone detector. Your standard has an ozone concentration of 70 ppb. Your measurements were: 78, 83, 68, 72, 88; average: 77.80, SD = 8.07. Is this significantly different from 70? In order to perform this test, we need to formulate the problem as a hypothesis:

Null-hypothesis (Ho): There is no difference between the expected value and the observed value. Observed value = expected value.
Alternative hypothesis (Ha): The expected and observed value are different. Observed value < or > expected value.

The t-test checks whether the data are compatible with the null-hypothesis. It yields a probability (P-value): the chance of obtaining a difference at least as large as the observed one if the null-hypothesis is true. In other words, we are asking the following question: What is the probability that our observed average ends up this far (or farther) from the expected average by chance alone?

The t-statistic measures the difference between the observed average and the expected value in units of the SE:

t-statistic = (observed average - expected value) / SE

In our example, SE = 8.07/√5 ≈ 3.61, so the t-statistic is (77.80 - 70)/3.61 ≈ 2.2. We are 2.2 SE units off the expected value! What is the probability that we are that far off (or more) from the expected value? We can use the normal curve to estimate this probability (fig)(fig).

However, because we deal with small numbers of measurements, we need to use Student’s t curve, which looks very similar to the normal curve (fig). When you use the Student’s curve, you will be asked to give the degrees of freedom:

degrees of freedom = number of measurements -1

In our example, we need to use the two-tailed t-test (TDIST). We get a P-value of 0.097 or 9.7%. The cut-off value is typically chosen at 5%. If the P-value is below 5%, we reject the null-hypothesis. In our case, we cannot reject the null-hypothesis. That means the difference between our observed and expected value is not statistically significant. By performing more measurements, we can reduce the SE and check more precisely whether our instrument is systematically off.
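The calibration example can be reproduced with a short Python sketch. The area under Student's t curve is obtained here by simple numerical integration (Simpson's rule), which is my own choice to keep the example free of extra libraries. The function names are hypothetical.

```python
import math
import statistics

def t_pdf(x, df):
    """Height of Student's t curve with df degrees of freedom at x."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def two_tailed_p(t, df, steps=2000):
    """Area under the t curve outside ±|t| (Simpson's rule, steps must be even)."""
    b = abs(t)
    h = b / steps
    total = t_pdf(0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_0_to_t = total * h / 3
    return 1 - 2 * area_0_to_t      # the curve is symmetric about 0

measurements = [78, 83, 68, 72, 88]   # from the example above
expected = 70                         # ozone standard

n = len(measurements)
average = statistics.mean(measurements)
se = statistics.stdev(measurements) / math.sqrt(n)
t_stat = (average - expected) / se
p_value = two_tailed_p(t_stat, df=n - 1)

print(f"t = {t_stat:.2f}, degrees of freedom = {n - 1}, P = {p_value:.3f}")
```

The P-value comes out near 0.097, above the 5% cut-off, so the null-hypothesis is not rejected.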

This test can also be used to investigate if there are systematic differences between experiments, for example between samples obtained at different sites. Depending on certain constraints, you need to use one of the following t-tests that EXCEL offers:

t-Test: Paired two sample for Means
t-Test: Two-sample assuming unequal variances
(t-Test: Two-sample assuming equal variances)

Usually you do not know whether the variances are equal. In that case, just use the "unequal" test; it also gives a valid answer when the variances are equal. The best way to understand these cases is to go through some examples. Download the following EXCEL file and work through the cases.
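For orientation, here is a sketch of the t-statistics behind the paired test and the unequal-variances test. The formulas are the standard textbook ones (they are not spelled out in these notes), the function names are hypothetical, and the two sites' data are made up.

```python
import math
import statistics

def paired_t(a, b):
    """Paired samples: one-sample t-test on the pairwise differences."""
    d = [x - y for x, y in zip(a, b)]
    se = statistics.stdev(d) / math.sqrt(len(d))
    return statistics.mean(d) / se, len(d) - 1          # (t, degrees of freedom)

def unequal_variance_t(a, b):
    """Two independent samples, variances not assumed equal (Welch's form)."""
    va = statistics.variance(a) / len(a)   # squared SE of the average of a
    vb = statistics.variance(b) / len(b)   # squared SE of the average of b
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    # approximate degrees of freedom (usually not a whole number)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

site_a = [78, 83, 68, 72, 88]   # hypothetical readings at site A
site_b = [75, 84, 64, 70, 83]   # hypothetical readings at site B

print(paired_t(site_a, site_b))
print(unequal_variance_t(site_a, site_b))
```

The paired form only makes sense when each entry of one list belongs with the same-position entry of the other, for example two instruments read at the same times. Otherwise the unequal-variances form applies.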

Relevant EXCEL functions:

T-TEST (Data Analysis menu), TDIST, TINV

STATISTICS 4

Lines

The slope is the rate at which y increases with x, along the line.

slope = rise/run

The intercept of a line is its height at x=0.

The graph of the equation y = ax + b is a straight line, with slope a and intercept b.

Correlation

The relationship between two variables can be represented visually by a scatter diagram. When the scatter diagram is tightly clustered around a line, there is a strong association between the variables (fig).

A scatter diagram can be summarized by means of five statistics:

average and SD of x-values

average and SD of y values

correlation coefficient r

In a series of scatter diagrams with the same SDs, as r gets closer to ± 1 the points cluster more tightly around a line.

r = average of ((x in standard units) * (y in standard units))

The points cluster around the SD line. This line goes through the point of averages. When r is positive, the slope of the line is:

(SD of y)/(SD of x)

When r is negative, the slope is

-(SD of y)/(SD of x)

The correlation coefficient is a pure number, without units. It is not affected by:

interchanging the two variables
adding the same number to all values of one variable
multiplying all the values of one variable by the same positive number
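The definition of r and the three properties listed above can be checked with a short sketch. The data are made up, and the SD uses n in the denominator so that the plain average of the products gives r.

```python
import math

def standard_units(values):
    """Convert a list to standard units (SD with n in the denominator)."""
    avg = sum(values) / len(values)
    sd = math.sqrt(sum((v - avg) ** 2 for v in values) / len(values))
    return [(v - avg) / sd for v in values]

def correlation(xs, ys):
    """r = average of (x in standard units) * (y in standard units)."""
    zx, zy = standard_units(xs), standard_units(ys)
    return sum(a * b for a, b in zip(zx, zy)) / len(xs)

x = [1, 2, 3, 4, 5]   # made-up data
y = [2, 4, 5, 4, 5]

r = correlation(x, y)
r_swapped = correlation(y, x)                     # interchange the variables
r_shifted = correlation([v + 10 for v in x], y)   # add a constant to x
r_scaled = correlation(x, [3 * v for v in y])     # multiply y by a positive number

print(round(r, 4), round(r_swapped, 4), round(r_shifted, 4), round(r_scaled, 4))
```

All four values agree (about 0.77 for this data), because conversion to standard units removes both the offset and the scale of each variable.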

Regression

The regression line is to a scatter diagram as the average is to a list. The regression line for y on x estimates the average value for y corresponding to each value of x. The regression line is also sometimes called the Least Square Fit (fig). It basically minimizes the square of the vertical difference between the line and the data points.

This line is different from the SD line! (fig)

If you switch the axes, the regression line will look different.

The slope of the regression line is:

r × (SD of y)/(SD of x)

There are formulas in EXCEL with which to calculate the slope and intercept of the regression line.
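A sketch comparing the SD line with the regression line for a small made-up data set. The intercept is computed on the assumption that the regression line, like the SD line, passes through the point of averages. The direct least-squares slope is included as a cross-check.

```python
import math

x = [1, 2, 3, 4, 5]   # made-up data
y = [2, 4, 5, 4, 5]
n = len(x)

avg_x, avg_y = sum(x) / n, sum(y) / n
sd_x = math.sqrt(sum((v - avg_x) ** 2 for v in x) / n)
sd_y = math.sqrt(sum((v - avg_y) ** 2 for v in y) / n)
r = sum((a - avg_x) * (b - avg_y) for a, b in zip(x, y)) / (n * sd_x * sd_y)

# SD line: slope ±(SD of y)/(SD of x), sign taken from r
sd_line_slope = math.copysign(sd_y / sd_x, r)

# Regression line of y on x: slope r * (SD of y)/(SD of x)
regression_slope = r * sd_y / sd_x
regression_intercept = avg_y - regression_slope * avg_x

# Cross-check: least-squares slope computed directly from the deviations
least_squares_slope = (sum((a - avg_x) * (b - avg_y) for a, b in zip(x, y))
                       / sum((a - avg_x) ** 2 for a in x))

print(round(sd_line_slope, 3), round(regression_slope, 3),
      round(regression_intercept, 3), round(least_squares_slope, 3))
```

Unless r is exactly ±1, the regression line is flatter than the SD line. Here the slopes are about 0.6 and 0.77.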

Significance of Regression lines

Could it happen that the correlation occurred by chance only? The correlation coefficient r alone is not a good measure of this. If there is no correlation between two parameters, then the regression line should have a slope close to 0.

We can determine whether a significant simple relationship between X and Y exists by testing whether the slope could be equal to zero. If this hypothesis is rejected, we can conclude that there is evidence of a simple linear relationship.

Let us write the t statistic for this case:

t - statistic = slope / standard error of the slope

degrees of freedom: n-2

‘A correlation is significant at the 95% level’ means that the total area under the Student’s curve to the left of -t and to the right of +t (where t is the t-statistic) is less than 5%.
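A sketch of the t-statistic for a regression slope, using made-up data. The standard error of the slope is computed with the usual textbook formula, square root of (residual variance / sum of squared x-deviations), which is an assumption on my part since these notes do not give it.

```python
import math

x = [1, 2, 3, 4, 5]   # made-up data
y = [2, 4, 5, 4, 5]
n = len(x)

avg_x, avg_y = sum(x) / n, sum(y) / n
sxx = sum((a - avg_x) ** 2 for a in x)
sxy = sum((a - avg_x) * (b - avg_y) for a, b in zip(x, y))

slope = sxy / sxx
intercept = avg_y - slope * avg_x

# Vertical differences between the data points and the regression line
residuals = [b - (slope * a + intercept) for a, b in zip(x, y)]
residual_variance = sum(e ** 2 for e in residuals) / (n - 2)

se_slope = math.sqrt(residual_variance / sxx)
t_stat = slope / se_slope
degrees_of_freedom = n - 2

print(f"slope = {slope:.2f}, SE = {se_slope:.3f}, "
      f"t = {t_stat:.2f}, df = {degrees_of_freedom}")
```

For this small data set t is about 2.1 with 3 degrees of freedom. That is below the two-tailed 5% cut-off of about 3.18 for 3 degrees of freedom, so the slope is not significantly different from zero despite r being about 0.77.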

Relevant EXCEL functions:

TTEST, TDIST, CORRELATION, REGRESSION, LINEST

Resources:

Freedman, D., Pisani, R., Purves, R., and Adhikari, A. (1991) Statistics. WW Norton & Company, New York, 2nd ed. 514pp.

Fisher, F.E. (1973) Fundamental Statistics Concepts. Canfield Press, San Francisco, 371 pp.

Berenson, M.L., Levine, D.M., and Rindskopf, D. (1988) Applied statistics - A first course. Prentice Hall, Englewood Cliffs, NJ, 557pp.