Environmental Data Analysis BC ENV 3017

Statistics 1 - Descriptive statistics

Definitions

The histogram

The histogram is a graphical presentation of a list of data. To make the general shape of a histogram independent of how the bin sizes are selected, the Y-axis should be normalized to show ‘% per unit of the horizontal axis’ (the percentage of cases in a bin divided by the bin width). With this ‘density scale’ on the vertical axis, the areas of the blocks come out in %, because the units on the horizontal axis cancel. The area under the histogram over an interval equals the percentage of cases in that interval. The total area under the histogram is 100%.
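The density scale can be sketched in a few lines of Python. This is only an illustration: the function name density_histogram, the bin edges, and the income values are made up for the example and are not the data behind the figure.

```python
def density_histogram(data, edges):
    """Return block heights in '% per horizontal unit' for the given bin edges.

    Bins include their left edge and exclude their right edge, except the
    last bin, which includes both edges.
    """
    n = len(data)
    heights = []
    for i in range(len(edges) - 1):
        lo, hi = edges[i], edges[i + 1]
        last = i == len(edges) - 2
        count = sum(1 for x in data if lo <= x < hi or (last and x == hi))
        percent = 100.0 * count / n          # % of cases in this bin
        heights.append(percent / (hi - lo))  # divide by the bin width
    return heights


# Hypothetical incomes in thousands of dollars (illustrative values only).
incomes = [5, 12, 18, 22, 25, 31, 38, 45, 60, 95]
edges = [0, 20, 40, 100]  # deliberately unequal bin widths

heights = density_histogram(incomes, edges)
# Area of each block = height x width, which is again a percentage.
areas = [h * (edges[i + 1] - edges[i]) for i, h in enumerate(heights)]

print("heights (% per unit):", heights)
print("areas (%):", areas, "total:", sum(areas))
```

The wide last bin holds as many cases as the first bin, but its block is lower because the same percentage is spread over a wider interval. The areas still add up to 100%.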

See example with distribution of family income in the United States (fig).

Measures of the center: mean, median, and mode

A list of values of a variable can typically be summarized by its average and standard deviation (s or SD).

Mean (Average) of a list = sum of entries/number of entries.

The average locates the center of a histogram, in the sense that the histogram balances when supported at the average (fig).

Half the area under a histogram lies to the left of the median, and half to the right. The median is another way to locate the center of a histogram (fig).
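A short Python sketch of the two measures, using a made-up list with one very large entry to show how the mean and the median can differ:

```python
import statistics

# Hypothetical list with one very large entry (illustrative values only).
values = [2, 3, 3, 4, 5, 6, 40]

mean = sum(values) / len(values)     # sum of entries / number of entries
median = statistics.median(values)   # middle entry of the sorted list

print("mean:", mean)
print("median:", median)
```

The single large entry pulls the mean well above most of the list, while the median stays at the middle entry.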

A 'compromise' between the median and the mean is the trimmed mean, which removes a certain percentage of the entries at the upper and lower ends of the sorted list before averaging (works only in StatPlus).
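A minimal sketch of a trimmed mean in Python. It assumes the percentage is removed at each end and that the number of removed entries is rounded down. Other tools may define the percentage differently, so this is not necessarily the StatPlus convention. The function name trimmed_mean and the values are hypothetical.

```python
def trimmed_mean(values, fraction):
    """Average after dropping `fraction` of the entries at EACH end.

    Assumption for this sketch: the number of entries dropped per end is
    len(values) * fraction, rounded down.
    """
    if not 0 <= fraction < 0.5:
        raise ValueError("fraction must be at least 0 and less than 0.5")
    ordered = sorted(values)
    k = int(len(ordered) * fraction)      # entries dropped at each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Hypothetical list with one very large entry (illustrative values only).
values = [1, 2, 3, 3, 4, 4, 5, 5, 6, 40]

plain_mean = trimmed_mean(values, 0.0)    # nothing removed: ordinary mean
trimmed = trimmed_mean(values, 0.1)       # drop the lowest and highest entry

print("mean:", plain_mean)
print("10% trimmed mean:", trimmed)
```

Removing the smallest and largest entry brings the result close to the median of this list.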

The mode is the most frequently occurring value in a list. It corresponds to the peak of the histogram.
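The mode can be found with Python's standard statistics module. The lists below are made up for illustration. statistics.multimode returns every value that shares the highest count, which matters when a list has more than one peak.

```python
import statistics

# Hypothetical lists (illustrative values only).
one_peak = [3, 5, 5, 7, 7, 7, 9]
two_peaks = [1, 1, 2, 2, 3]

mode = statistics.mode(one_peak)          # single most frequent value
modes = statistics.multimode(two_peaks)   # all values tied for most frequent

print("mode:", mode)
print("modes of a two-peaked list:", modes)
```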

There are other measures of the center, such as the geometric mean and the harmonic mean, which we are not going to use in this class.

Geometric mean of a list of n positive entries = nth root of the product of the entries.
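For reference only, both measures are available in Python's standard statistics module. The list is made up, and the comparison with the ordinary mean is included just to show that the three values differ.

```python
import statistics

# Hypothetical list of positive values (illustrative only).
values = [1, 4, 16]

arithmetic = statistics.mean(values)
geometric = statistics.geometric_mean(values)  # cube root of 1 * 4 * 16
harmonic = statistics.harmonic_mean(values)    # 3 / (1/1 + 1/4 + 1/16)

print("arithmetic mean:", arithmetic)
print("geometric mean:", geometric)
print("harmonic mean:", harmonic)
```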

Measures of variability

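One way to describe variability would be to look at the average deviation from the average:

Average deviation = sum of (entry - average)/number of entries.

However, the positive and negative deviations always cancel, so the result is 0 for any list, whether or not its distribution is symmetric.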
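A quick numerical check in Python that the signed deviations from the average cancel. The list is made up and deliberately not symmetric.

```python
# Hypothetical, clearly asymmetric list (illustrative values only).
values = [1, 2, 3, 10]

average = sum(values) / len(values)
deviations = [x - average for x in values]   # signed distance from the average

average_deviation = sum(deviations) / len(deviations)

print("average:", average)
print("deviations:", deviations)
print("average deviation:", average_deviation)
```

Because the signed deviations cancel, a useful measure of spread has to remove the sign first, for example by squaring each deviation.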

The standard deviation s (or SD) measures how far the entries on a list typically are from the average. Each number on a list is off the average by some amount. The standard deviation is a sort of average size for these amounts off. It is a measure of the variability of the data in the list.

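s = square root of [sum of (entry - average)^2 / (n - 1)]

Here n is the number of entries and n - 1 is the number of degrees of freedom (the number of values in the list that can be chosen independently for a given mean); s^2 is called the variance.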
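The calculation with n - 1 in the denominator can be written out step by step in Python and compared with statistics.stdev from the standard library, which uses the same n - 1 convention. The list is made up for illustration.

```python
import math
import statistics

# Hypothetical list (illustrative values only).
values = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(values)
average = sum(values) / n
squared_deviations = [(x - average) ** 2 for x in values]

variance = sum(squared_deviations) / (n - 1)   # s^2, with n - 1 degrees of freedom
sd = math.sqrt(variance)                       # s

print("variance:", variance)
print("standard deviation:", sd)
print("statistics.stdev:", statistics.stdev(values))
```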

Roughly 68% of the entries on a list of numbers are within one SD of the average, and about 95% are within two SDs of the average. This is so for many lists, but not all. It holds strictly only for data that are normally distributed (see below).
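The rule can be checked on simulated data. The sketch below draws a list from a normal distribution with an arbitrarily chosen mean of 10 and SD of 2, then counts the fraction of entries within one and two SDs of the average. The helper fraction_within is a hypothetical name, and the numbers are simulated, not measured.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Simulated, normally distributed data (not real measurements).
data = [random.gauss(10.0, 2.0) for _ in range(10_000)]

average = statistics.mean(data)
sd = statistics.stdev(data)


def fraction_within(values, center, half_width):
    """Fraction of entries no farther than half_width from center."""
    inside = sum(1 for x in values if abs(x - center) <= half_width)
    return inside / len(values)


within_one = fraction_within(data, average, sd)
within_two = fraction_within(data, average, 2 * sd)

print("within 1 SD:", round(100 * within_one, 1), "%")
print("within 2 SDs:", round(100 * within_two, 1), "%")
```

For a strongly skewed list the same count can come out noticeably different from 68% and 95%.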

Distribution statistics

Summary of Exercises


Relevant EXCEL functions:

Resources: