Environmental Data Analysis BC ENV 3017

Statistics 1 - Descriptive statistics

Definitions

In descriptive statistics, we use various mathematical tools to summarize the values of a data set.
When the statistics involve only one variable, we are talking about univariate statistics.
A variable is a single characteristic of an object or event.
There are quantitative (e.g. temperature, ozone) and qualitative (e.g. gender) variables.
Discrete variables are quantitative variables that assume values from a defined list of numbers.
Continuous variables can take any value within a range, so the number of possible values is infinite.

The histogram

The histogram is a graphical presentation of a list of data. To make the general shape of a histogram independent of how the bin sizes are selected, the Y-axis should be normalized to reflect ‘% per bin size’. With this ‘density scale’ on the vertical axis, the areas of the blocks come out in %, because the units on the horizontal axis cancel. The area under the histogram over an interval equals the percentage of cases in that interval. The total area under the histogram is 100%.
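
The density scale can be illustrated with a minimal Python sketch. The income values and bin edges below are made up for illustration (they are not the data behind the figure), and the helper name `density_histogram` is hypothetical. The sketch computes each bar height as percent divided by bin width, so the bar areas add up to 100% even when the bins are unequal.

```python
def density_histogram(values, bin_edges):
    """Return bar heights in percent per unit of the horizontal axis."""
    n = len(values)
    last_bin = len(bin_edges) - 2
    heights = []
    for i in range(len(bin_edges) - 1):
        left, right = bin_edges[i], bin_edges[i + 1]
        # Bins are [left, right); the last bin also includes its right edge.
        count = sum(
            1 for v in values
            if left <= v < right or (i == last_bin and v == right)
        )
        percent = 100.0 * count / n
        heights.append(percent / (right - left))  # % per unit of bin width
    return heights


# Made-up incomes in thousands of dollars (assumption, for illustration only).
incomes = [5, 8, 12, 15, 18, 22, 25, 31, 40, 75]
edges = [0, 10, 20, 50, 100]  # deliberately unequal bin widths

heights = density_histogram(incomes, edges)
areas = [h * (edges[i + 1] - edges[i]) for i, h in enumerate(heights)]

for i, (h, a) in enumerate(zip(heights, areas)):
    print(f"{edges[i]:>3} to {edges[i + 1]:>3}: height {h:.2f} % per unit, area {a:.0f} %")
print("total area:", round(sum(areas), 6), "%")
```

Because each height is divided by its own bin width, merging or splitting bins changes the bar heights but not the percentage represented by the area over any interval.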

See example with distribution of family income in the United States (fig).

Make a histogram for batting average or salary in the B&C file 'Baseball.xlsx' and normalize it to % per bin size.
Determine the maximum and minimum of the list.

Measures of the center: mean, median, and mode

A typical list of values of a variable can be summarized by its average and standard deviation (s or SD).

Mean (Average) of a list = sum of entries/number of entries.

The average locates the center of a histogram, in the sense that the histogram balances when supported at the average (fig).

Half the area under a histogram lies to the left of the median, and half to the right. The median is another way to locate the center of a histogram (fig).

A 'compromise' between median and mean is the trimmed mean, which removes a certain percentage of the values at the upper and lower ends of the list before averaging (available only in StatPlus).

The mode is the most frequently occurring value in a list; it corresponds to the peak of the histogram.
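
The four measures of the center described above can be compared on one small list. This is a minimal Python sketch using the standard `statistics` module; the salary values are made up, and `trimmed_mean` is a hypothetical helper written for this example (it is not the StatPlus implementation, which may trim differently).

```python
import statistics


def trimmed_mean(values, fraction):
    """Mean after dropping `fraction` of the entries at each end of the sorted list."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)  # number of entries dropped per end
    kept = ordered[k:len(ordered) - k] if k else ordered
    return statistics.mean(kept)


# Made-up salaries (assumption); one very large value pulls the mean upward.
salaries = [2, 3, 3, 4, 5, 6, 7, 8, 9, 53]

mean_value = statistics.mean(salaries)
median_value = statistics.median(salaries)
mode_value = statistics.mode(salaries)
trimmed_value = trimmed_mean(salaries, 0.10)  # drop 10% at each end

print("mean        :", mean_value)
print("median      :", median_value)
print("mode        :", mode_value)
print("trimmed mean:", trimmed_value)
```

In this list the single large entry raises the mean well above the median, while the trimmed mean lands close to the median because the extreme values are removed before averaging.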

There are other measures of the center, such as geometric mean and harmonic mean, which we are not going to use in this class.

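geometric mean: $\bar{x}_G = \left( \prod_{i=1}^{n} x_i \right)^{1/n}$

harmonic mean: $\bar{x}_H = n \Big/ \sum_{i=1}^{n} \frac{1}{x_i}$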
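
For reference only, since these two means are not used in this class, here is a short Python sketch that computes them from their standard definitions and checks the results against the `statistics` module (Python 3.8 or later). The input values are made up for illustration.

```python
import math
import statistics

growth_factors = [1, 4, 16]  # made-up positive values
speeds = [40, 60]            # made-up values, e.g. speeds over two equal distances

# Geometric mean: n-th root of the product of the n entries.
geo_manual = math.prod(growth_factors) ** (1 / len(growth_factors))
geo_library = statistics.geometric_mean(growth_factors)

# Harmonic mean: n divided by the sum of the reciprocals.
har_manual = len(speeds) / sum(1 / v for v in speeds)
har_library = statistics.harmonic_mean(speeds)

print("geometric mean:", geo_manual, geo_library)
print("harmonic mean :", har_manual, har_library)
```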

Calculate the mean and median of batting average and salary in 'Baseball.xlsx'.

Measures of variability

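One possibility to describe variability would be to look at the average deviation from the average:

$\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})$

However, the positive and negative deviations cancel exactly, so the result is always 0, whatever the shape of the distribution.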

The standard deviation s (or SD) measures the typical size of the deviations from the average, regardless of their sign. Each number on a list is off the average by some amount. The standard deviation is a sort of average size for these amounts off. It is a measure of the variability of the data in the list:

$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$

Here n-1 is the number of degrees of freedom (the number of values in the list that can be chosen independently for a given mean), and s² is called the variance.
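
A minimal Python sketch of the calculation, using a made-up list of values: it shows that the plain deviations from the mean sum to zero, computes s by hand with n-1 in the denominator, and compares the result with `statistics.stdev`, which uses the same sample formula.

```python
import math
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data for illustration
n = len(values)
mean_value = sum(values) / n

# Signed deviations cancel, so their average carries no information.
deviations = [v - mean_value for v in values]
sum_of_deviations = sum(deviations)

# Sample variance and standard deviation with n - 1 degrees of freedom.
variance = sum(d ** 2 for d in deviations) / (n - 1)
manual_sd = math.sqrt(variance)
library_sd = statistics.stdev(values)

print("mean              :", mean_value)
print("sum of deviations :", sum_of_deviations)
print("variance s^2      :", round(variance, 4))
print("SD (manual)       :", round(manual_sd, 4))
print("SD (statistics)   :", round(library_sd, 4))
```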

Roughly 68% of the entries on a list of numbers are within one SD of the average, and about 95% are within two SDs of the average. This is so for many lists, but not all. It holds strictly only for data that are normally distributed (see below).
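
The rule of thumb can be checked numerically. The sketch below is an illustration under stated assumptions: it draws a simulated normal sample with a fixed seed (not course data) and compares it with a made-up, strongly skewed list; `fraction_within` is a hypothetical helper name.

```python
import random
import statistics


def fraction_within(values, k):
    """Fraction of entries lying within k standard deviations of the mean."""
    center = statistics.mean(values)
    sd = statistics.stdev(values)
    inside = sum(1 for v in values if abs(v - center) <= k * sd)
    return inside / len(values)


rng = random.Random(0)  # fixed seed so the run is repeatable
normal_sample = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

within_one = fraction_within(normal_sample, 1)
within_two = fraction_within(normal_sample, 2)
print(f"normal sample: {within_one:.1%} within 1 SD, {within_two:.1%} within 2 SD")

# A made-up, strongly skewed list: nine small values and one very large one.
skewed = [1] * 9 + [100]
skewed_within_one = fraction_within(skewed, 1)
print(f"skewed list  : {skewed_within_one:.1%} within 1 SD")
```

The simulated normal sample should come out close to 68% and 95%, while the skewed list does not, which is the sense in which the rule holds for many lists but not all.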

Calculate the mean and SD manually for the reliability column in the B&C file 'Baseball.xlsx', then use the functions AVERAGE and STDEV. Make sure the results are the same.

Distribution statistics

It always makes sense to create a chart of the distribution (histogram) when analyzing a data set.
The shape of the histogram can be described by terms such as skewness (fig) and kurtosis.
Measures of the distribution of a variable include percentiles and quartiles.
The pth percentile is the value in a given distribution such that p percent of the distribution is less than or equal to that value.
Quartiles are the values located at the 25th, 50th, and 75th percentiles.
The interquartile range (IQR) is the difference between the 75th and 25th percentiles.
The 'univariate statistics' tool in StatPlus allows you to calculate all of these.
Outliers are measured relative to the IQR (fig).
Data can be summarized in boxplots.
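
The quantities behind a boxplot can be sketched in a few lines of Python. Two assumptions are made here that the notes do not state: outliers are flagged with the common 1.5 × IQR convention, and quartiles are interpolated with the 'inclusive' method of `statistics.quantiles`. Other programs may interpolate slightly differently, so results need not match StatPlus exactly. The data and the helper name `iqr_summary` are made up for illustration.

```python
import statistics


def iqr_summary(values, whisker=1.5):
    """Quartiles, IQR, outlier fences, and flagged outliers for a list."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - whisker * iqr  # assumed convention: 1.5 x IQR below Q1
    upper_fence = q3 + whisker * iqr  # assumed convention: 1.5 x IQR above Q3
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": q2,
        "q3": q3,
        "iqr": iqr,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "outliers": outliers,
    }


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]  # made-up list with one extreme value
summary = iqr_summary(data)
for name, value in summary.items():
    print(f"{name:>11}: {value}")
```

The box of a boxplot spans Q1 to Q3 with a line at the median, and points beyond the fences are drawn individually as outliers.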

Concept tutorial 'Boxplots'
Make box plots for columns in the B&C file 'Baseball.xlsx'.

Summary of Exercises

Make a histogram for batting average or salary in the B&C file 'Baseball.xlsx' and normalize it to % per bin size.
Determine the maximum and minimum of the list.
Calculate the mean and median of batting average and salary in 'Baseball.xlsx'.
Calculate the mean and SD manually for the reliability column in the B&C file 'Baseball.xlsx', then use the functions AVERAGE and STDEV. Make sure the results are the same.
Concept tutorial 'Boxplots'
Make box plots for columns in the B&C file 'Baseball.xlsx'.

Relevant EXCEL functions:

HISTOGRAM, AVERAGE, MEDIAN, STDEV, KURT, SKEW, IQR, StatPlus Univariate tools

Resources:

Berk & Carey, chapter 4.
Freedman, D., Pisani, R., Purves, R., and Adhikari, A. (1991) Statistics. W.W. Norton & Company, New York, 2nd ed., 514 pp.
Fisher, F.E. (1973) Fundamental Statistics Concepts. Canfield Press, San Francisco, 371 pp.
Berenson, M.L., Levine, D.M., and Rindskopf, D. (1988) Applied Statistics - A First Course. Prentice Hall, Englewood Cliffs, NJ, 557 pp.