Environmental Data Analysis BC
ENV 3017
Statistics 1 - Descriptive statistics
Definitions
- in descriptive statistics, we use various mathematical tools to summarize the values of a data set.
- when the statistics involve only one variable we are talking about univariate statistics
- a variable is a single characteristic of an object or event
- there are quantitative (e.g. temperature, ozone) and qualitative (e.g. gender) variables
- discrete variables are quantitative variables that assume values from a defined list of numbers
- continuous variables are quantitative variables that can take any value within an interval, so the number of possible values is infinite
The histogram
The histogram is a graphical presentation of a list of data. To make the general shape of a histogram independent of how the bin sizes are selected, the Y-axis should be normalized to reflect ‘% per bin size’, that is, the percentage of cases in a bin divided by the width of that bin. With this ‘density scale’ on the vertical axis, the areas of the blocks come out in %, because the units on the horizontal axis cancel. The area under the histogram over an interval equals the percentage of cases in that interval. The total area under the histogram is 100%. See the example with the distribution of family income in the United States (fig).
- Make a histogram for batting average or salary in B&C file 'Baseball.xlsx', normalize it to % per bin size
- determine maximum and minimum of the list
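The density scale described above can be sketched in a few lines of Python. This is only an illustration: the function name `density_histogram`, the bin edges, and the short list of values are made up for the example and are not taken from 'Baseball.xlsx'.

```python
def density_histogram(values, bin_edges):
    """Return one bar height per bin, in '% per unit of the horizontal axis'.

    A value belongs to a bin if left <= value < right; the last bin also
    includes its right edge so that the maximum is not lost.
    """
    n = len(values)
    last = len(bin_edges) - 2
    heights = []
    for i in range(len(bin_edges) - 1):
        left, right = bin_edges[i], bin_edges[i + 1]
        count = sum(
            1 for v in values
            if left <= v < right or (i == last and v == right)
        )
        percent = 100.0 * count / n                # % of cases in this bin
        heights.append(percent / (right - left))   # % per bin size
    return heights


# Hypothetical data with bins of unequal width
values = [1, 2, 2, 3, 4, 6, 7, 9, 12, 18]
edges = [0, 5, 10, 20]

heights = density_histogram(values, edges)
areas = [h * (edges[i + 1] - edges[i]) for i, h in enumerate(heights)]

print("min, max:", min(values), max(values))
print("heights (% per unit):", heights)
print("block areas (%):", areas, "total:", sum(areas))
```

Because each height is a percentage divided by a bin width, multiplying by the width again gives the percentage of cases in that bin, and the block areas add up to 100% regardless of how the bins are chosen.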
Measures of the center: mean, median, and mode
A typical list of a variable can be summarized by its average and standard deviation (s or SD).
Mean (Average) of a list = sum of entries / number of entries.
The average locates the center of a histogram, in the sense that the histogram balances when supported at the average (fig).
Half the area under a histogram lies to the left of the median, and half to the right. The median is another way to locate the center of a histogram (fig).
A 'compromise' between median and mean is the trimmed mean, which removes a certain percentage of the numbers at the upper and lower ends of the list before averaging (works only in StatPlus).
The mode is the most frequently occurring value in a list; it corresponds to the peak in a histogram.
There are other measures of the center, such as the geometric mean and the harmonic mean, which we are not going to use in this class.
- calculate mean and median of batting average and salary in 'Baseball.xlsx'
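As a rough illustration of how these measures differ, the sketch below computes them with Python's standard `statistics` module. The list `data` is invented (it is not from 'Baseball.xlsx'), and `trimmed_mean` is a hypothetical helper written for this example; StatPlus may define its trimming slightly differently.

```python
import statistics


def trimmed_mean(values, fraction):
    """Mean after dropping `fraction` of the entries from EACH end of the sorted list."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)   # number of entries removed per end
    kept = ordered[k:len(ordered) - k]
    return statistics.mean(kept)


# Hypothetical list with one very large entry
data = [2, 3, 3, 4, 5, 6, 7, 8, 9, 53]

mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)
tmean = trimmed_mean(data, 0.10)

print("mean:", mean)
print("median:", median)
print("mode:", mode)
print("10% trimmed mean:", tmean)
```

The single large entry pulls the mean well above the median, while the trimmed mean, which discards the smallest and largest entry here, stays close to the median.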
Measures of variability
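One possibility to describe variability would be to look at the average distance from the average:
$\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})$
However, positive and negative deviations cancel, so this signed average is always 0 and cannot serve as a measure of spread.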
The standard deviation s (or SD) measures the typical distance of the entries from the average, regardless of sign. Each number on a list is off the average by some amount. The standard deviation is a sort of average size for these amounts off. It is a measure of the variability of the data in the list:
$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$
where n-1 is the number of degrees of freedom (the number of values in the list that can be chosen independently for a given mean), and $s^2$ is called the variance.
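The calculation can be followed step by step in a short Python sketch. The list `data` is made up for illustration; the result is compared with `statistics.stdev`, which also divides by n-1.

```python
import math
import statistics

# Hypothetical list of values
data = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(data)
mean = sum(data) / n

# Signed deviations from the average always sum to zero
deviations = [x - mean for x in data]
sum_of_deviations = sum(deviations)

# Variance: sum of squared deviations divided by the degrees of freedom (n - 1)
variance = sum(d ** 2 for d in deviations) / (n - 1)
sd = math.sqrt(variance)

print("mean:", mean)
print("sum of deviations:", sum_of_deviations)
print("variance:", variance)
print("SD (manual):", sd)
print("SD (statistics.stdev):", statistics.stdev(data))
```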
Roughly 68% of the entries on a list of numbers are within one SD of the average, and about 95% are within two SDs of the average. This is so for many lists, but not all. It holds strictly only for data that are normally distributed (see below).
- Calculate manually mean and SD for the reliability column in the B&C file 'Baseball.xlsx', then use the functions AVERAGE and STDEV. Make sure the results are the same.
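The 68% and 95% figures can be checked on a simulated list. The sketch below is an assumption-laden illustration, not course data: it draws 10,000 normally distributed numbers with a fixed random seed and counts how many fall within one and two SDs of the average; `fraction_within` is a helper name chosen for this example.

```python
import random
import statistics


def fraction_within(values, k):
    """Fraction of entries that lie within k standard deviations of the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    inside = sum(1 for v in values if abs(v - mean) <= k * sd)
    return inside / len(values)


# Simulated, normally distributed list (seed fixed only for reproducibility)
rng = random.Random(0)
normal_list = [rng.gauss(0, 1) for _ in range(10000)]

within_1 = fraction_within(normal_list, 1)
within_2 = fraction_within(normal_list, 2)

print("within 1 SD:", round(100 * within_1, 1), "%")
print("within 2 SD:", round(100 * within_2, 1), "%")
```

For a strongly skewed list the two percentages can differ noticeably from 68% and 95%, which is why the rule is only a rough guide outside the normal case.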
Distribution statistics
- It always makes sense to create a chart of the distribution (histogram) when analyzing a data set.
- The shape of the histogram can be described by terms such as skewness (fig) and kurtosis.
- Measures of the distribution of a variable include percentiles and quartiles.
- The pth percentile is the value in a given distribution such that p percent of the distribution is less than or equal to that value.
- Quartiles are the values located at the 25th, 50th, and 75th percentile.
- The interquartile range (IQR) is the difference between the 75th and 25th percentile.
- The 'univariate statistics' tool in StatPlus allows you to calculate all these.
- Outliers are measured relative to the IQR (fig)
- Data can be summarized in boxplots.
- concept tutorial 'Boxplots'
- make box plots for columns in B&C file 'Baseball.xlsx'
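The quantities that go into a boxplot can be sketched with Python's `statistics.quantiles`. Two choices here are assumptions rather than something stated in these notes: the 'inclusive' interpolation method (different programs interpolate percentiles slightly differently) and the common convention of flagging values more than 1.5 x IQR beyond the quartiles as outliers. The list `data` is invented.

```python
import statistics

# Hypothetical list with one unusually large value
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 40]

# Quartiles = 25th, 50th and 75th percentiles
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Assumed convention: fences at 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("quartiles:", q1, q2, q3)
print("IQR:", iqr)
print("fences:", lower_fence, upper_fence)
print("outliers:", outliers)
```

The box of a boxplot spans the first to the third quartile with a line at the median; values outside the fences are usually drawn as individual points.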
Summary of Exercises
- make a histogram for batting average or salary in B&C file 'Baseball.xlsx', normalize it to % per bin size
- determine maximum and minimum of the list
- calculate mean and median of batting average and salary in 'Baseball.xlsx'
- Calculate manually mean and SD for the reliability column in the B&C file 'Baseball.xlsx', then use the functions AVERAGE and STDEV. Make sure the results are the same.
- concept tutorial 'Boxplots'
- make box plots for columns in B&C file 'Baseball.xlsx'
Relevant EXCEL functions:
- HISTOGRAM, AVERAGE, MEDIAN, STDEV, KURT, SKEW, IQR, StatPlus Univariate tools
Resources:
- Berk & Carey, chapter 4.
- Freedman, D., Pisani, R., Purves, R., and Adhikari, A. (1991) Statistics. WW Norton & Company, New York, 2nd ed., 514 pp.
- Fisher, F.E. (1973) Fundamental Statistics Concepts. Canfield Press, San Francisco, 371 pp.
- Berenson, M.L., Levine, D.M., and Rindskopf, D. (1988) Applied statistics - A first course. Prentice Hall, Englewood Cliffs, NJ, 557 pp.