Environmental Data Analysis EESC BC 3017
Histograms/Boxplots
Definitions
-
in descriptive statistics, we
use various mathematical tools to summarize the values of a data set.
-
when the statistics involve only one
variable, we speak of univariate statistics
-
a variable is a single characteristic
of an object or event
-
there are quantitative (e.g.
temperature, ozone) and qualitative (e.g. gender) variables
-
discrete variables are quantitative
variables that assume values from a defined list of numbers
-
continuous variables are quantitative
variables that can take any value within a range
The histogram
The histogram is a graphical presentation
of a list of data. To make the general shape of a histogram independent
of how the bin sizes are selected, the Y-axis can be normalized
to reflect ‘% per bin size’ (the percentage of cases in a bin divided
by the bin width). With this ‘density scale’ on the vertical
axis, the areas of the blocks come out in %, because the units on the horizontal
axis cancel. The area under the histogram over an interval equals the percentage
of cases in that interval. The total area under the histogram is 100%.
See the example showing the distribution of
family income in the United States (fig).
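The density scale described above can be sketched in a few lines of Python. This is only an illustration: the function name `density_histogram`, the data, and the bin edges are made up (they are not the family income figures), and the sketch assumes every value falls inside the given edges.

```python
def density_histogram(values, edges):
    """Return bar heights in '% per unit' for bins given by ascending edges.

    Bins are half-open [left, right), except the last one, which also
    includes its right edge. Assumes all values lie within the edges.
    """
    n = len(values)
    last = len(edges) - 2
    heights = []
    for i in range(len(edges) - 1):
        left, right = edges[i], edges[i + 1]
        if i == last:
            count = sum(1 for v in values if left <= v <= right)
        else:
            count = sum(1 for v in values if left <= v < right)
        percent = 100.0 * count / n               # % of cases in this bin
        heights.append(percent / (right - left))  # % per unit of the x-axis
    return heights


# Hypothetical data and bin edges, chosen only for illustration.
data = [1, 2, 2, 3, 4, 6, 7, 9, 12, 18]
edges = [0, 5, 10, 20]  # unequal bin widths: 5, 5, 10

heights = density_histogram(data, edges)

# Area of each block = height * width; the x-axis units cancel, leaving %.
areas = [h * (edges[i + 1] - edges[i]) for i, h in enumerate(heights)]
total_area = sum(areas)

print(heights)
print(areas, total_area)
```

Because the last bin is twice as wide as the others, its bar is drawn lower than a plain count would suggest, but its area still equals the percentage of cases it contains.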
Measures of the center: mean, median,
and mode
A list of values of a variable
can typically be summarized by its average and standard deviation (s or
SD).
Mean (Average) of a list =
sum of entries/number of entries.
The average locates the center of
a histogram, in the sense that the histogram balances when supported at
the average (fig).
Half the area under a histogram lies
to the left of the median, and half to the right. The median is
another way to locate the center of a histogram (fig).
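As a small illustration of the two definitions above, the following Python sketch computes the mean and the median of a made-up list and checks the balance property of the mean. The data values are hypothetical.

```python
import statistics

# Hypothetical list with one large value to pull the mean upward.
values = [2, 3, 3, 4, 5, 7, 25]

mean = sum(values) / len(values)    # sum of entries / number of entries
median = statistics.median(values)  # middle value of the sorted list

# Balance property: the deviations from the mean add up to zero,
# which is why the histogram balances when supported at the average.
balance = sum(v - mean for v in values)

# Median property: as many entries lie below the median as above it.
below = sum(1 for v in values if v < median)
above = sum(1 for v in values if v > median)

print(mean, median, balance, below, above)
```

In this list the single large entry moves the mean well above the median, which is typical of a distribution with a long right-hand tail.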
A 'compromise' between median and mean is the trimmed mean, which
removes
a certain percentage of the values at the upper and lower ends of the
sorted list before averaging (available only in StatPlus).
The mode is the most
frequently occurring value in an array or range of data.
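The trimmed mean and the mode can be sketched as follows. The helper `trimmed_mean` is hypothetical, and it assumes that the given fraction is removed from each end of the sorted list. StatPlus may define the percentage differently, so results need not match it exactly.

```python
import statistics


def trimmed_mean(values, fraction):
    """Mean after dropping `fraction` of the entries from EACH end.

    Assumption: the number of entries dropped per end is rounded down.
    """
    if not 0 <= fraction < 0.5:
        raise ValueError("fraction must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * fraction)  # entries dropped at each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Hypothetical data with one extreme value.
values = [2, 3, 3, 4, 5, 7, 25, 3, 4, 6]

plain_mean = trimmed_mean(values, 0.0)  # no trimming: the ordinary mean
trimmed = trimmed_mean(values, 0.10)    # drop 10% (one entry) per end
median = statistics.median(values)
mode = statistics.mode(values)          # most frequently occurring value

print(plain_mean, trimmed, median, mode)
```

With these numbers the trimmed mean falls between the median and the ordinary mean, which is the sense in which it acts as a compromise.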
There are other measures of the center, such as geometric
mean and harmonic mean, which we are not going to talk
about in this class.
Distribution statistics
-
It always makes sense to create
a chart of the distribution (histogram) when analyzing a data set.
-
The shape of the histogram can be described
by terms such as skewness (fig)
and kurtosis.
-
Kurtosis characterizes the relative
peakedness or flatness of a distribution compared with the normal distribution.
Positive kurtosis indicates a relatively peaked distribution. Negative
kurtosis indicates a relatively flat distribution.
-
Skewness characterizes the degree
of asymmetry of a distribution around its mean. Positive skewness indicates
a distribution with an asymmetric tail extending toward more positive values.
Negative skewness indicates a distribution with an asymmetric tail extending
toward more negative values.
-
Measures of the distribution of a variable
include percentiles and quartiles.
-
The pth percentile is the value
in a given distribution such that p percent of the distribution is
less than or equal to that value.
-
Quartiles are the values located
at the 25th, 50th, and 75th percentile.
-
The interquartile range (IQR)
is the difference between the 75th and 25th percentiles.
-
The 'univariate statistics' tool in
StatPlus allows you to calculate all these.
-
Outliers are identified relative
to the IQR (fig)
-
Data can be summarized in boxplots.
-
concept tutorial
'Boxplots'
-
make box plots
for columns in B&C file 'Base.xls'
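The percentile, quartile, IQR, and outlier items in the list above can be sketched in Python as follows. Two assumptions are made that the notes do not state: percentiles are computed by linear interpolation between sorted values (software packages differ on this), and outliers are taken to be values more than 1.5 × IQR beyond the quartiles, a common boxplot convention. The function name and the data are hypothetical, and the file 'Base.xls' is not used here.

```python
def percentile(values, p):
    """p-th percentile (0 <= p <= 100) by linear interpolation.

    Assumption: this is one common convention; other software may
    use a slightly different rule and give slightly different values.
    """
    ordered = sorted(values)
    pos = (len(ordered) - 1) * p / 100.0  # fractional index into sorted list
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    weight = pos - lower
    return ordered[lower] + weight * (ordered[upper] - ordered[lower])


# Hypothetical data with one unusually large value.
data = [1, 3, 4, 5, 6, 7, 8, 9, 30]

q1, q2, q3 = (percentile(data, p) for p in (25, 50, 75))
iqr = q3 - q1  # interquartile range

# Assumed outlier rule: more than 1.5 * IQR beyond the quartiles.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [v for v in data if v < low_fence or v > high_fence]

# Numbers needed to draw a boxplot: the box spans q1 to q3, and the
# whiskers end at the most extreme values that are not outliers.
inside = [v for v in data if low_fence <= v <= high_fence]
box = {
    "whisker_low": min(inside),
    "q1": q1,
    "median": q2,
    "q3": q3,
    "whisker_high": max(inside),
}

print(box, iqr, outliers)
```

The dictionary `box` holds the five values a boxplot displays, with outliers plotted as separate points beyond the whiskers.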
Relevant EXCEL functions:
-
HISTOGRAM, AVERAGE, MEDIAN, STDEV, KURT,
SKEW, IQR, StatPlus Univariate tools
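For readers who want to check skewness and kurtosis outside of Excel, here is a minimal Python sketch. It assumes the sample-corrected formulas documented for Excel's SKEW and KURT functions, where KURT reports kurtosis relative to the normal distribution. The function names and data are hypothetical, and other packages may use different definitions.

```python
import statistics


def sample_skewness(values):
    """Sample skewness (assumed to match the formula behind Excel's SKEW)."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least 3 values")
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation (n - 1)
    total = sum(((v - mean) / s) ** 3 for v in values)
    return n / ((n - 1) * (n - 2)) * total


def sample_excess_kurtosis(values):
    """Sample kurtosis relative to the normal distribution
    (assumed to match the formula behind Excel's KURT)."""
    n = len(values)
    if n < 4:
        raise ValueError("need at least 4 values")
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    total = sum(((v - mean) / s) ** 4 for v in values)
    lead = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return lead * total - correction


# Hypothetical data sets.
symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 2, 2, 3, 10]  # tail toward more positive values
left_tailed = [-10, -3, -2, -2, -1]  # tail toward more negative values

skew_sym = sample_skewness(symmetric)
skew_right = sample_skewness(right_tailed)
skew_left = sample_skewness(left_tailed)
kurt_sym = sample_excess_kurtosis(symmetric)

print(skew_sym, skew_right, skew_left, kurt_sym)
```

The evenly spread list has zero skewness and negative kurtosis (flatter than a normal distribution), while the two tailed lists give skewness of opposite sign.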
Resources:
Freedman, D., Pisani, R., Purves,
R., and Adhikari, A. (1991) Statistics. WW Norton & Company, New York,
2nd ed. 514pp.
Fisher, F.E. (1973) Fundamental Statistics
Concepts. Canfield Press, San Francisco, 371 pp.
Berenson, M.L., Levine, D.M., and
Rindskopf, D. (1988) Applied statistics - A first course. Prentice Hall,
Englewood Cliffs, NJ, 557pp.