Environmental Data Analysis BC
ENV 3017
Hand entry of data logs
General Problem: Characterizing errors found in data sets.
ALL DATA SETS CONTAIN ERRORS!
- Sources of error:
- Errors in hand-written logs due to a distracted technician
- Typos that occur when keying in data
- Data placed in wrong row or column of a table
- Data in wrong units (e.g. feet instead of meters)
- Null (or missing) values interpreted as data
- Sensor malfunctions
- Data transmission/copy errors
- ... and a million other ways things can go wrong
- Error rates of 1 error in every 10-20 rows of a table of data are
typical.
- Understand where your data comes from, in as much detail as you can.
- Understand what the data values mean, so that you can look for
anomalous behavior.
- Be on the lookout for errors:
- dates and times that are out-of-sequence
- values that are out of the acceptable range of a parameter
- values that are stated to unusual precision
- repeated sequences of entries
- Perform consistency checks, where possible.
- What do you do if you find an error on the sheet?
- if the error is obvious, for example a wrong date or time in a
sequence of data, correct it and highlight the cell or add a comment
to the cell
- if you are uncertain, remove the data point from the table
- make a comment in the spreadsheet about any changes you made
- The West Point Ozone Project Sheet 43 handout is a hand-written data
log. In many data-collection projects a technician is responsible for
making routine measurements and writing the results down on log sheets
like this one. (Actually, this particular sheet is fake, constructed for
the purpose of illustration only.) Look the sheet over, noting that it
contains 5 columns of data:
- Date, in mm/dd/yyyy format
- Time, in twenty-four-hour style
- Wind direction, in degrees east of north
- Wind speed, in miles per hour
- Atmospheric ozone concentration, in parts per billion
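Several of the checks listed above (out-of-sequence dates and times, out-of-range values, repeated entries) can be automated once a log like this has been keyed in. The following is a minimal Python sketch, not part of the original handout. It assumes each row has been read into a dictionary of strings, that times are written as HH:MM, and that the column names and the acceptable ranges in LIMITS are hypothetical choices made for illustration.

```python
from datetime import datetime

# Hypothetical acceptable ranges; pick limits that suit your own site.
LIMITS = {
    "wind_dir_deg": (0.0, 360.0),
    "wind_speed_mph": (0.0, 100.0),
    "ozone_ppb": (0.0, 500.0),
}


def check_log(rows):
    """Return (row_number, message) pairs for suspicious entries.

    Each row is a dict of strings, exactly as keyed in, with the keys
    date, time, wind_dir_deg, wind_speed_mph and ozone_ppb.
    """
    problems = []
    previous_stamp = None
    previous_values = None
    for number, row in enumerate(rows, start=1):
        # Check 1: dates and times that are unreadable or out of sequence.
        try:
            stamp = datetime.strptime(
                row["date"] + " " + row["time"], "%m/%d/%Y %H:%M"
            )
        except ValueError:
            problems.append((number, "unreadable date or time"))
        else:
            if previous_stamp is not None and stamp <= previous_stamp:
                problems.append((number, "date/time out of sequence"))
            previous_stamp = stamp

        # Check 2: values outside the acceptable range of a parameter.
        for key, (low, high) in LIMITS.items():
            try:
                value = float(row[key])
            except ValueError:
                problems.append((number, key + " is not a number"))
                continue
            if not low <= value <= high:
                problems.append((number, key + " out of range"))

        # Check 3: measurements repeated exactly from the previous row.
        values = tuple(row[key] for key in LIMITS)
        if values == previous_values:
            problems.append((number, "measurements repeat previous row"))
        previous_values = values
    return problems
```

Calling check_log(rows) returns a list of (row number, message) pairs. An empty list means only that none of these three checks fired, not that the data are error-free. One bad timestamp can also make the following row look out of sequence, so treat each flag as a prompt to go back to the log sheet rather than as a verdict.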
Managing and curating data
- enter data ASAP after collection into spreadsheets
- don't forget comments, note any unusual events during data
collection
- proofread data, check for outliers related to data entry by
plotting up data, e.g. as time series
- include metadata ("data about data"), which should cover:
- name and contact info for person who collected them
- geographic info where data were collected
- aim of the study
- when you work on the data, keep a log of what you did, including
the date
- keep raw data separate from analyzed data, and do not modify the raw
data
- do a careful quality check of the data, and add comments when you
decide to remove or modify data
- document your analysis steps carefully
- the goal is that 10 years from now an uninvolved person can retrace
your steps
- make the file name meaningful so that a search of your desktop
can find it, include a date in the file name
- make backups of your files and store the backup at a different
location
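Two of the habits above, putting a date in a meaningful file name and keeping a dated record of what you did to the data, are easy to support with a few lines of code. A minimal Python sketch follows. The function names, the name layout (description, underscore, ISO date) and the tab-separated log format are assumptions made for illustration, not a prescribed convention.

```python
from datetime import date
from pathlib import Path


def dated_name(description, day, suffix=".csv"):
    """Build a searchable file name from a description and a date."""
    # Lower-case the description and join its words with hyphens.
    slug = "-".join(description.lower().split())
    return f"{slug}_{day.isoformat()}{suffix}"


def change_note(day, who, what):
    """Format one dated, tab-separated line for a change log."""
    return f"{day.isoformat()}\t{who}\t{what}"


def append_note(log_path, note):
    """Append a note to a plain-text log kept beside the data files."""
    with Path(log_path).open("a", encoding="utf-8") as log:
        log.write(note + "\n")
```

For example, dated_name("West Point ozone", date.today()) gives a name that starts with "west-point-ozone_" and ends with today's date, and append_note("change_log.txt", change_note(date.today(), "your name", "what you changed")) adds one line to the log. The log only ever grows, and the raw data file itself is never opened for writing.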