Analyzing Your Data

We can touch upon only a very small subset of data analysis tools and types of graphs in this brief lecture. You may need to use a specific tool for your project; please discuss this with your advisor and research mentor!

Language note: the word “data” is plural. For example, it is wrong to say (and write) “this data shows that…”. You should say “these data show that…”. If you want to refer to a collection of data, you can say “this dataset shows that…”.

Organizing and Understanding Your Data

Recording, Summarizing, and Curating Your Data

The data you have gathered (or will gather) for your thesis are likely to come in many different forms, depending on your individual project. However, some guidelines are broadly applicable to everyone.

At some point you are likely to enter your data from their original format (field forms, lab notebook, etc.) into a computer spreadsheet. Think carefully about the best way to arrange your data, and keep the raw data together in a master file. Give this file a clear and obvious name. You will likely make several modified versions of this file (which you should save as different files), but you should always have an unaltered master file with just the raw data.
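One way to protect the master file is to never edit it directly and to let a small script create each working copy. The following is only a minimal sketch in Python: the function name `make_working_copy`, the file names, and the sample values are made up for illustration, and your own project may call for a different naming scheme.

```python
import shutil
import tempfile
from pathlib import Path


def make_working_copy(master, label):
    """Copy the raw master file to a new, clearly named working file.

    The master file is only read, never written to.
    """
    master = Path(master)
    # e.g. field_data_master.csv -> field_data_master__depth_in_cm.csv
    working = master.with_name(f"{master.stem}__{label}{master.suffix}")
    if working.exists():
        # Refuse to overwrite an earlier working file by accident.
        raise FileExistsError(f"{working} already exists; choose another label")
    shutil.copy2(master, working)
    return working


# Demonstration in a throwaway directory with made-up data.
with tempfile.TemporaryDirectory() as tmp:
    master = Path(tmp) / "field_data_master.csv"
    master.write_text("site,depth_m\nA,1.2\nB,3.4\n")
    copy = make_working_copy(master, "depth_in_cm")
    print(copy.name)  # field_data_master__depth_in_cm.csv
```

All later modifications are then made to the copy, so the raw data can always be recovered from the master file.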

As you work with your data, you will manipulate them in several ways. You will almost certainly make summaries (calculating means, frequencies, etc.), examine subsets of data, and manipulate the data to accommodate particular kinds of analyses. You will make your life much easier if you keep a log of your manipulations and analyses. If you spend an hour working with your data, take a few minutes to journal what you have done (either in a notebook or in a dedicated computer file). It is important to record what you did, why you did it, what the results were, and what new files you generated as a result. Since you will spend a significant amount of time with your data over the course of writing your thesis, taking a few minutes to record your incremental progress will save you time down the line.
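If you keep your log in a computer file, a few lines of code can make each entry quick to write and consistent in form. The sketch below is one possible layout, not a required one: the function `log_step`, the log file name, and the example entry are all hypothetical. It records the four items named above (what, why, result, new files) with a date stamp.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def log_step(log_file, what, why, result, new_files=()):
    """Append one dated entry to a plain-text analysis log and return it."""
    stamp = datetime.now().strftime("%Y-%m-%d %H:%M")
    files = ", ".join(new_files) if new_files else "none"
    entry = (
        f"[{stamp}]\n"
        f"  What I did: {what}\n"
        f"  Why:        {why}\n"
        f"  Result:     {result}\n"
        f"  New files:  {files}\n\n"
    )
    # Mode "a" appends, so earlier entries are never overwritten.
    with open(log_file, "a", encoding="utf-8") as fh:
        fh.write(entry)
    return entry


# Demonstration in a throwaway directory with a made-up entry.
with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "analysis_log.txt"
    log_step(
        log,
        what="Converted depth from m to cm",
        why="Plotting script expects cm",
        result="All rows converted",
        new_files=["field_data_master__depth_in_cm.csv"],
    )
    print(log.read_text(encoding="utf-8"))
```

A handwritten notebook serves the same purpose; the point is that every session leaves a dated record.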

Some specific points:

Frequently used basic statistical analyses

Graphics

Types of plots (This is not an exhaustive list)

  1. Y versus X scatter plot: simple line or symbol plot.
  2. Time series: data points are plotted versus time.
  3. Linear regression: scatter plot plus best fitting trend line
  4. Moving average: data are averaged in blocks around a central point (for example, 10 points on either side of a given point). Makes the most sense for time series data with large amounts of variability. Problem: data points at the very start and end of the time series do not have a full block of neighbors, so they cannot be included.
  5. Residual plot: a fitted trend line (e.g., from a linear regression) is subtracted out, so that only the differences between the trend and the data are plotted.
  6. Pie chart: good for displaying data that should add up to 100%.
  7. 3D plots: if you have measurements that depend on two other variables, you could plot the measurements as ‘heights’ (z values) on a ‘landscape’ of the variables (x and y values).
  8. Log-log or semi-log plot: used for displaying data that span a large range of values (for example, data points that range from 1 to 1000). Problem: the compressed scale can obscure serious errors in the data.
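The calculations behind items 3 to 5 are simple enough to write out. The following is a minimal sketch in plain Python, using ordinary least squares for the trend line and a centered window for the moving average; the function names and the sample numbers are made up for illustration, and in practice you would probably use a spreadsheet or a statistics package instead.

```python
from statistics import mean


def linear_fit(x, y):
    """Ordinary least-squares line y = slope * x + intercept."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def residuals(x, y, slope, intercept):
    """Differences between the data and the fitted trend line."""
    return [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]


def moving_average(values, half_width):
    """Centered moving average with half_width points on either side.

    Only points with a full window are averaged, so the first and
    last half_width points of the series are dropped.
    """
    return [
        mean(values[i - half_width : i + half_width + 1])
        for i in range(half_width, len(values) - half_width)
    ]


# Made-up time series: 7 measurements at times 0..6.
time = [0, 1, 2, 3, 4, 5, 6]
values = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0]

slope, intercept = linear_fit(time, values)
print("trend line:", slope, intercept)
print("residuals:", residuals(time, values, slope, intercept))

smoothed = moving_average(values, half_width=1)
print("moving average:", smoothed)
print(len(values), "points in,", len(smoothed), "points out")
```

The last line makes the end-point problem from item 4 visible: with one point on either side, the 7 input points yield only 5 averaged values.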

Characteristics of Good Plots, Figures, and Tables

A key feature of good scientific communication is making good figures and plots. Many more readers will see your figures and plots than will ever read the text of the paper.

Avoid becoming a graphical sinner!

Technical issues

Where to find help

Resources