The paprica pipeline was designed to infer the genomic content and genomic characteristics of a set of 16S rRNA gene reads. To enable this the paprica database organizes this information by phylogeny for many of the completed genomes in Genbank. In addition to metabolic inference this provides an opportunity to explore how genome content and genomic characteristics are organized phylogenetically. The following is a brief analysis of some genomic features using the paprica database and R. If you aren’t familiar with the paprica database this exercise will also familiarize you with some of its content and its organization.
The paprica pipeline and database can be obtained from Github here. In this post I’ll be using the database associated with version 0.3.1. The necessary files from the bacteria database (one could also conduct this analysis on the much smaller archaeal database) can be read into R as such:## Read in the pathways associated with the terminal nodes on the reference tree path <- read.csv('paprica/ref_genome_database/bacteria/terminal_paths.csv', row.names = 1) path[is.na(path)] <- 0 ## Read in the data associated with all completed genomes in Genbank data <- read.csv('paprica/ref_genome_database/bacteria/genome_data.final.csv', row.names = 1) ## During database creation genomes with duplicate 16S rRNA genes were removed, ## so limit to those that were retained data <- data[row.names(data) %in% row.names(path),] ## "path" is ordered by clade, meaning it is in top to bottom order of the reference tree, ## however, "data" is not, so order it here data <- data[order(data$clade),]
One fun thing to do at this point is to look at the distribution of metabolic pathways across the database. To develop a sensible view it is best to cluster the pathways according to which genomes they are found in.## The pathway file in the database is binary, so we use Jaccard for distance library('vegan') path.dist <- vegdist(t(path), method = 'jaccard') # distance between pathways (not samples!) path.clust <- hclust(path.dist)
The heatmap function is a bit cumbersome for this large matrix, so the visualization can be made using the image function.## Set a binary color scheme image.col <- colorRampPalette(c('white', 'blue'))(2) ## Image will order matrix in ascending order, which is not what we want here! image(t(data.matrix(path))[rev(path.clust$order),length(row.names(path)):1], col = image.col, ylab = 'Genome', xlab = 'Pathway', xaxt = 'n', yaxt = 'n') box()
There are a couple of interesting things to note in this plot. First, we can see the expected distribution of core pathways present in nearly all genomes, and the interesting clusters of pathways that are unique to a specific lineage. For clarity row names have been omitted from the above plot, but from within R you can pull out the taxa or pathways that interest you easily enough. Second, there are some genomes that have very few pathways. There are a couple of possible reasons for this that can be explored with a little more effort. One possibility is that these are poorly annotated genomes, or at least the annotation didn’t associate many or any coding sequences with either EC numbers or GO terms – the two pieces of information Pathway-Tools uses to predict pathways during database construction. Another possibility is that these genomes belong to obligate symbionts (either parasites or beneficial symbionts). Obligate symbionts often have highly streamlined genomes and few complete pathways. We can compare the number of pathways in each genome to other genome characteristics for additional clues.
A reasonable assumption is that the number of pathways in each genome should scale with the size of the genome. Large genomes with few predicted pathways might indicate places where the annotation isn’t compatible with the pathway prediction methods.## Plot the number of pathways as a function of genome size plot(rowSums(path) ~ data$genome_size, ylab = 'nPaths', xlab = 'Genome size') ## Plot P. ubique as a reference point select <- grep('Pelagibacter ubique HTCC1062', data$organism_name) points(rowSums(path)[select] ~ data$genome_size[select], pch = 19, col = 'red')
That looks pretty good. For the most part more metabolic pathways were predicted for larger genomes, however, there are some exceptions. The red point gives the location of Pelagibacter ubique HTCC1062. This marine bacterium is optimized for life under energy-limited conditions. Among its adaptations are a highly efficient and streamlined genome. In fact it has the smallest genome of any known free-living bacterium. All the points below it on the x-axis are obligate symbionts; these are dependent on their host for some of their metabolic needs. There are a few larger genomes that have very few (or even no) pathways predicted. These are the genomes with bad, or at least incompatible annotations (or particularly peculiar biochemistry).
The other genome parameters in paprica are the number of coding sequences identified (nCDS), the number of genetic elements (nge), the number of 16S rRNA gene copies (n16S), GC content (GC), and phi; a measure of genomic plasticity. We can make another plot to show the distribution of these parameters with respect to phylogeny.## Grab only the data columns we want data.select <- data[,c('n16S', 'nge', 'ncds', 'genome_size', 'phi', 'GC')] ## Make the units somewhat comparable on the same scale, a more ## careful approach would log-normalize some of the units first data.select.norm <- decostand(data.select, method = 'standardize') data.select.norm <- decostand(data.select.norm, method = 'range') ## Plot with a heatmap heat.col <- colorRampPalette(c('blue', 'white', 'red'))(100) heatmap(data.matrix(data.select.norm), margins = c(10, 20), col = heat.col, Rowv = NA, Colv = NA, scale = NULL, labRow = 'n', cexCol = 0.8)
Squinting at this plot it looks like GC content and phi are potentially negatively correlated, which could be quite interesting. These two parameters can be plotted to get a better view:plot(data.select$phi ~ data.select$GC, xlab = 'GC', ylab = 'phi')
Okay, not so much… but I think the pattern here is pretty interesting. Above a GC content of 50 % there appears to be no relationship, but these parameters do seem correlated for low GC genomes. This can be evaluated with linear models for genomes above and below 50 % GC.gc.phi.above50 <- lm(data.select$phi[which(data.select$GC >= 50)] ~ data.select$GC[which(data.select$GC >= 50)]) gc.phi.below50 <- lm(data.select$phi[which(data.select$GC < 50)] ~ data.select$GC[which(data.select$GC < 50)]) summary(gc.phi.above50) summary(gc.phi.below50) plot(data.select$phi ~ data.select$GC, xlab = 'GC', ylab = 'phi', type = 'n') points(data.select$phi[which(data.select$GC >= 50)] ~ data.select$GC[which(data.select$GC >= 50)], col = 'blue') points(data.select$phi[which(data.select$GC < 50)] ~ data.select$GC[which(data.select$GC < 50)], col = 'red') abline(gc.phi.above50, col = 'blue') abline(gc.phi.below50, col = 'red') legend('bottomleft', bg = 'white', legend = c('GC >= 50', 'GC < 50'), col = c('blue', 'red'), pch = 1)
As expected there is no correlation between genomic plasticity and GC content for the high GC genomes (R2 = 0) and a highly significant correlation for the low GC genomes (albeit with weak predictive power; R2 = 0.106, p = 0). So what’s going on here? Low GC content is associated with particular microbial lineages but also with certain ecologies. The free-living low-energy specialist P. ubique HTCC1062 has a low GC content genome for example, as do many obligate symbionts regardless of their taxonomy (I don’t recall if it is known why this is). Both groups are associated with a high degree of genomic modification, including genome streamlining and horizontal gene transfer.
I’m going to go way of the normal track here and do a bit of social commentary. I heard a radio piece on my drive home yesterday about the challenge of paying court and legal fees for low income wage earners. This can trap those guilty of minor offenses (like a traffic infraction) in a cycle of jail and debt that is difficult to break out of. It never made sense to me that financial penalties – which by their nature are punitive – don’t scale with income, as is common in some European countries. I decided to try and visualize how the weight of a penalty for say, a speeding ticket, scales with income. It was tempting to try and scale up, i.e. what is the equivalent of $300 for someone earning $X? I decided however, that it would be more informative to try and scale down.
In this scenario the penalty is $300. Someone earning the federal minimum wage and working 40 hours a week makes $15,080/year, so the $300 penalty is roughly 2 % of annual income. So be it, perhaps that’s a fair penalty. But what would be the equivalent if the same offender earned more? A private/seaman/airman who has just joined the military earns roughly $18,561/year. Paying the same ticket (and I know from experience that the military pays a lot of them) would equate to the minimum wage earner paying $243.72. A graduate student fortunate enough to get a stipend (and own a car) might earn $25,000/year. Paying the same ticket would be equivalent to the lowest wage earner paying $180.96, and down it goes along the income scale. If LeBron James, who earned $77.2 million last year in salary and endorsements (according to Forbes), got the ticket, the penalty would be equivalent to the lowest income wage earner paying $0.06. Salary data came from a variety of sources, including here and here. Salaries marked with an asterisk in the plot above are medians from these sources.
The skin of the Earth is the color of tar,
Ridged, freshly healed like the seams of a scar.
Through salt-spattered sky, a gray-winged gull sails;
Steam gently rises, the island exhales.
A power plant rests on porous basalt,
In spaces beneath, a dark final vault.
Carbon is cached with a strong crystal lock,
Ashes to ashes, rock back to rock.
In a First, Iceland Power Plant Turns Carbon Emissions to Stone, K. Krajick, Lamont-Doherty Earth Observatory
Rapid carbon mineralization for permanent disposal of anthropogenic carbon dioxide emissions, Matter et al., Science
Scientists Turn Carbon Dioxide Emissions into Stone, Magill, Climate Central
This is one in a series of posts by Katherine Allen, a researcher in geochemistry and paleoclimate at the School of Earth & Climate Sciences at the University of Maine.