pagoo to provide basic but fundamental statistical analyses and visualizations out-of-the-box. Some of them are just wrappers of other packages functions (see
micropan R packages; cite them also if you use
pagoo), and graphics are made by
plotly packages. All these are very useful for initial data exploration, but also provide a lot of flexibility to interact with other R packages in order to make more complex plots and more robust analyses. Also an interactive
Shiny App can be easily deployed with a single function call, useful specially for those users that not very used to work with R code and prefer a UI, but also powerful for experienced users since it gives a general view of the data.
All this methods are embebbed to the
pagoo object, so is very easy to start working with pangenome data.
Start by loading the example dataset.
library(pagoo) # Load package toy_rds <- system.file('extdata', 'campylobacter.RDS', package = 'pagoo') p <- load_pangenomeRDS(toy_rds)
As you may have already noted, methods and visualization are different than data fields (see previous tutorials) in the way of calling them. As any R function, you must use parentheses to run them.
In this tutorial some, but not all, methods will be covered. Once you get the idea, just browse the documentation to see other available methods and what arguments they take.
Pairwise distances between gene abundance can be retrieved by using
$dist() method. By default, Bray-Curtis distance is computed. This is just a wrapper of
## 16244_6_6 16244_6_18 17059_2_16 17059_2_23 17059_2_27 17150_1_73 ## 16244_6_18 0.06594656 ## 17059_2_16 0.12122816 0.12500000 ## 17059_2_23 0.09622745 0.09632517 0.07632399 ## 17059_2_27 0.09245937 0.11310008 0.10311629 0.08230990 ## 17150_1_73 0.08203991 0.09034444 0.13624408 0.12275937 0.12999735 ## 17059_2_42 0.08920705 0.09927089 0.14532148 0.12682137 0.13706919 0.09518600
Other distances available include, for instance, Jaccard distance (see documentation). As this method require presence/absence data, and not abundance, you should use it with
binary = TRUE argument set.
p$dist(method = "jaccard", binary = TRUE)
A method to compute
stats package) over the panmatrix is provided. You can use generic functions to further analyze the results.
pca <- p$pan_pca() summary(pca)
## Importance of components: ## PC1 PC2 PC3 PC4 PC5 PC6 PC7 ## Standard deviation 8.1626 6.1103 5.5458 5.2924 4.6981 4.29398 2.779e-14 ## Proportion of Variance 0.3278 0.1837 0.1513 0.1378 0.1086 0.09072 0.000e+00 ## Cumulative Proportion 0.3278 0.5115 0.6629 0.8007 0.9093 1.00000 1.000e+00
ggplot2 package to produce customizable visualizations. While is not necessary to load
ggplot2 to plot pangenome data, is better since you can fully customize them. Also we will be using
patchwork to arrange plots in some cases to show, side by side, the default plot and a customized one.
The most basic plot is a pie chart to show core, shell, and cloud genome proportions.
# Basic pie1 <- p$gg_pie() + ggtitle("Default") # Customize with ggplot2 pie2 <- pie1 + ggtitle("Customized") + theme_bw(base_size = 15) + scale_fill_brewer(palette = "Blues") # Arrange (patchwork) and plot pie1 + pie2
Frequency barplots are one of the basic plots in pangenomics. They show how many clusters are in only 1 genome, how many in 2 genomes, ..and so on, until showing how many clusters are in all genomes (i.e. they are part of the coregenome).
A binary map show gene presence/absence patterns per genome. Each column is a cluster of orthologous genes, and each row an organism. The columns (clusters) are sorted by column sums, so core genes appear to the left, cloud genes to the right, and shell genes in the middle. This plot is useful, for example, to identify strains with abnormal accessory gene patterns which can represent true biological signatures like the presence of extrachromosomal elements or artifacts product of contamination during sequencing.
Pangenome curves typically illustrate the number of gene clusters that are subsequently discovered as more genomes are added to the dataset. If the pangenome is open, more and more accessory genes will be discovered as new genomes are added to the analysis and the size of the core genome will tend to decrease.
pagoo applies the the Power-law distribution to fit the pangenome size and the Exponential decay function to fit the core genome size.
Rarefaction curves are computed by first performing permutations with
$rarefact(), and then fitting a Power Law to pangenome counts, and an Exponential Decay Law to coregenome counts. The latter two operations are done by using
$cg_exp_decay_fit() methods, respectively (see documentation for more details).
$gg_curves() combines this methods to facilitate plotting.
You can add points and facet data by category, and add any other customization:
p$gg_curves() + ggtitle("Pangenome and Coregenome curves") + geom_point() + facet_wrap(~Category, scales = 'free_y') + theme_bw(base_size = 15) + scale_color_brewer(palette = "Accent")
A biplot for visualizing the first 2 principal components in a PCA is always useful at early stages of analysis to explore possible association of genomes based on gene content. This method uses
$pan_pca() method to perform the PCA, and allows you to use organism metadata to color the points.
## DataFrame with 7 rows and 8 columns ## org id strain year country host ## <factor> <character> <character> <integer> <character> <character> ## 1 16244_6_6 FR15 2008/170h 2008 France Human ## 2 16244_6_18 FR27 2012/185h 2012 France Human ## 3 17059_2_16 AR1 99/801 1999 Argentina Bovine ## 4 17059_2_23 AR8 04/875 2004 Argentina Bovine ## 5 17059_2_27 AR12 06/195 2006 Argentina Bovine ## 6 17150_1_73 CA1 001A-0374 2005 Canada Human ## 7 17059_2_42 TW6 1830 2008 Taiwan Human ## source accession ## <character> <character> ## 1 Feces ERS672247 ## 2 Blood ERS672259 ## 3 Prepuce ERS739235 ## 4 Fetus ERS739242 ## 5 VM ERS739246 ## 6 Blood ERS686652 ## 7 Blood ERS739261
Since we have metadata associated to each organism, we will use it to color the organisms according to the
p$gg_pca(colour = 'host', size = 4) + theme_bw(base_size = 15) + scale_color_brewer(palette = "Set2")
Last but not least, you can deploy a local
shiny app by just one line of code.
You can explore an online example of pagoo
shiny app for a complete set of 69 Campylobacter fetus genomes at the following url: https://microgenlab.shinyapps.io/pagoo_campylobacter/ . This method is currently not intended for very large and complex datasets, as it may render slow. To work with big pangenomes we recommend the use of the R command line, but for tens or few hundred of genomes it works fairly good if you have patience.