To generate a pagoo object, the only mandatory data structure is a data.frame, which must contain 3 basic columns. Let’s illustrate this with a toy dataset.
library(pagoo) # Load package
tgz <- system.file('extdata', 'toy_data.tar.gz', package = 'pagoo')
untar(tarfile = tgz, exdir = tempdir()) # Decompress example dataset
files <- list.files(path = tempdir(), full.names = TRUE, pattern = 'tsv$|fasta$') # List files
files
## [1] "/var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T//Rtmpv6kGaH/case_clusters_meta.tsv"
## [2] "/var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T//Rtmpv6kGaH/case_df.tsv"
## [3] "/var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T//Rtmpv6kGaH/case_orgs_meta.tsv"
## [4] "/var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T//Rtmpv6kGaH/organismA.fasta"
## [5] "/var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T//Rtmpv6kGaH/organismB.fasta"
## [6] "/var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T//Rtmpv6kGaH/organismC.fasta"
## [7] "/var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T//Rtmpv6kGaH/organismD.fasta"
## [8] "/var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T//Rtmpv6kGaH/organismE.fasta"
The file we need now is case_df.tsv. Let’s load it and look at its structure:
data_file <- grep("case_df.tsv", files, value = TRUE)
data <- read.table(data_file, header = TRUE, sep = '\t', quote = '')
head(data)
##      gene       org cluster                             annot
## 1 gene081 organismA   OG001  Thioesterase superfamily protein
## 2 gene122 organismB   OG001          Thioesterase superfamily
## 3 gene299 organismC   OG001  Thioesterase superfamily protein
## 4 gene186 organismD   OG001  Thioesterase superfamily protein
## 5 gene076 organismE   OG001          Thioesterase superfamily
## 6 gene352 organismA   OG002 Inherit from proNOG: Thioesterase
So it is a data.frame with 4 columns: the first holds the name of each gene, the second the organism to which each gene belongs, the third the cluster to which each gene was assigned in the pangenome reconstruction, and the last annotation metadata for each gene. Of the 4 columns, the first 3 are required, and pagoo will look for columns named “gene”, “org”, and “cluster”. Additional columns are optional; you can add as many as you want (or none) to provide metadata for each gene.
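Before building the object, it can be handy to confirm that a table carries the three required column names. The following is a minimal base R sketch of such a check. The helper `missing_required()` and the `toy` table are illustrative assumptions and are not part of pagoo.

```r
# Toy table mirroring the first rows of the example data
toy <- data.frame(
  gene    = c("gene081", "gene122", "gene299"),
  org     = c("organismA", "organismB", "organismC"),
  cluster = c("OG001", "OG001", "OG001"),
  annot   = c("Thioesterase superfamily protein",
              "Thioesterase superfamily",
              "Thioesterase superfamily protein"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not a pagoo function): names of required columns
# that are absent from the table
missing_required <- function(df, required = c("gene", "org", "cluster")) {
  setdiff(required, colnames(df))
}

missing_required(toy)                        # Nothing missing
missing_required(toy[, c("gene", "annot")])  # "org" and "cluster" are missing
```

An empty result means the table has the columns pagoo looks for; extra columns such as `annot` do not affect the check.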
With only this data (even ignoring the fourth column, which is metadata), you can start working with pagoo:
pg <- pagoo(data = data)
but let’s continue adding more.
The next 2 .tsv files contain metadata for each cluster and for each organism, respectively, and are passed as optional arguments.
# Organism metadata
orgs_file <- grep("case_orgs_meta.tsv", files, value = TRUE)
orgs_meta <- read.table(orgs_file, header = TRUE, sep = '\t', quote = '')
head(orgs_meta)
##         org sero  country
## 1 organismA    a Westeros
## 2 organismB    b Westeros
## 3 organismC    c Westeros
## 4 organismD    a    Essos
## 5 organismE    b    Essos
In this data.frame we have a column named org, which is mandatory if you provide this argument. Other columns are metadata associated with each organism. Beware that the organisms provided in this table (orgs_meta$org) must coincide with the names provided in the data$org field so that each variable is mapped correctly.
The last file contains metadata associated with each cluster of orthologs:
# Cluster metadata
clust_file <- grep("case_clusters_meta.tsv", files, value = TRUE)
clust_meta <- read.table(clust_file, header = TRUE, sep = '\t', quote = '')
head(clust_meta)
##   cluster   kegg  cog
## 1   OG001   <NA>    S
## 2   OG002   <NA>    S
## 3   OG003   <NA> <NA>
## 4   OG004   <NA>    D
## 5   OG005 K01990    V
## 6   OG006   <NA>    V
Again, the clust_meta$cluster column must contain the same elements as the data$cluster column so that one can be mapped onto the other.
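The matching requirement for both metadata tables can be checked up front with plain set operations. This is a rough base R sketch on small made-up tables. The helper `unmatched()` and the `*_toy` objects are hypothetical names chosen for illustration; they are not pagoo functions or part of the example dataset.

```r
# Small made-up tables standing in for data, orgs_meta and clust_meta
data_toy <- data.frame(
  org     = c("organismA", "organismB", "organismA"),
  cluster = c("OG001", "OG001", "OG002"),
  stringsAsFactors = FALSE
)
orgs_meta_toy  <- data.frame(org = c("organismA", "organismB"),
                             sero = c("a", "b"),
                             stringsAsFactors = FALSE)
clust_meta_toy <- data.frame(cluster = c("OG001", "OG003"),
                             cog = c("S", NA),
                             stringsAsFactors = FALSE)

# Hypothetical helper: keys found in one table but not in the other
unmatched <- function(data_keys, meta_keys) {
  list(only_in_data = setdiff(unique(data_keys), meta_keys),
       only_in_meta = setdiff(unique(meta_keys), data_keys))
}

unmatched(data_toy$org, orgs_meta_toy$org)           # Both elements empty: keys agree
unmatched(data_toy$cluster, clust_meta_toy$cluster)  # OG002 and OG003 do not match
```

Two empty elements indicate that the key columns agree; anything listed points to a name that would not map between the tables.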
With all this data the
pagoo object will look much more complete. But you can still add sequence information to the pangenome, which makes it much more useful and interesting to work with.
In this made-up dataset we have 5 organisms, so if you decide to add sequences to the pangenome you must provide them for all 5 organisms. The data needed is a DNA multifasta file for each organism, in which each sequence is a gene whose name can be mapped to the
data$gene column. You must first load the sequences into a
list, and name each list element as the organism provided in
data$org (as well as
list would look something like:
In the case of the example we are working on:
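One way to build such a list is sketched here, assuming the Biostrings package is installed and the toy FASTA files were decompressed into tempdir() as shown earlier. The variable name `fasta_files` is illustrative.

```r
library(Biostrings) # Assumed to be installed (Bioconductor)

# Assumes the toy FASTA files were decompressed into tempdir() as shown earlier
fasta_files <- list.files(path = tempdir(), pattern = "[.]fasta$", full.names = TRUE)

# Name each path after its organism, e.g. "organismA.fasta" becomes "organismA"
names(fasta_files) <- sub("[.]fasta$", "", basename(fasta_files))

# Read each multifasta into a DNAStringSet; lapply() keeps the names
sq <- lapply(fasta_files, readDNAStringSet)
class(sq) # Is it a list?
```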
## [1] "list"
length(sq) # One list element per organism
## [1] 5
names(sq) # Names are the same as in data$org
## [1] "organismA" "organismB" "organismC" "organismD" "organismE"
class(sq[[1]]) # Class of each element of the list
## [1] "DNAStringSet"
## attr(,"package")
## [1] "Biostrings"
Each element is a DNAStringSet (Biostrings package). Now we can load a fairly complete pagoo object (you could still add more metadata to genes, clusters, or organisms):
p <- pagoo(data = data, # Required data
           org_meta = orgs_meta, # Organism's metadata
           cluster_meta = clust_meta, # Cluster's metadata
           sequences = sq) # Sequences
All of the above data preparation may seem difficult and time-consuming, but with real-life datasets it is rarely needed. We explain it here to give full details of how the software works, but this package also provides functions that automatically read output files from pangenome reconstruction software into pagoo, avoiding any manual formatting or manipulation of data.
pagoo supports input from roary (Page et al., 2015), which has been the standard and most cited software for pangenome reconstruction, and panaroo (Tonkin-Hill et al., 2020). It is worth noting that, as roary has become the most used software in this field, other tools such as PEPPAN, PIRATE and panaroo include scripts to convert their output to roary’s format. To work with roary’s output, please refer to the ?roary_2_pagoo documentation. Although panaroo also includes a tool for this, we provide a function to load its native output format; see ?panaroo_2_pagoo. For both functions you will only need the .gff files used as input for roary or panaroo, and the
Also, we have created our own pangenome reconstruction software called pewit (Ferrés et al., still unpublished), which automatically generates a pagoo-like object for downstream analyses. This object contains all the methods and fields pagoo provides, plus a set of methods and fields exclusive to this software.
Other good pangenome reconstruction software exists, such as PanX (Ding et al., 2018), micropan (Snipen & Liland, 2015), and GET_HOMOLOGUES (Contreras-Moreira & Vinuesa, 2013), among others. We plan to support some of them in the future.
After object creation, you may want to add new metadata as new information becomes available or as a result of subsequent analyses. pagoo objects include a function to add columns of metadata to each gene, each cluster, or each organism. To illustrate it, we will add a new column named host to the $organisms field, containing made-up information about the host from which each genome was isolated.
host_df <- data.frame(org = p$organisms$org, host = c("Cow", "Dog", "Cat", "Cow", "Sheep"))
p$add_metadata(map = "org", host_df)
p$organisms
## DataFrame with 5 rows and 4 columns
##         org        sero     country        host
##    <factor> <character> <character> <character>
## 1 organismA           a    Westeros         Cow
## 2 organismB           b    Westeros         Dog
## 3 organismC           c    Westeros         Cat
## 4 organismD           a       Essos         Cow
## 5 organismE           b       Essos       Sheep
For pagoo to map the data correctly, the values in the first column of the metadata table must be present in
p$organisms$org, and its column header must also be named
As mentioned, you can add cluster metadata in the same way.
Once loaded, a pagoo object has two methods for saving it and reloading it in a new R session. The first is to save it as flat (text) files:
tmp <- paste(tempdir(), "pangenome", sep = "/")
p$write_pangenome(dir = tmp)
list.files(tmp, full.names = TRUE)
## [1] "/var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T//Rtmpv6kGaH/pangenome/clusters.tsv"
## [2] "/var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T//Rtmpv6kGaH/pangenome/data.tsv"
## [3] "/var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T//Rtmpv6kGaH/pangenome/organisms.tsv"
This creates a directory with 3 text files. The advantage of this approach is that you can analyze the files outside R. The disadvantage is that, from a reproducibility point of view, reading text can be less stable, since column classes or numeric precision can be lost, and you cannot save the state of the object at any given time. Only available organisms/genes/clusters are saved, so if you reload the class from the tsv files, any previously dropped organisms/genes/clusters will no longer be available. (For information about dropping/recovering organisms, see the “Subset” tutorial.)
If you want to save the object and continue working with it in another R session, we recommend saving it as an R object with the RDS methods provided:
rds <- paste(tempdir(), "pangenome.RDS", sep = "/")
p$save_pangenomeRDS(file = rds)
p2 <- load_pangenomeRDS(rds)
This method is more stable (compatible between
pagoo versions), secure (uses the same metadata classes, and precision isn’t lost), and convenient (the exact state of the object saved is restored, keeping dropped organisms/genes/clusters hidden, available to be recovered, and other object state configuration, e.g.
core_level, is also saved).
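The loss of class and precision in text files, compared with RDS, can be seen with base R alone. The sketch below uses base `write.table()`/`read.table()` and `saveRDS()`/`readRDS()` on a made-up table as an analogy. It does not call the pagoo methods themselves, and the column names `id` and `value` are assumptions for illustration.

```r
# Made-up table: a character ID with leading zeros and a numeric column
df <- data.frame(
  id    = c("001", "002"),
  value = c(1 / 3, 2 / 3),
  stringsAsFactors = FALSE
)

# Round trip through a tab-separated text file
tsv <- tempfile(fileext = ".tsv")
write.table(df, tsv, sep = "\t", quote = FALSE, row.names = FALSE)
from_tsv <- read.table(tsv, header = TRUE, sep = "\t", quote = "")

# Round trip through an RDS file
rds_file <- tempfile(fileext = ".rds")
saveRDS(df, rds_file)
from_rds <- readRDS(rds_file)

class(from_tsv$id)                   # No longer character: "001" was read as a number
identical(from_tsv$value, df$value)  # Text output keeps only 15 significant digits
identical(from_rds, df)              # The RDS copy matches the original exactly
```

The text round trip changes the class of `id` and truncates `value`, while the RDS round trip returns an identical object.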