Input From Scratch
In order to generate a pagoo
object, the only mandatory
data structure is a data.frame
which has to contain 3 basic
columns. Let’s illustrate it with an toy dataset.
First load pagoo
:
library(pagoo) # Load package
tgz <- system.file('extdata', 'toy_data.tar.gz', package = 'pagoo')
untar(tarfile = tgz, exdir = tempdir()) # Decompress example dataset
files <- list.files(path = tempdir(), full.names = TRUE, pattern = 'tsv$|fasta$') # List files
files
## [1] "/tmp/Rtmpo5qxP5/case_clusters_meta.tsv"
## [2] "/tmp/Rtmpo5qxP5/case_df.tsv"
## [3] "/tmp/Rtmpo5qxP5/case_orgs_meta.tsv"
## [4] "/tmp/Rtmpo5qxP5/organismA.fasta"
## [5] "/tmp/Rtmpo5qxP5/organismB.fasta"
## [6] "/tmp/Rtmpo5qxP5/organismC.fasta"
## [7] "/tmp/Rtmpo5qxP5/organismD.fasta"
## [8] "/tmp/Rtmpo5qxP5/organismE.fasta"
The file we need now is case_df.tsv
. Lets load it and
see what’s its structure:
data_file <- grep("case_df.tsv", files, value = TRUE)
data <- read.table(data_file, header = TRUE, sep = '\t', quote = '')
head(data)
## gene org cluster annot
## 1 gene081 organismA OG001 Thioesterase superfamily protein
## 2 gene122 organismB OG001 Thioesterase superfamily
## 3 gene299 organismC OG001 Thioesterase superfamily protein
## 4 gene186 organismD OG001 Thioesterase superfamily protein
## 5 gene076 organismE OG001 Thioesterase superfamily
## 6 gene352 organismA OG002 Inherit from proNOG: Thioesterase
So it is a data.frame
with 4 columns. The first one with
the name of each gene, the second one with the organism to which each
gene belongs, the third one with the cluster to which each gene was
assigned in the pangenome reconstruction, and the last one with
annotation metadata for each gene. Of the 4 columns, the former 3 are
required, and pagoo
will look for columns named “gene”,
“org”, and “cluster”. More columns are optional, and you can add as many
as you want (or none) to add metadata of each gene.
With only this data (even ignoring the fourth column, which is metadata), you can start working with pagoo:
pg <- pagoo(data = data)
but let’s continue adding more.
Including organism metadata
The next 2 .tsv
files contains metadata for each cluster
and for each organism, respectively, and are optional arguments.
# Organism metadata
orgs_file <- grep("case_orgs_meta.tsv", files, value = TRUE)
orgs_meta <- read.table(orgs_file, header = TRUE, sep = '\t', quote = '')
head(orgs_meta)
## org sero country
## 1 organismA a Westeros
## 2 organismB b Westeros
## 3 organismC c Westeros
## 4 organismD a Essos
## 5 organismE b Essos
On this data.frame
we have a column named
org
which is mandatory in case you provide this argument.
Other columns are metadata associated to each organism. Beware that
organisms provided in this table (orgs_meta$org
) must
coincide with the names provided in the data$org
field, in
order to correctly map each variables.
Including cluster metadata
Last file contains metadata associated with each cluster of orthologous:
# Cluster metadata
clust_file <- grep("case_clusters_meta.tsv", files, value = TRUE)
clust_meta <- read.table(clust_file, header = TRUE, sep = '\t', quote = '')
head(clust_meta)
## cluster kegg cog
## 1 OG001 <NA> S
## 2 OG002 <NA> S
## 3 OG003 <NA> <NA>
## 4 OG004 <NA> D
## 5 OG005 K01990 V
## 6 OG006 <NA> V
Again, the column clust_meta$cluster
must contain the
same elements as data$cluster
column to be able to map one
into the other.
With all this data the pagoo
object will look much more
complete. But you can still add sequence information to the pangenome,
which makes it much more useful and interesting to work with.
Including sequences
In this made up dataset we have 5 organisms, so if you decide to add
sequences to the pangenome you must provide them for all 5 organisms.
The type of data needed is a DNA multifasta file for each organism, in
which each sequence is a gene whose name can be mapped to the
data$gene
column. You must first load the sequences into a
list
, and name each list element as the organism provided
in data$org
(as well as org_meta$org
). The
list
would look something like:
- organism1
- gene1
- gene2
- …
- geneN
- organism2
- gene1
- gene2
- …
- geneM
- organism3
- gene1
- gene2
- …
- geneP
- …
In the case of the example we are working on:
# Sequences
fasta_files <- grep("[.]fasta", files, value = TRUE) # List fasta files
names(fasta_files) <- sub('[.]fasta', '', basename(fasta_files)) # Name them
# Read fasta files with Biostrings:
library(Biostrings)
sq <- lapply(fasta_files, readDNAStringSet)
class(sq) # Is list?
## [1] "list"
length(sq) # One list element per organism
## [1] 5
names(sq) # Names are the same as in data$org
## [1] "organismA" "organismB" "organismC" "organismD" "organismE"
class(sq[[1]]) # Class of each element of the list
## [1] "DNAStringSet"
## attr(,"package")
## [1] "Biostrings"
And we have a list
of DNAStringSet
(Biostrings package). Now we can load a quite complete
pagoo
object (you could still add more metadata to genes,
clusters, or organisms):
p <- pagoo(data = data, # Required data
org_meta = orgs_meta, # Organism's metadata
cluster_meta = clust_meta, # Cluster's metadata
sequences = sq) # Sequences
Input From Pangenome Reconstruction Software
All the above stuff with preparing data and loading classes seems
difficult and time-consuming, but in real life working datasets this
will be rarely needed. We are explaining it here to provide full details
about how the software works, but this package also provides functions
to automatically read-in output files from pangenome reconstruction
software into pagoo
, avoiding any formatting or
manipulation of data.
Currently pagoo
supports input from roary (Page et al.,
2015), which has been the standard and most cited software for pangenome
reconstruction, and panaroo (Tonkin-Hill
et al., 2020). It is worth noticing that as roary become the most used
software in this field, other tools as PEPPAN, PRIATE and panaroo
include scripts to convert their output to roary’s format. To work with
roary
’s output, please refer to ?roary_2_pagoo
documentation. Although panaroo also includes a tool for this, we
provide a function to load their native output format, see
?panaroo_2_pagoo
. For both functions you will only need the
.gff
files used as input for roary of panaroo, and the
gene_presence_absence.csv
file.
Also, we have created our own pangenome reconstruction software
called pewit (Ferrés et
al., still unpublished), which automatically generates a
pagoo
-like object to perform downstream analyses. This
object contain all the methods and fields pagoo
provides,
plus a set of methods and fields exclusive to this software.
Other good pangenome reconstruction software already exists like PanX (Ding et al., 2018), micropan (Snipen & Liland, 2015), GET_HOMOLOGUES (Contreras-Moreira & Vinuesa, 2013), among others. We plan to provide support to some of them in the future.
Adding Metadata After Object Creation
After object creation, you may want to add new metadata given new
information or as result of posterior analyses. pagoo
objects include a function to add columns of metadata to each gene, each
cluster, or each organism. To illustrate it, we will add a new column to
the $organisms
field named host
to add made up
information about the host where each genome was isolated from.
host_df <- data.frame(org = p$organisms$org, host = c("Cow", "Dog", "Cat", "Cow", "Sheep"))
p$add_metadata(map = "org", host_df)
p$organisms
## DataFrame with 5 rows and 4 columns
## org sero country host
## <factor> <character> <character> <character>
## 1 organismA a Westeros Cow
## 2 organismB b Westeros Dog
## 3 organismC c Westeros Cat
## 4 organismD a Essos Cow
## 5 organismE b Essos Sheep
In order to allow pagoo
correct data mapping, the values
in the first column of the metadata table should be available at
p$organisms$org
, and its column header must also be named
org
.
As said, you can add gene
or cluster
metadata following the same idea.
Saving and Loading pagoo
objects
Once loaded, it has two methods for saving and reloading to a new R session. The first one is by saving them as flat (text) files:
tmp <- paste(tempdir(), "pangenome", sep = "/")
p$write_pangenome(dir = tmp)
list.files(tmp, full.names = TRUE)
## [1] "/tmp/Rtmpo5qxP5/pangenome/clusters.tsv"
## [2] "/tmp/Rtmpo5qxP5/pangenome/data.tsv"
## [3] "/tmp/Rtmpo5qxP5/pangenome/organisms.tsv"
This creates a directory with 3 text files. The advantage of this approach is that you can analyze it outside R, the disadvantage is that from a reproducibility point of view reading text could be less stable since class or number precision can be lost, and also you can’t save the state of the object in any given time. Only available organisms/genes/clusters are saved, and if you reload the class using the tsv files, any previously dropped organisms/gene/cluster won’t be available any more. (For information about dropping/recovering organisms, see “Subset” tutorial).
If you want to save the object and continue working with it in other R session, we recommend to save them as R objects with the RDS methods provided:
rds <- paste(tempdir(), "pangenome.RDS", sep = "/")
p$save_pangenomeRDS(file = rds)
p2 <- load_pangenomeRDS(rds)
This method is more stable (compatible between pagoo
versions), secure (uses the same metadata classes, and precision isn’t
lost), and convenient (the exact state of the object saved is restored,
keeping dropped organisms/genes/clusters hidden, available to be
recovered, and other object state configuration,
e.g. core_level
, is also saved).