Input

Input From Scratch

In order to generate a pagoo object, the only mandatory data structure is a data.frame which has to contain 3 basic columns. Let’s illustrate it with an toy dataset.

First load pagoo:

library(pagoo) # Load package
tgz <- system.file('extdata', 'toy_data.tar.gz', package = 'pagoo')
untar(tarfile = tgz, exdir = tempdir()) # Decompress example dataset
files <- list.files(path = tempdir(), full.names = TRUE, pattern = 'tsv$|fasta$') # List files
files

## [1] "/tmp/Rtmpo5qxP5/case_clusters_meta.tsv"
## [2] "/tmp/Rtmpo5qxP5/case_df.tsv"           
## [3] "/tmp/Rtmpo5qxP5/case_orgs_meta.tsv"    
## [4] "/tmp/Rtmpo5qxP5/organismA.fasta"       
## [5] "/tmp/Rtmpo5qxP5/organismB.fasta"       
## [6] "/tmp/Rtmpo5qxP5/organismC.fasta"       
## [7] "/tmp/Rtmpo5qxP5/organismD.fasta"       
## [8] "/tmp/Rtmpo5qxP5/organismE.fasta"

The file we need now is case_df.tsv. Lets load it and see what’s its structure:

data_file <- grep("case_df.tsv", files, value = TRUE)
data <- read.table(data_file, header = TRUE, sep = '\t', quote = '')
head(data)

##      gene       org cluster                             annot
## 1 gene081 organismA   OG001  Thioesterase superfamily protein
## 2 gene122 organismB   OG001          Thioesterase superfamily
## 3 gene299 organismC   OG001  Thioesterase superfamily protein
## 4 gene186 organismD   OG001  Thioesterase superfamily protein
## 5 gene076 organismE   OG001          Thioesterase superfamily
## 6 gene352 organismA   OG002 Inherit from proNOG: Thioesterase

So it is a data.frame with 4 columns. The first one with the name of each gene, the second one with the organism to which each gene belongs, the third one with the cluster to which each gene was assigned in the pangenome reconstruction, and the last one with annotation metadata for each gene. Of the 4 columns, the former 3 are required, and pagoo will look for columns named “gene”, “org”, and “cluster”. More columns are optional, and you can add as many as you want (or none) to add metadata of each gene.

With only this data (even ignoring the fourth column, which is metadata), you can start working with pagoo:

pg <- pagoo(data = data)

but let’s continue adding more.

Including organism metadata

The next 2 .tsv files contains metadata for each cluster and for each organism, respectively, and are optional arguments.

# Organism metadata
orgs_file <- grep("case_orgs_meta.tsv", files, value = TRUE)
orgs_meta <- read.table(orgs_file, header = TRUE, sep = '\t', quote = '')
head(orgs_meta)

##         org sero  country
## 1 organismA    a Westeros
## 2 organismB    b Westeros
## 3 organismC    c Westeros
## 4 organismD    a    Essos
## 5 organismE    b    Essos

On this data.frame we have a column named org which is mandatory in case you provide this argument. Other columns are metadata associated to each organism. Beware that organisms provided in this table (orgs_meta$org) must coincide with the names provided in the data$org field, in order to correctly map each variables.

Including cluster metadata

Last file contains metadata associated with each cluster of orthologous:

# Cluster metadata
clust_file <- grep("case_clusters_meta.tsv", files, value = TRUE)
clust_meta <- read.table(clust_file, header = TRUE, sep = '\t', quote = '')
head(clust_meta)

##   cluster   kegg  cog
## 1   OG001   <NA>    S
## 2   OG002   <NA>    S
## 3   OG003   <NA> <NA>
## 4   OG004   <NA>    D
## 5   OG005 K01990    V
## 6   OG006   <NA>    V

Again, the column clust_meta$cluster must contain the same elements as data$cluster column to be able to map one into the other.

With all this data the pagoo object will look much more complete. But you can still add sequence information to the pangenome, which makes it much more useful and interesting to work with.

Including sequences

In this made up dataset we have 5 organisms, so if you decide to add sequences to the pangenome you must provide them for all 5 organisms. The type of data needed is a DNA multifasta file for each organism, in which each sequence is a gene whose name can be mapped to the data$gene column. You must first load the sequences into a list, and name each list element as the organism provided in data$org (as well as org_meta$org). The list would look something like:

organism1
- gene1
- gene2
- …
- geneN
organism2
- gene1
- gene2
- …
- geneM
organism3
- gene1
- gene2
- …
- geneP
…

In the case of the example we are working on:

# Sequences
fasta_files <- grep("[.]fasta", files, value = TRUE) # List fasta files
names(fasta_files) <- sub('[.]fasta', '', basename(fasta_files)) # Name them 
# Read fasta files with Biostrings:
library(Biostrings)
sq <- lapply(fasta_files, readDNAStringSet)
class(sq) # Is list?

## [1] "list"

length(sq) # One list element per organism

## [1] 5

names(sq) # Names are the same as in data$org

## [1] "organismA" "organismB" "organismC" "organismD" "organismE"

class(sq[[1]]) # Class of each element of the list

## [1] "DNAStringSet"
## attr(,"package")
## [1] "Biostrings"

And we have a list of DNAStringSet (Biostrings package). Now we can load a quite complete pagoo object (you could still add more metadata to genes, clusters, or organisms):

p <- pagoo(data = data, # Required data
           org_meta = orgs_meta, # Organism's metadata
           cluster_meta = clust_meta, # Cluster's metadata
           sequences = sq) # Sequences

Input From Pangenome Reconstruction Software

All the above stuff with preparing data and loading classes seems difficult and time-consuming, but in real life working datasets this will be rarely needed. We are explaining it here to provide full details about how the software works, but this package also provides functions to automatically read-in output files from pangenome reconstruction software into pagoo, avoiding any formatting or manipulation of data.

Currently pagoo supports input from roary (Page et al., 2015), which has been the standard and most cited software for pangenome reconstruction, and panaroo (Tonkin-Hill et al., 2020). It is worth noticing that as roary become the most used software in this field, other tools as PEPPAN, PRIATE and panaroo include scripts to convert their output to roary’s format. To work with roary’s output, please refer to ?roary_2_pagoo documentation. Although panaroo also includes a tool for this, we provide a function to load their native output format, see ?panaroo_2_pagoo. For both functions you will only need the .gff files used as input for roary of panaroo, and the gene_presence_absence.csv file.

Also, we have created our own pangenome reconstruction software called pewit (Ferrés et al., still unpublished), which automatically generates a pagoo-like object to perform downstream analyses. This object contain all the methods and fields pagoo provides, plus a set of methods and fields exclusive to this software.

Other good pangenome reconstruction software already exists like PanX (Ding et al., 2018), micropan (Snipen & Liland, 2015), GET_HOMOLOGUES (Contreras-Moreira & Vinuesa, 2013), among others. We plan to provide support to some of them in the future.

Adding Metadata After Object Creation

After object creation, you may want to add new metadata given new information or as result of posterior analyses. pagoo objects include a function to add columns of metadata to each gene, each cluster, or each organism. To illustrate it, we will add a new column to the $organisms field named host to add made up information about the host where each genome was isolated from.

host_df <- data.frame(org = p$organisms$org, host = c("Cow", "Dog", "Cat", "Cow", "Sheep"))
p$add_metadata(map = "org", host_df)
p$organisms

## DataFrame with 5 rows and 4 columns
##         org        sero     country        host
##    <factor> <character> <character> <character>
## 1 organismA           a    Westeros         Cow
## 2 organismB           b    Westeros         Dog
## 3 organismC           c    Westeros         Cat
## 4 organismD           a       Essos         Cow
## 5 organismE           b       Essos       Sheep

In order to allow pagoo correct data mapping, the values in the first column of the metadata table should be available at p$organisms$org, and its column header must also be named org.

As said, you can add gene or cluster metadata following the same idea.

Saving and Loading `pagoo` objects

Once loaded, it has two methods for saving and reloading to a new R session. The first one is by saving them as flat (text) files:

tmp <- paste(tempdir(), "pangenome", sep = "/")
p$write_pangenome(dir = tmp)
list.files(tmp, full.names = TRUE)

## [1] "/tmp/Rtmpo5qxP5/pangenome/clusters.tsv" 
## [2] "/tmp/Rtmpo5qxP5/pangenome/data.tsv"     
## [3] "/tmp/Rtmpo5qxP5/pangenome/organisms.tsv"

This creates a directory with 3 text files. The advantage of this approach is that you can analyze it outside R, the disadvantage is that from a reproducibility point of view reading text could be less stable since class or number precision can be lost, and also you can’t save the state of the object in any given time. Only available organisms/genes/clusters are saved, and if you reload the class using the tsv files, any previously dropped organisms/gene/cluster won’t be available any more. (For information about dropping/recovering organisms, see “Subset” tutorial).

If you want to save the object and continue working with it in other R session, we recommend to save them as R objects with the RDS methods provided:

rds <- paste(tempdir(), "pangenome.RDS", sep = "/")
p$save_pangenomeRDS(file = rds)
p2 <- load_pangenomeRDS(rds)

This method is more stable (compatible between pagoo versions), secure (uses the same metadata classes, and precision isn’t lost), and convenient (the exact state of the object saved is restored, keeping dropped organisms/genes/clusters hidden, available to be recovered, and other object state configuration, e.g. core_level, is also saved).