Querying Data • pagoo

In this section we will start exploring what is stored inside the pagoo object and how we can access this information. Keep in mind that this object has its own associated data and methods that can be easily queried with the $ operator. These methods allow for the rapid subsetting, extraction and visualization of pangenome data.

First of all, we will load a pangenome using a toy dataset included in the package. This is a preloaded set of 10 Campylobacter spp. genomes, with metadata associated.

library(pagoo) # Load package
rds <- system.file('extdata', 'campylobacter.RDS', package = 'pagoo')
p <- load_pangenomeRDS(rds)

Summary statistics

A pangenome can be stratified in different gene subsets according to their frequency in the dataset. The core genes can be defined as those present in all or almost every genome (typically 95-100%). The remaining genes are defined as the accessory genome, that can be subdivided in cloud genes or singletons (present in one genome or in genomes that are identical) and shell genes which are those in the middle. Let’s see this using pagoo:

p$summary_stats

## DataFrame with 4 rows and 2 columns
##      Category    Number
##   <character> <integer>
## 1       Total      2588
## 2        Core      1627
## 3       Shell       413
## 4       Cloud       548

Core level

The core level defines the minimum number of genomes (as a percentage) in which a certain gene should be present to be considered a core gene. By default, pagoo considers as core all genes present in at least 95% of organisms. The core level can be modified to be more or less stringent defining the core genome. This feature exemplifies R6’s reference semantics, since modifying the core level will affect the pangenome object state resulting in different core, shell and cloud sets. Have a look:

p$core_level

## [1] 95

p$core_level <- 100 # Change value
p$summary_stats     # Updated object

## DataFrame with 4 rows and 2 columns
##      Category    Number
##   <character> <integer>
## 1       Total      2588
## 2        Core      1554
## 3       Shell       486
## 4       Cloud       548

As you can see, changing the core level from 95% to a more stringent 100% cause in decrease in the number of core genes from 1627 to 1554, and a concomitant increase in shell genes from 413 to 486. This means that 73 genes migrated from core to shell when increasing the threshold to consider a cluster as “core” to 100%. Now this changes remain in the object for subsequent analysis, or can be reverted by setting the core level again at the original value.

p$core_level <- 95

Pangenome matrix

The pangenome matrix is one of the most useful things when analyzing pangenomes. Typically, it represents organisms in rows and clusters of orthologous in columns informing about gene abundance (considering paralogues). The pangenome matrix looks like this (printing only first 5 columns):

p$pan_matrix[, 1:5]

##            group0001 group0002 group0003 group0004 group0005
## 16244_6_6          1         1         1         1         1
## 16244_6_18         1         1         1         1         1
## 17059_2_16         1         1         1         1         1
## 17059_2_23         1         1         1         1         0
## 17059_2_27         1         1         1         1         1
## 17150_1_73         1         1         1         1         1
## 17059_2_42         1         1         1         1         0

Gene metadata

Individual gene metadata can be accessed by using the $genes suffix. It always contains the gene name, the organism to which it belongs, the cluster to where it was assigned, and a gene identifier (gid) that is mainly used internally to organize the data. Also, it may typically include annotation data, genomic coordinates, etc, but this other metadata is optional. Gene metadata is spitted by cluster, so it consist in a List of DataFrames.

p$genes

## SplitDataFrameList of length 2588
## $group0001
## DataFrame with 7 rows and 10 columns
##     cluster        org             gene                    gid    geneName
##    <factor>   <factor>         <factor>            <character> <character>
## 1 group0001 16244_6_6  16244_6_6_00150  16244_6_6__16244_6_6..        ilvC
## 2 group0001 16244_6_18 16244_6_18_00172 16244_6_18__16244_6_..        ilvC
## 3 group0001 17059_2_16 17059_2_16_01012 17059_2_16__17059_2_..        ilvC
## 4 group0001 17059_2_23 17059_2_23_01040 17059_2_23__17059_2_..        ilvC
## 5 group0001 17059_2_27 17059_2_27_00492 17059_2_27__17059_2_..        ilvC
## 6 group0001 17150_1_73 17150_1_73_00221 17150_1_73__17150_1_..        ilvC
## 7 group0001 17059_2_42 17059_2_42_00176 17059_2_42__17059_2_..        ilvC
##                  product                 contig      from        to      strand
##              <character>            <character> <integer> <integer> <character>
## 1 ketol-acid reductois.. ERS672247|SC|contig0..    137333    138355           +
## 2 ketol-acid reductois.. ERS672259|SC|contig0..    173541    174563           +
## 3 ketol-acid reductois.. ERS739235|SC|contig0..     43645     44667           -
## 4 ketol-acid reductois.. ERS739242|SC|contig0..    173578    174600           +
## 5 ketol-acid reductois.. ERS739246|SC|contig0..    184035    185057           -
## 6 ketol-acid reductois.. ERS686652|SC|contig0..    207206    208228           +
## 7 ketol-acid reductois.. ERS739261|SC|contig0..    173569    174591           +
## 
## $group0002
## DataFrame with 7 rows and 10 columns
##     cluster        org             gene                    gid    geneName
##    <factor>   <factor>         <factor>            <character> <character>
## 1 group0002 16244_6_6  16244_6_6_01290  16244_6_6__16244_6_6..        hprA
## 2 group0002 16244_6_18 16244_6_18_01307 16244_6_18__16244_6_..        hprA
## 3 group0002 17059_2_16 17059_2_16_01627 17059_2_16__17059_2_..        hprA
## 4 group0002 17059_2_23 17059_2_23_00348 17059_2_23__17059_2_..        hprA
## 5 group0002 17059_2_27 17059_2_27_01208 17059_2_27__17059_2_..        hprA
## 6 group0002 17150_1_73 17150_1_73_01398 17150_1_73__17150_1_..        hprA
## 7 group0002 17059_2_42 17059_2_42_00751 17059_2_42__17059_2_..        hprA
##                  product                 contig      from        to      strand
##              <character>            <character> <integer> <integer> <character>
## 1 2-hydroxyacid dehydr.. ERS672247|SC|contig0..    274585    275517           -
## 2 2-hydroxyacid dehydr.. ERS672259|SC|contig0..    272748    273680           -
## 3 2-hydroxyacid dehydr.. ERS739235|SC|contig0..     11617     12549           +
## 4 2-hydroxyacid dehydr.. ERS739242|SC|contig0..    308107    309039           -
## 5 2-hydroxyacid dehydr.. ERS739246|SC|contig0..    114555    115487           -
## 6 2-hydroxyacid dehydr.. ERS686652|SC|contig0..    273866    274798           -
## 7 2-hydroxyacid dehydr.. ERS739261|SC|contig0..     15409     16341           +
## 
## ...
## <2586 more elements>

If you want to work with this data as a single DataFrame, just unlist it:

unlist(p$genes, use.names = FALSE)

pagoo also includes predefined subsets fields to list only certain pangenome category, these are queried by adding a prefix with the desired category followed by an underscore: $core_genes, $shell_genes, and $cloud_genes. These kind of subsets are better explained in the ‘4 - Subets’ tutorial, and also apply to other pangenome data described below.

Clusters metadata

Groups of orthologues (clusters) are also stored in pagoo objects as a table with a cluster identifier per row, and optional metadata associated as additional columns.

p$clusters

## DataFrame with 2588 rows and 2 columns
##        cluster              Pfam_Arch
##       <factor>            <character>
## 1    group0001           2-Hacid_dh_C
## 2    group0002 2-Hacid_dh_C;2-Hacid..
## 3    group0003 2-Hacid_dh_C;ACT;2-H..
## 4    group0004        2Fe-2S_thioredx
## 5    group0005    4HB_MCP_1;MCPsignal
## ...        ...                    ...
## 2584 group2584              zf-RING_7
## 2585 group2585               zf-TFIIB
## 2586 group2586                   ZinT
## 2587 group2587                   ZnuA
## 2588 group2588               ZT_dimer

Subsets also exists for this field: $core_clusters, $shell_clusters, and $cloud_clusters.

Sequences

Although is an optional field (it exists only if user provide this data as an argument when object is created), $sequences gives access to sequence data. Sequences are stored as a List of DNAStringSet (a.k.a DNAStringSetList, Biostrings package), grouped by cluster.

p$sequences             # List all sequences grouped by cluster

## DNAStringSetList of length 2588
## [["group0001"]] 16244_6_6__16244_6_6_00150=ATGGCGATAACAGTTTATTACGACAAAGATTGCG...
## [["group0002"]] 16244_6_6__16244_6_6_01290=ATGAAAATAGTATGCTTAGATGCCGACACGCTTG...
## [["group0003"]] 16244_6_6__16244_6_6_01710=ATGAAAACAGTTATAGTTTGCGATGCAATACATC...
## [["group0004"]] 16244_6_6__16244_6_6_01754=ATGAAATTCGAATTTACTCATGAGCAATTATCGG...
## [["group0005"]] 16244_6_6__16244_6_6_00049=ATGTCAAATTTAACTACTAACTTAACTACCAAAA...
## [["group0006"]] 16244_6_6__16244_6_6_01069=ATGAATTATTTTGAGAATTTAAAAGTTTCAACAA...
## [["group0007"]] 16244_6_6__16244_6_6_01612=ATGCGAATTAGAATTTATTATGAAGATACCGATG...
## [["group0008"]] 16244_6_6__16244_6_6_01679=ATGATGAAAGATATGGGCGAGCCACGTATAAAAA...
## [["group0009"]] 16244_6_18__16244_6_18_01216=ATGGGGCTTACTACGAGTACGACAAAGTATAT...
## [["group0010"]] 16244_6_6__16244_6_6_00758=ATGAAAAGAGTGGTTATAAAAGTAGGCTCTCACG...
## ...
## <2578 more elements>

p$sequences[["group0001"]]  # List first cluster

## DNAStringSet object of length 7:
##     width seq                                               names               
## [1]  1023 ATGGCGATAACAGTTTATTACGA...TAGTTAATAAAGACAAAAATTAA 16244_6_6__16244_...
## [2]  1023 ATGGCGATAACAGTTTATTACGA...TAGTTAATAAAGACAAAAATTAA 16244_6_18__16244...
## [3]  1023 ATGGCGATAACAGTTTATTACGA...TAGTTAATAAAGACAAAAATTAA 17059_2_16__17059...
## [4]  1023 ATGGCGATAACAGTTTATTACGA...TAGTTAATAAAGACAAAAATTAA 17059_2_23__17059...
## [5]  1023 ATGGCGATAACAGTTTATTACGA...TAGTTAATAAAGACAAAAATTAA 17059_2_27__17059...
## [6]  1023 ATGGCGATAACAGTTTATTACGA...TAGTTAATAAAGACAAAAATTAA 17150_1_73__17150...
## [7]  1023 ATGGCGATAACAGTTTATTACGA...TAGTTAATAAAGACAAAAATTAA 17059_2_42__17059...

Note that sequence names are created by pasting organism names and gene names, separated by a string that by default is sep = '__' (two underscores). This are the same as the gid column in the $genes field, and are initially set when pagoo object is created. If you think your dataset contain names with this separator, then you should set this parameter to other string to avoid conflicts. $sequences field also has predefined subsets: $core_sequences, $shell_sequences, and $cloud_sequences.

Organism metadata

The $organisms field contain a table with organisms and metadata as additional columns if provided.

p$organisms

## DataFrame with 7 rows and 8 columns
##          org          id      strain      year     country        host
##     <factor> <character> <character> <integer> <character> <character>
## 1 16244_6_6         FR15   2008/170h      2008      France       Human
## 2 16244_6_18        FR27   2012/185h      2012      France       Human
## 3 17059_2_16         AR1      99/801      1999   Argentina      Bovine
## 4 17059_2_23         AR8      04/875      2004   Argentina      Bovine
## 5 17059_2_27        AR12      06/195      2006   Argentina      Bovine
## 6 17150_1_73         CA1   001A-0374      2005      Canada       Human
## 7 17059_2_42         TW6        1830      2008      Taiwan       Human
##        source   accession
##   <character> <character>
## 1       Feces   ERS672247
## 2       Blood   ERS672259
## 3     Prepuce   ERS739235
## 4       Fetus   ERS739242
## 5          VM   ERS739246
## 6       Blood   ERS686652
## 7       Blood   ERS739261