One fundamental operation when working with pangenomic data is to
subset it in order to better analyze it. pagoo
provides
three convenient ways that can help in different scenarios. First we
will see predefined subsets, then we will understand how to use the
classic R’s [
operator with pagoo
objects, and
finally we will see how to temporarily remove one or more organisms from
the dataset, and then to recover them.
Start the tutorial by loading the included Campylobacter pangenome.
library(pagoo) # Load package
toy_rds <- system.file('extdata', 'campylobacter.RDS', package = 'pagoo')
p <- load_pangenomeRDS(toy_rds)
Predefined Subsets
Predefined subsets were already superficially seen in 3 - Querying Data tutorial. Here we will review them and understand its notation.
- | $core_* |
$shell_* |
$cloud_* |
---|---|---|---|
$*_genes |
$core_genes |
$shell_genes |
$cloud_genes |
$*_clusters |
$core_clusters |
$shell_clusters |
$cloud_clusters |
$*_sequences |
$core_sequences |
$shell_sequences |
$cloud_sequences |
As seen in the above table, the notation is quite straightforward.
Lets take clusters
for instance to illustrate it
better.
p$clusters
## DataFrame with 2588 rows and 2 columns
## cluster Pfam_Arch
## <factor> <character>
## 1 group0001 2-Hacid_dh_C
## 2 group0002 2-Hacid_dh_C;2-Hacid..
## 3 group0003 2-Hacid_dh_C;ACT;2-H..
## 4 group0004 2Fe-2S_thioredx
## 5 group0005 4HB_MCP_1;MCPsignal
## ... ... ...
## 2584 group2584 zf-RING_7
## 2585 group2585 zf-TFIIB
## 2586 group2586 ZinT
## 2587 group2587 ZnuA
## 2588 group2588 ZT_dimer
p$core_clusters
## DataFrame with 1627 rows and 2 columns
## cluster Pfam_Arch
## <factor> <character>
## 1 group0001 2-Hacid_dh_C
## 2 group0002 2-Hacid_dh_C;2-Hacid..
## 3 group0003 2-Hacid_dh_C;ACT;2-H..
## 4 group0004 2Fe-2S_thioredx
## 5 group0006 4HB_MCP_1;MCPsignal
## ... ... ...
## 1623 group2584 zf-RING_7
## 1624 group2585 zf-TFIIB
## 1625 group2586 ZinT
## 1626 group2587 ZnuA
## 1627 group2588 ZT_dimer
p$shell_clusters
## DataFrame with 413 rows and 2 columns
## cluster Pfam_Arch
## <factor> <character>
## 1 group0005 4HB_MCP_1;MCPsignal
## 2 group0009 5_nucleotid_C
## 3 group0015 AAA
## 4 group0019 AAA_22
## 5 group0021 AAA_22
## ... ... ...
## 409 group2537 VirB3;CagE_TrbE_VirB
## 410 group2541 VirB8
## 411 group2543 VirB8
## 412 group2554 Y1_Tnp
## 413 group2557 YafQ_toxin
p$cloud_clusters
## DataFrame with 548 rows and 2 columns
## cluster Pfam_Arch
## <factor> <character>
## 1 group0018 AAA_22
## 2 group0020 AAA_22
## 3 group0024 AAA_25
## 4 group0025 AAA_25
## 5 group0028 AAA_28
## ... ... ...
## 544 group2555 YafQ_toxin
## 545 group2566 YhcG_C
## 546 group2573 YopX
## 547 group2578 Zeta_toxin
## 548 group2579 Zeta_toxin
The [
operator
Using [
for subset vectors, lists, or matrices in R is
one of the most common operations we, as R users, do every day. Methods
using [
for subsetting can be divided into 2 types:
- Subsetting vector or list like objects.
- Subsetting matrix-like objects.
In the following subsections we will see how to use them to subset
$pan_matrix
, $genes
, $clusters
,
$sequences
, or$organisms
fields.
Vector/List notation
When we subset a vector or list in R we use a single number or a
vector of numbers, lets say from i
to j
, to
pick those elements by indexes: x[i:j]
. In
pagoo
we implement a method for this generic function to
subset data fields, where indexes represent clusters.
So instead of subsetting directly the final object (which you can of
course) you apply this method directly to the object, and then select
any of the data fields.
# From clusters 1 to 3
p[c(1, 5, 18)]$pan_matrix
## group0001 group0005 group0018
## 16244_6_6 1 1 0
## 16244_6_18 1 1 0
## 17059_2_16 1 1 0
## 17059_2_23 1 0 0
## 17059_2_27 1 1 0
## 17150_1_73 1 1 0
## 17059_2_42 1 0 1
p[c(1, 5, 18)]$genes
## SplitDataFrameList of length 3
## $group0001
## DataFrame with 7 rows and 10 columns
## cluster org gene gid geneName
## <factor> <factor> <factor> <character> <character>
## 1 group0001 16244_6_6 16244_6_6_00150 16244_6_6__16244_6_6.. ilvC
## 2 group0001 16244_6_18 16244_6_18_00172 16244_6_18__16244_6_.. ilvC
## 3 group0001 17059_2_16 17059_2_16_01012 17059_2_16__17059_2_.. ilvC
## 4 group0001 17059_2_23 17059_2_23_01040 17059_2_23__17059_2_.. ilvC
## 5 group0001 17059_2_27 17059_2_27_00492 17059_2_27__17059_2_.. ilvC
## 6 group0001 17150_1_73 17150_1_73_00221 17150_1_73__17150_1_.. ilvC
## 7 group0001 17059_2_42 17059_2_42_00176 17059_2_42__17059_2_.. ilvC
## product contig from to strand
## <character> <character> <integer> <integer> <character>
## 1 ketol-acid reductois.. ERS672247|SC|contig0.. 137333 138355 +
## 2 ketol-acid reductois.. ERS672259|SC|contig0.. 173541 174563 +
## 3 ketol-acid reductois.. ERS739235|SC|contig0.. 43645 44667 -
## 4 ketol-acid reductois.. ERS739242|SC|contig0.. 173578 174600 +
## 5 ketol-acid reductois.. ERS739246|SC|contig0.. 184035 185057 -
## 6 ketol-acid reductois.. ERS686652|SC|contig0.. 207206 208228 +
## 7 ketol-acid reductois.. ERS739261|SC|contig0.. 173569 174591 +
##
## $group0005
## DataFrame with 5 rows and 10 columns
## cluster org gene gid geneName
## <factor> <factor> <factor> <character> <character>
## 1 group0005 16244_6_6 16244_6_6_00049 16244_6_6__16244_6_6.. mcpB_1
## 2 group0005 16244_6_18 16244_6_18_00072 16244_6_18__16244_6_.. mcpB_1
## 3 group0005 17059_2_16 17059_2_16_01270 17059_2_16__17059_2_.. mcpB_3
## 4 group0005 17059_2_27 17059_2_27_00592 17059_2_27__17059_2_.. mcpB_1
## 5 group0005 17150_1_73 17150_1_73_00073 17150_1_73__17150_1_.. mcpB_1
## product contig from to strand
## <character> <character> <integer> <integer> <character>
## 1 methyl-accepting che.. ERS672247|SC|contig0.. 46407 48068 +
## 2 methyl-accepting che.. ERS672259|SC|contig0.. 82617 84278 +
## 3 methyl-accepting che.. ERS739235|SC|contig0.. 82775 84436 +
## 4 methyl-accepting che.. ERS739246|SC|contig0.. 274321 275982 -
## 5 methyl-accepting che.. ERS686652|SC|contig0.. 82648 84309 +
##
## $group0018
## DataFrame with 1 row and 10 columns
## cluster org gene gid geneName
## <factor> <factor> <factor> <character> <character>
## 1 group0018 17059_2_42 17059_2_42_00333 17059_2_42__17059_2_..
## product contig from to strand
## <character> <character> <integer> <integer> <character>
## 1 bacteriocin,putative.. ERS739261|SC|contig0.. 303996 304949 -
p[c(1, 5, 18)]$clusters
## DataFrame with 3 rows and 2 columns
## cluster Pfam_Arch
## <factor> <character>
## 1 group0001 2-Hacid_dh_C
## 2 group0005 4HB_MCP_1;MCPsignal
## 3 group0018 AAA_22
p[c(1, 5, 18)]$sequences
## DNAStringSetList of length 3
## [["group0001"]] 16244_6_18__16244_6_18_00172=ATGGCGATAACAGTTTATTACGACAAAGATTG...
## [["group0005"]] 16244_6_18__16244_6_18_00072=ATGTCAAATTTAACTACTAATTTAACTACCAA...
## [["group0018"]] 17059_2_42__17059_2_42_00333=ATGAGTAAATTTGAAGATATTAGGAATGAGCT...
p[c(1, 5, 18)]$organisms
## DataFrame with 7 rows and 8 columns
## org id strain year country host
## <factor> <character> <character> <integer> <character> <character>
## 1 16244_6_6 FR15 2008/170h 2008 France Human
## 2 16244_6_18 FR27 2012/185h 2012 France Human
## 3 17059_2_16 AR1 99/801 1999 Argentina Bovine
## 4 17059_2_23 AR8 04/875 2004 Argentina Bovine
## 5 17059_2_27 AR12 06/195 2006 Argentina Bovine
## 6 17150_1_73 CA1 001A-0374 2005 Canada Human
## 7 17059_2_42 TW6 1830 2008 Taiwan Human
## source accession
## <character> <character>
## 1 Feces ERS672247
## 2 Blood ERS672259
## 3 Prepuce ERS739235
## 4 Fetus ERS739242
## 5 VM ERS739246
## 6 Blood ERS686652
## 7 Blood ERS739261
Note that in each case, indexes can be interpreted as clusters. In
pan_matrix
they refer to columns, in genes
and
sequences
to elements in a list, in clusters
rows in the dataframe, and in the case of $organisms
it
returns the organisms (rows) where those clusters are present. See what
happen for this last case when we query for organisms present for a
shell cluster:
shell_clust <- p$shell_clusters$cluster[1] # [1] "OG0005"
p[shell_clust]$organisms
## DataFrame with 5 rows and 8 columns
## org id strain year country host
## <factor> <character> <character> <integer> <character> <character>
## 1 16244_6_6 FR15 2008/170h 2008 France Human
## 2 16244_6_18 FR27 2012/185h 2012 France Human
## 3 17059_2_16 AR1 99/801 1999 Argentina Bovine
## 5 17059_2_27 AR12 06/195 2006 Argentina Bovine
## 6 17150_1_73 CA1 001A-0374 2005 Canada Human
## source accession
## <character> <character>
## 1 Feces ERS672247
## 2 Blood ERS672259
## 3 Prepuce ERS739235
## 5 VM ERS739246
## 6 Blood ERS686652
Only (6) organisms which contain that shell cluster will be listed.
Matrix notation
The other use of [
notation is when we subset a
matrix-like object. In this case we provide 2 sets of indexes: the first
for rows, and the second for columns, separated by a coma.
pagoo
interprets these indexes as organisms (rows) and
clusters (columns), so you are referencing a set of genes (cell value).
Basically you are referencing cells in the pan_matrix.
p$pan_matrix[1:3, c(1, 5, 18)]
## group0001 group0005 group0018
## 16244_6_6 1 1 0
## 16244_6_18 1 1 0
## 17059_2_16 1 1 0
Here we are selecting the first 3 organisms, and the clusters indexed
as 1, 5, and 18. Note that we see (sum columns) 3 genes in the first
cluster, 3 in the second, and 0 in the third. Now we flip the notation
and subset directly the pagoo
object, and then ask for a
data field:
p[1:3, c(1, 5, 18)]$pan_matrix # The same as above
## group0001 group0005 group0018
## 16244_6_6 1 1 0
## 16244_6_18 1 1 0
## 17059_2_16 1 1 0
p[1:3, c(1, 5, 18)]$genes
## SplitDataFrameList of length 2
## $group0001
## DataFrame with 3 rows and 10 columns
## cluster org gene gid geneName
## <factor> <factor> <factor> <character> <character>
## 1 group0001 16244_6_6 16244_6_6_00150 16244_6_6__16244_6_6.. ilvC
## 2 group0001 16244_6_18 16244_6_18_00172 16244_6_18__16244_6_.. ilvC
## 3 group0001 17059_2_16 17059_2_16_01012 17059_2_16__17059_2_.. ilvC
## product contig from to strand
## <character> <character> <integer> <integer> <character>
## 1 ketol-acid reductois.. ERS672247|SC|contig0.. 137333 138355 +
## 2 ketol-acid reductois.. ERS672259|SC|contig0.. 173541 174563 +
## 3 ketol-acid reductois.. ERS739235|SC|contig0.. 43645 44667 -
##
## $group0005
## DataFrame with 3 rows and 10 columns
## cluster org gene gid geneName
## <factor> <factor> <factor> <character> <character>
## 1 group0005 16244_6_6 16244_6_6_00049 16244_6_6__16244_6_6.. mcpB_1
## 2 group0005 16244_6_18 16244_6_18_00072 16244_6_18__16244_6_.. mcpB_1
## 3 group0005 17059_2_16 17059_2_16_01270 17059_2_16__17059_2_.. mcpB_3
## product contig from to strand
## <character> <character> <integer> <integer> <character>
## 1 methyl-accepting che.. ERS672247|SC|contig0.. 46407 48068 +
## 2 methyl-accepting che.. ERS672259|SC|contig0.. 82617 84278 +
## 3 methyl-accepting che.. ERS739235|SC|contig0.. 82775 84436 +
p[1:3, c(1, 5, 18)]$clusters
## DataFrame with 2 rows and 2 columns
## cluster Pfam_Arch
## <factor> <character>
## 1 group0001 2-Hacid_dh_C
## 2 group0005 4HB_MCP_1;MCPsignal
p[1:3, c(1, 5, 18)]$sequences
## DNAStringSetList of length 2
## [["group0001"]] 16244_6_18__16244_6_18_00172=ATGGCGATAACAGTTTATTACGACAAAGATTG...
## [["group0005"]] 16244_6_18__16244_6_18_00072=ATGTCAAATTTAACTACTAATTTAACTACCAA...
p[1:3, c(1, 5, 18)]$organisms
## DataFrame with 3 rows and 8 columns
## org id strain year country host
## <factor> <character> <character> <integer> <character> <character>
## 1 16244_6_6 FR15 2008/170h 2008 France Human
## 2 16244_6_18 FR27 2012/185h 2012 France Human
## 3 17059_2_16 AR1 99/801 1999 Argentina Bovine
## source accession
## <character> <character>
## 1 Feces ERS672247
## 2 Blood ERS672259
## 3 Prepuce ERS739235
So this selection returns a list of length 2, with 3 elements each
for $genes
and $sequences
, a dataframe showing
only the selected clusters in the $clusters
field, and a
dataframe showing only the selected organisms in the
$organisms
field.
We found this implementation quite useful for both data exploration and fine grained analysis.
Dropping and Recovering Organisms
One useful feature implemented in pagoo
is the
possibility of easily removing (hiding) organisms from the dataset. This
is useful if during the analysis we identify some genome with weird
characteristics (i.e. potentially contaminated), or if we want to focus
just in a subset of genomes of interest given any metadata value, or if
we included an outgroup for phylogenetic purposes but we want to remove
it from downstream analyses. Let’s see how it works..
Dropping organisms
p$organisms
## DataFrame with 7 rows and 8 columns
## org id strain year country host
## <factor> <character> <character> <integer> <character> <character>
## 1 16244_6_6 FR15 2008/170h 2008 France Human
## 2 16244_6_18 FR27 2012/185h 2012 France Human
## 3 17059_2_16 AR1 99/801 1999 Argentina Bovine
## 4 17059_2_23 AR8 04/875 2004 Argentina Bovine
## 5 17059_2_27 AR12 06/195 2006 Argentina Bovine
## 6 17150_1_73 CA1 001A-0374 2005 Canada Human
## 7 17059_2_42 TW6 1830 2008 Taiwan Human
## source accession
## <character> <character>
## 1 Feces ERS672247
## 2 Blood ERS672259
## 3 Prepuce ERS739235
## 4 Fetus ERS739242
## 5 VM ERS739246
## 6 Blood ERS686652
## 7 Blood ERS739261
p$summary_stats
## DataFrame with 4 rows and 2 columns
## Category Number
## <character> <integer>
## 1 Total 2588
## 2 Core 1627
## 3 Shell 413
## 4 Cloud 548
We can see that we have 7 organisms, and we have the pangenome summary statistics. Let’s say, for instance, that we want to exclude french isolates (because any reason) from the dataset, which are indexed 1 and 2 (rows) in the above dataframe.
p$drop(1:2)
p$organisms # Updated organisms !!
## DataFrame with 5 rows and 8 columns
## org id strain year country host
## <factor> <character> <character> <integer> <character> <character>
## 3 17059_2_16 AR1 99/801 1999 Argentina Bovine
## 4 17059_2_23 AR8 04/875 2004 Argentina Bovine
## 5 17059_2_27 AR12 06/195 2006 Argentina Bovine
## 6 17150_1_73 CA1 001A-0374 2005 Canada Human
## 7 17059_2_42 TW6 1830 2008 Taiwan Human
## source accession
## <character> <character>
## 3 Prepuce ERS739235
## 4 Fetus ERS739242
## 5 VM ERS739246
## 6 Blood ERS686652
## 7 Blood ERS739261
p$summary_stats # Updated stats !!
## DataFrame with 4 rows and 2 columns
## Category Number
## <character> <integer>
## 1 Total 2504
## 2 Core 1648
## 3 Shell 309
## 4 Cloud 547
As you see, not only the french organisms have been removed from the
dataset, but also summary statistics have been updated! This is also
true for any other data field: $pan_matrix
,
$genes
, $clusters
, $sequences
,
and$organisms
fields are automatically updated. Also any
embedded statistical or visualization methods will now only consider
available organisms/clusters/genes. When we drop an organisms, we are
hiding all features associated to that genomes including genes,
sequences, gene clusters and metadata.
It’s important to note that you don’t have to reassign the object to a new one, it is self modified (in place modification). That’s R6 reference semantics (see R6 documentation for details), use with caution.
Recovering dropped organisms
You can recover dropped organisms. To see if we have any hidden
organism, we use the $dropped
field, and then recover it
using its index or its name.
p$dropped
## 1 2
## "16244_6_6" "16244_6_18"
p$recover( p$dropped )
p$organisms
## DataFrame with 7 rows and 8 columns
## org id strain year country host
## <factor> <character> <character> <integer> <character> <character>
## 1 16244_6_6 FR15 2008/170h 2008 France Human
## 2 16244_6_18 FR27 2012/185h 2012 France Human
## 3 17059_2_16 AR1 99/801 1999 Argentina Bovine
## 4 17059_2_23 AR8 04/875 2004 Argentina Bovine
## 5 17059_2_27 AR12 06/195 2006 Argentina Bovine
## 6 17150_1_73 CA1 001A-0374 2005 Canada Human
## 7 17059_2_42 TW6 1830 2008 Taiwan Human
## source accession
## <character> <character>
## 1 Feces ERS672247
## 2 Blood ERS672259
## 3 Prepuce ERS739235
## 4 Fetus ERS739242
## 5 VM ERS739246
## 6 Blood ERS686652
## 7 Blood ERS739261
Now french isolates and all its features are again available.