Subseting

One fundamental operation when working with pangenomic data is to subset it in order to better analyze it. pagoo provides three convenient ways that can help in different scenarios. First we will see predefined subsets, then we will understand how to use the classic R’s [ operator with pagoo objects, and finally we will see how to temporarily remove one or more organisms from the dataset, and then to recover them.

Start the tutorial by loading the included Campylobacter pangenome.

library(pagoo) # Load package
toy_rds <- system.file('extdata', 'campylobacter.RDS', package = 'pagoo')
p <- load_pangenomeRDS(toy_rds)

Predefined Subsets

Predefined subsets were already superficially seen in 3 - Querying Data tutorial. Here we will review them and understand its notation.

-	`$core_*`	`$shell_*`	`$cloud_*`
*`$_genes`**	`$core_genes`	`$shell_genes`	`$cloud_genes`
*`$_clusters`**	`$core_clusters`	`$shell_clusters`	`$cloud_clusters`
*`$_sequences`**	`$core_sequences`	`$shell_sequences`	`$cloud_sequences`

As seen in the above table, the notation is quite straightforward. Lets take clusters for instance to illustrate it better.

p$clusters

## DataFrame with 2588 rows and 2 columns
##        cluster              Pfam_Arch
##       <factor>            <character>
## 1    group0001           2-Hacid_dh_C
## 2    group0002 2-Hacid_dh_C;2-Hacid..
## 3    group0003 2-Hacid_dh_C;ACT;2-H..
## 4    group0004        2Fe-2S_thioredx
## 5    group0005    4HB_MCP_1;MCPsignal
## ...        ...                    ...
## 2584 group2584              zf-RING_7
## 2585 group2585               zf-TFIIB
## 2586 group2586                   ZinT
## 2587 group2587                   ZnuA
## 2588 group2588               ZT_dimer

p$core_clusters

## DataFrame with 1627 rows and 2 columns
##        cluster              Pfam_Arch
##       <factor>            <character>
## 1    group0001           2-Hacid_dh_C
## 2    group0002 2-Hacid_dh_C;2-Hacid..
## 3    group0003 2-Hacid_dh_C;ACT;2-H..
## 4    group0004        2Fe-2S_thioredx
## 5    group0006    4HB_MCP_1;MCPsignal
## ...        ...                    ...
## 1623 group2584              zf-RING_7
## 1624 group2585               zf-TFIIB
## 1625 group2586                   ZinT
## 1626 group2587                   ZnuA
## 1627 group2588               ZT_dimer

p$shell_clusters

## DataFrame with 413 rows and 2 columns
##       cluster            Pfam_Arch
##      <factor>          <character>
## 1   group0005  4HB_MCP_1;MCPsignal
## 2   group0009        5_nucleotid_C
## 3   group0015                  AAA
## 4   group0019               AAA_22
## 5   group0021               AAA_22
## ...       ...                  ...
## 409 group2537 VirB3;CagE_TrbE_VirB
## 410 group2541                VirB8
## 411 group2543                VirB8
## 412 group2554               Y1_Tnp
## 413 group2557           YafQ_toxin

p$cloud_clusters

## DataFrame with 548 rows and 2 columns
##       cluster   Pfam_Arch
##      <factor> <character>
## 1   group0018      AAA_22
## 2   group0020      AAA_22
## 3   group0024      AAA_25
## 4   group0025      AAA_25
## 5   group0028      AAA_28
## ...       ...         ...
## 544 group2555  YafQ_toxin
## 545 group2566      YhcG_C
## 546 group2573        YopX
## 547 group2578  Zeta_toxin
## 548 group2579  Zeta_toxin

The `[` operator

Using [ for subset vectors, lists, or matrices in R is one of the most common operations we, as R users, do every day. Methods using [ for subsetting can be divided into 2 types:

Subsetting vector or list like objects.
Subsetting matrix-like objects.

In the following subsections we will see how to use them to subset $pan_matrix, $genes, $clusters, $sequences, or$organisms fields.

Vector/List notation

When we subset a vector or list in R we use a single number or a vector of numbers, lets say from i to j, to pick those elements by indexes: x[i:j]. In pagoo we implement a method for this generic function to subset data fields, where indexes represent clusters. So instead of subsetting directly the final object (which you can of course) you apply this method directly to the object, and then select any of the data fields.

# From clusters 1 to 3
p[c(1, 5, 18)]$pan_matrix

##            group0001 group0005 group0018
## 16244_6_6          1         1         0
## 16244_6_18         1         1         0
## 17059_2_16         1         1         0
## 17059_2_23         1         0         0
## 17059_2_27         1         1         0
## 17150_1_73         1         1         0
## 17059_2_42         1         0         1

p[c(1, 5, 18)]$genes

## SplitDataFrameList of length 3
## $group0001
## DataFrame with 7 rows and 10 columns
##     cluster        org             gene                    gid    geneName
##    <factor>   <factor>         <factor>            <character> <character>
## 1 group0001 16244_6_6  16244_6_6_00150  16244_6_6__16244_6_6..        ilvC
## 2 group0001 16244_6_18 16244_6_18_00172 16244_6_18__16244_6_..        ilvC
## 3 group0001 17059_2_16 17059_2_16_01012 17059_2_16__17059_2_..        ilvC
## 4 group0001 17059_2_23 17059_2_23_01040 17059_2_23__17059_2_..        ilvC
## 5 group0001 17059_2_27 17059_2_27_00492 17059_2_27__17059_2_..        ilvC
## 6 group0001 17150_1_73 17150_1_73_00221 17150_1_73__17150_1_..        ilvC
## 7 group0001 17059_2_42 17059_2_42_00176 17059_2_42__17059_2_..        ilvC
##                  product                 contig      from        to      strand
##              <character>            <character> <integer> <integer> <character>
## 1 ketol-acid reductois.. ERS672247|SC|contig0..    137333    138355           +
## 2 ketol-acid reductois.. ERS672259|SC|contig0..    173541    174563           +
## 3 ketol-acid reductois.. ERS739235|SC|contig0..     43645     44667           -
## 4 ketol-acid reductois.. ERS739242|SC|contig0..    173578    174600           +
## 5 ketol-acid reductois.. ERS739246|SC|contig0..    184035    185057           -
## 6 ketol-acid reductois.. ERS686652|SC|contig0..    207206    208228           +
## 7 ketol-acid reductois.. ERS739261|SC|contig0..    173569    174591           +
## 
## $group0005
## DataFrame with 5 rows and 10 columns
##     cluster        org             gene                    gid    geneName
##    <factor>   <factor>         <factor>            <character> <character>
## 1 group0005 16244_6_6  16244_6_6_00049  16244_6_6__16244_6_6..      mcpB_1
## 2 group0005 16244_6_18 16244_6_18_00072 16244_6_18__16244_6_..      mcpB_1
## 3 group0005 17059_2_16 17059_2_16_01270 17059_2_16__17059_2_..      mcpB_3
## 4 group0005 17059_2_27 17059_2_27_00592 17059_2_27__17059_2_..      mcpB_1
## 5 group0005 17150_1_73 17150_1_73_00073 17150_1_73__17150_1_..      mcpB_1
##                  product                 contig      from        to      strand
##              <character>            <character> <integer> <integer> <character>
## 1 methyl-accepting che.. ERS672247|SC|contig0..     46407     48068           +
## 2 methyl-accepting che.. ERS672259|SC|contig0..     82617     84278           +
## 3 methyl-accepting che.. ERS739235|SC|contig0..     82775     84436           +
## 4 methyl-accepting che.. ERS739246|SC|contig0..    274321    275982           -
## 5 methyl-accepting che.. ERS686652|SC|contig0..     82648     84309           +
## 
## $group0018
## DataFrame with 1 row and 10 columns
##     cluster        org             gene                    gid    geneName
##    <factor>   <factor>         <factor>            <character> <character>
## 1 group0018 17059_2_42 17059_2_42_00333 17059_2_42__17059_2_..            
##                  product                 contig      from        to      strand
##              <character>            <character> <integer> <integer> <character>
## 1 bacteriocin,putative.. ERS739261|SC|contig0..    303996    304949           -

p[c(1, 5, 18)]$clusters

## DataFrame with 3 rows and 2 columns
##     cluster           Pfam_Arch
##    <factor>         <character>
## 1 group0001        2-Hacid_dh_C
## 2 group0005 4HB_MCP_1;MCPsignal
## 3 group0018              AAA_22

p[c(1, 5, 18)]$sequences

## DNAStringSetList of length 3
## [["group0001"]] 16244_6_18__16244_6_18_00172=ATGGCGATAACAGTTTATTACGACAAAGATTG...
## [["group0005"]] 16244_6_18__16244_6_18_00072=ATGTCAAATTTAACTACTAATTTAACTACCAA...
## [["group0018"]] 17059_2_42__17059_2_42_00333=ATGAGTAAATTTGAAGATATTAGGAATGAGCT...

p[c(1, 5, 18)]$organisms

## DataFrame with 7 rows and 8 columns
##          org          id      strain      year     country        host
##     <factor> <character> <character> <integer> <character> <character>
## 1 16244_6_6         FR15   2008/170h      2008      France       Human
## 2 16244_6_18        FR27   2012/185h      2012      France       Human
## 3 17059_2_16         AR1      99/801      1999   Argentina      Bovine
## 4 17059_2_23         AR8      04/875      2004   Argentina      Bovine
## 5 17059_2_27        AR12      06/195      2006   Argentina      Bovine
## 6 17150_1_73         CA1   001A-0374      2005      Canada       Human
## 7 17059_2_42         TW6        1830      2008      Taiwan       Human
##        source   accession
##   <character> <character>
## 1       Feces   ERS672247
## 2       Blood   ERS672259
## 3     Prepuce   ERS739235
## 4       Fetus   ERS739242
## 5          VM   ERS739246
## 6       Blood   ERS686652
## 7       Blood   ERS739261

Note that in each case, indexes can be interpreted as clusters. In pan_matrix they refer to columns, in genes and sequences to elements in a list, in clusters rows in the dataframe, and in the case of $organisms it returns the organisms (rows) where those clusters are present. See what happen for this last case when we query for organisms present for a shell cluster:

shell_clust <- p$shell_clusters$cluster[1] # [1] "OG0005"
p[shell_clust]$organisms

## DataFrame with 5 rows and 8 columns
##          org          id      strain      year     country        host
##     <factor> <character> <character> <integer> <character> <character>
## 1 16244_6_6         FR15   2008/170h      2008      France       Human
## 2 16244_6_18        FR27   2012/185h      2012      France       Human
## 3 17059_2_16         AR1      99/801      1999   Argentina      Bovine
## 5 17059_2_27        AR12      06/195      2006   Argentina      Bovine
## 6 17150_1_73         CA1   001A-0374      2005      Canada       Human
##        source   accession
##   <character> <character>
## 1       Feces   ERS672247
## 2       Blood   ERS672259
## 3     Prepuce   ERS739235
## 5          VM   ERS739246
## 6       Blood   ERS686652

Only (6) organisms which contain that shell cluster will be listed.

Matrix notation

The other use of [ notation is when we subset a matrix-like object. In this case we provide 2 sets of indexes: the first for rows, and the second for columns, separated by a coma. pagoo interprets these indexes as organisms (rows) and clusters (columns), so you are referencing a set of genes (cell value). Basically you are referencing cells in the pan_matrix.

p$pan_matrix[1:3, c(1, 5, 18)]

##            group0001 group0005 group0018
## 16244_6_6          1         1         0
## 16244_6_18         1         1         0
## 17059_2_16         1         1         0

Here we are selecting the first 3 organisms, and the clusters indexed as 1, 5, and 18. Note that we see (sum columns) 3 genes in the first cluster, 3 in the second, and 0 in the third. Now we flip the notation and subset directly the pagoo object, and then ask for a data field:

p[1:3, c(1, 5, 18)]$pan_matrix # The same as above

##            group0001 group0005 group0018
## 16244_6_6          1         1         0
## 16244_6_18         1         1         0
## 17059_2_16         1         1         0

p[1:3, c(1, 5, 18)]$genes

## SplitDataFrameList of length 2
## $group0001
## DataFrame with 3 rows and 10 columns
##     cluster        org             gene                    gid    geneName
##    <factor>   <factor>         <factor>            <character> <character>
## 1 group0001 16244_6_6  16244_6_6_00150  16244_6_6__16244_6_6..        ilvC
## 2 group0001 16244_6_18 16244_6_18_00172 16244_6_18__16244_6_..        ilvC
## 3 group0001 17059_2_16 17059_2_16_01012 17059_2_16__17059_2_..        ilvC
##                  product                 contig      from        to      strand
##              <character>            <character> <integer> <integer> <character>
## 1 ketol-acid reductois.. ERS672247|SC|contig0..    137333    138355           +
## 2 ketol-acid reductois.. ERS672259|SC|contig0..    173541    174563           +
## 3 ketol-acid reductois.. ERS739235|SC|contig0..     43645     44667           -
## 
## $group0005
## DataFrame with 3 rows and 10 columns
##     cluster        org             gene                    gid    geneName
##    <factor>   <factor>         <factor>            <character> <character>
## 1 group0005 16244_6_6  16244_6_6_00049  16244_6_6__16244_6_6..      mcpB_1
## 2 group0005 16244_6_18 16244_6_18_00072 16244_6_18__16244_6_..      mcpB_1
## 3 group0005 17059_2_16 17059_2_16_01270 17059_2_16__17059_2_..      mcpB_3
##                  product                 contig      from        to      strand
##              <character>            <character> <integer> <integer> <character>
## 1 methyl-accepting che.. ERS672247|SC|contig0..     46407     48068           +
## 2 methyl-accepting che.. ERS672259|SC|contig0..     82617     84278           +
## 3 methyl-accepting che.. ERS739235|SC|contig0..     82775     84436           +

p[1:3, c(1, 5, 18)]$clusters

## DataFrame with 2 rows and 2 columns
##     cluster           Pfam_Arch
##    <factor>         <character>
## 1 group0001        2-Hacid_dh_C
## 2 group0005 4HB_MCP_1;MCPsignal

p[1:3, c(1, 5, 18)]$sequences

## DNAStringSetList of length 2
## [["group0001"]] 16244_6_18__16244_6_18_00172=ATGGCGATAACAGTTTATTACGACAAAGATTG...
## [["group0005"]] 16244_6_18__16244_6_18_00072=ATGTCAAATTTAACTACTAATTTAACTACCAA...

p[1:3, c(1, 5, 18)]$organisms

## DataFrame with 3 rows and 8 columns
##          org          id      strain      year     country        host
##     <factor> <character> <character> <integer> <character> <character>
## 1 16244_6_6         FR15   2008/170h      2008      France       Human
## 2 16244_6_18        FR27   2012/185h      2012      France       Human
## 3 17059_2_16         AR1      99/801      1999   Argentina      Bovine
##        source   accession
##   <character> <character>
## 1       Feces   ERS672247
## 2       Blood   ERS672259
## 3     Prepuce   ERS739235

So this selection returns a list of length 2, with 3 elements each for $genes and $sequences, a dataframe showing only the selected clusters in the $clusters field, and a dataframe showing only the selected organisms in the $organisms field.

We found this implementation quite useful for both data exploration and fine grained analysis.

Dropping and Recovering Organisms

One useful feature implemented in pagoo is the possibility of easily removing (hiding) organisms from the dataset. This is useful if during the analysis we identify some genome with weird characteristics (i.e. potentially contaminated), or if we want to focus just in a subset of genomes of interest given any metadata value, or if we included an outgroup for phylogenetic purposes but we want to remove it from downstream analyses. Let’s see how it works..

Dropping organisms

p$organisms

## DataFrame with 7 rows and 8 columns
##          org          id      strain      year     country        host
##     <factor> <character> <character> <integer> <character> <character>
## 1 16244_6_6         FR15   2008/170h      2008      France       Human
## 2 16244_6_18        FR27   2012/185h      2012      France       Human
## 3 17059_2_16         AR1      99/801      1999   Argentina      Bovine
## 4 17059_2_23         AR8      04/875      2004   Argentina      Bovine
## 5 17059_2_27        AR12      06/195      2006   Argentina      Bovine
## 6 17150_1_73         CA1   001A-0374      2005      Canada       Human
## 7 17059_2_42         TW6        1830      2008      Taiwan       Human
##        source   accession
##   <character> <character>
## 1       Feces   ERS672247
## 2       Blood   ERS672259
## 3     Prepuce   ERS739235
## 4       Fetus   ERS739242
## 5          VM   ERS739246
## 6       Blood   ERS686652
## 7       Blood   ERS739261

p$summary_stats

## DataFrame with 4 rows and 2 columns
##      Category    Number
##   <character> <integer>
## 1       Total      2588
## 2        Core      1627
## 3       Shell       413
## 4       Cloud       548

We can see that we have 7 organisms, and we have the pangenome summary statistics. Let’s say, for instance, that we want to exclude french isolates (because any reason) from the dataset, which are indexed 1 and 2 (rows) in the above dataframe.

p$drop(1:2) 
p$organisms # Updated organisms !!

## DataFrame with 5 rows and 8 columns
##          org          id      strain      year     country        host
##     <factor> <character> <character> <integer> <character> <character>
## 3 17059_2_16         AR1      99/801      1999   Argentina      Bovine
## 4 17059_2_23         AR8      04/875      2004   Argentina      Bovine
## 5 17059_2_27        AR12      06/195      2006   Argentina      Bovine
## 6 17150_1_73         CA1   001A-0374      2005      Canada       Human
## 7 17059_2_42         TW6        1830      2008      Taiwan       Human
##        source   accession
##   <character> <character>
## 3     Prepuce   ERS739235
## 4       Fetus   ERS739242
## 5          VM   ERS739246
## 6       Blood   ERS686652
## 7       Blood   ERS739261

p$summary_stats # Updated stats !!

## DataFrame with 4 rows and 2 columns
##      Category    Number
##   <character> <integer>
## 1       Total      2504
## 2        Core      1648
## 3       Shell       309
## 4       Cloud       547

As you see, not only the french organisms have been removed from the dataset, but also summary statistics have been updated! This is also true for any other data field: $pan_matrix, $genes, $clusters, $sequences, and$organisms fields are automatically updated. Also any embedded statistical or visualization methods will now only consider available organisms/clusters/genes. When we drop an organisms, we are hiding all features associated to that genomes including genes, sequences, gene clusters and metadata.

It’s important to note that you don’t have to reassign the object to a new one, it is self modified (in place modification). That’s R6 reference semantics (see R6 documentation for details), use with caution.

Recovering dropped organisms

You can recover dropped organisms. To see if we have any hidden organism, we use the $dropped field, and then recover it using its index or its name.

p$dropped

##            1            2 
##  "16244_6_6" "16244_6_18"

p$recover( p$dropped )
p$organisms

## DataFrame with 7 rows and 8 columns
##          org          id      strain      year     country        host
##     <factor> <character> <character> <integer> <character> <character>
## 1 16244_6_6         FR15   2008/170h      2008      France       Human
## 2 16244_6_18        FR27   2012/185h      2012      France       Human
## 3 17059_2_16         AR1      99/801      1999   Argentina      Bovine
## 4 17059_2_23         AR8      04/875      2004   Argentina      Bovine
## 5 17059_2_27        AR12      06/195      2006   Argentina      Bovine
## 6 17150_1_73         CA1   001A-0374      2005      Canada       Human
## 7 17059_2_42         TW6        1830      2008      Taiwan       Human
##        source   accession
##   <character> <character>
## 1       Feces   ERS672247
## 2       Blood   ERS672259
## 3     Prepuce   ERS739235
## 4       Fetus   ERS739242
## 5          VM   ERS739246
## 6       Blood   ERS686652
## 7       Blood   ERS739261

Now french isolates and all its features are again available.