This is the main function to load a pagoo object. It is safer and friendlier than using pagoo's class constructors (PgR6, PgR6M, and PgR6MS) directly. This function returns either a PgR6M class object or a PgR6MS class object, depending on the parameters provided. If sequences are provided, it returns the latter. See below for more details.
Arguments
- data
A data.frame or DataFrame containing at least the following columns: gene (gene name), org (name of the organism to which the gene belongs), and cluster (group of orthologues to which the gene belongs). More columns can be added as metadata for each gene.
- org_meta
(optional) A data.frame or DataFrame containing additional metadata for organisms. This data.frame must have a column named "org" with valid organism names (that is, they should match those provided in data, column org); additional columns will be used as metadata. Each row should correspond to one organism.
- cluster_meta
(optional) A data.frame or DataFrame containing additional metadata for clusters. This data.frame must have a column named "cluster" with valid cluster names (that is, they should match those provided in data, column cluster); additional columns will be used as metadata. Each row should correspond to one cluster.
- sequences
(optional) Can accept: 1) a named list of named character vectors, where the names of the list are organism names and the names of each character vector are gene names; or 2) a named list of DNAStringSetList objects (same requirements as (1), but with BStringSet names as gene names); or 3) a DNAStringSetList (same requirements as (2), but DNAStringSetList names are organism names). If this parameter is used, then a PgR6MS class object is returned.
- core_level
The initial core_level, that is, the percentage of organisms a cluster must be present in to be considered part of the core genome. Must be a number between 85 and 100 (default: 95). You can change it later through the $core_level field once the object has been created.
- sep
A separator. The default is '__' (two underscores). It is used to create a unique gid (gene identifier) for each gene. gids are created by pasting org to gene, separated by sep.
- verbose
logical. Whether to display progress messages when loading the class.
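To make the roles of sep and core_level concrete, here is a minimal, language-agnostic sketch written in Python. It is not pagoo code: the function names build_gids and core_clusters are hypothetical, and the rule "present in at least core_level percent of organisms, without rounding" is an assumption about how the threshold is applied.

```python
from collections import defaultdict

def build_gids(rows, sep="__"):
    """Build one gene identifier per row by pasting org to gene with sep."""
    return [f"{r['org']}{sep}{r['gene']}" for r in rows]

def core_clusters(rows, core_level=95):
    """Return clusters found in at least core_level percent of organisms.

    Assumption of this sketch: no rounding is applied to the threshold.
    """
    orgs = {r["org"] for r in rows}
    orgs_per_cluster = defaultdict(set)
    for r in rows:
        orgs_per_cluster[r["cluster"]].add(r["org"])
    threshold = core_level / 100 * len(orgs)
    return sorted(c for c, present in orgs_per_cluster.items()
                  if len(present) >= threshold)

# Toy input with the three mandatory columns: gene, org, cluster.
rows = [
    {"gene": "g1", "org": "orgA", "cluster": "c1"},
    {"gene": "g2", "org": "orgA", "cluster": "c2"},
    {"gene": "g1", "org": "orgB", "cluster": "c1"},
    {"gene": "g1", "org": "orgC", "cluster": "c1"},
    {"gene": "g2", "org": "orgC", "cluster": "c2"},
]

gids = build_gids(rows)
core_95 = core_clusters(rows, core_level=95)
```

In this toy dataset gene names repeat across organisms, which is why the organism name is part of the gid.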
Details
This package uses [R6](https://r6.r-lib.org/articles/Introduction.html) classes to provide a unified, comprehensive, and standardized, yet flexible, way to analyze a pangenome. The idea is to have a single object that contains both the data and the basic methods to analyze them, and that lets you manipulate fields, explore the data, and work in harmony with the extensive list of existing R packages for comparative genomics and genetics.
For more information, tutorials, and resources, please visit https://iferres.github.io/pagoo/ .
Methods
Below is a comprehensive description of all the methods provided by the object.
- Add metadata
Description:
Add metadata to the object. You can add metadata to each organism, to each group of orthologues, or to each gene. Elements with missing data should be filled with NA (the dimensions of the provided data.frame must be coherent with the object data).
Arguments:
map: character identifying the metadata to map. Can be one of "org", "group", or "gid".
df: data.frame or DataFrame with the metadata to add. In each case, a column named as "map" must exist, and it should contain identifiers for each element. When adding gene (gid) metadata, each gene should be referenced by the name of the organism and the name of the gene as provided in the "data" data.frame, separated by the "sep" argument.
- Drop an organism
Description:
Drop an organism from the dataset. This method lets you hide an organism from the real dataset so that it is ignored in downstream analyses. All fields and methods will behave as if it did not exist. For instance, if you drop organism 1, the $pan_matrix field (see below) will not show it when called.
- Recover a dropped organism
Description:
Recover an organism previously dropped with $drop() (see above). All fields and methods will take this organism into account again.
- Write a pangenome as flat (text) files.
Description:
Write the pangenome data as flat tables (text). This is not the recommended way to save a pangenome, since you can lose information such as numeric precision, column classes (factor, numeric, integer), and the state of the object itself (i.e. dropped organisms, or core_level), which harms reproducibility. Use save_pangenomeRDS for a more precise way of saving a pagoo object. Still, it is useful if you want to work with the data outside R; just keep the above in mind.
Arguments:
dir: The name of the directory, which must not exist yet, where the data files will be written. Default is "pangenome".
force: logical. Whether to overwrite the directory if it already exists. Default: FALSE.
Return:
A directory with at least 3 files. "data.tsv" contains the basic pangenome data as provided to the data argument in the initialization method ($new(...)). "clusters.tsv" contains any metadata associated with the clusters. "organisms.tsv" contains any metadata associated with the organisms. The latter 2 files will contain a single column if no metadata was provided.
- Save a pangenome as a RDS (binary) file.
Description:
Save a pagoo pangenome object. This function provides a method for saving a pagoo object and its state into an "RDS" file. To load the pangenome, use the load_pangenomeRDS function in this package. It *should* be compatible between pagoo versions, so you could update pagoo and still recover the same pangenome. Even sep and core_level are restored unless the user provides those arguments in load_pangenomeRDS. Dropped organisms are also kept hidden, as if you were working with the original object.
- Clone a pagoo object.
- Compute distances
Description:
Compute the distance between all pairs of genomes. The default dist method is "bray" (Bray-Curtis distance). Another commonly used distance method is "jaccard", but you should set binary = FALSE (see below) to obtain a meaningful result. See vegdist for details; this is just a wrapper function.
Arguments:
method: The distance method to use. See vegdist for available methods, and details for each one.
binary: Transform abundance matrix into a presence/absence matrix before computing distance.
diag: Compute diagonals.
upper: Return only the upper diagonal.
na.rm: Pairwise deletion of missing observations when computing dissimilarities.
...: Other parameters. See vegdist for details.
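As a rough illustration of what a Bray-Curtis distance between two genomes means, and of what the binary transformation does to it, here is a small pure-Python sketch. It is not the vegdist implementation: the function names and the toy abundance vectors are hypothetical, and returning 0.0 for two empty genomes is a choice made only for this sketch.

```python
def to_binary(counts):
    """Turn gene counts per cluster into presence/absence (1/0)."""
    return [1 if c > 0 else 0 for c in counts]

def bray_curtis(a, b, binary=False):
    """Bray-Curtis dissimilarity: sum(|a_i - b_i|) / sum(a_i + b_i)."""
    if binary:
        a, b = to_binary(a), to_binary(b)
    num = sum(abs(x - y) for x, y in zip(a, b))
    den = sum(x + y for x, y in zip(a, b))
    # Sketch-only choice: two empty genomes are treated as identical.
    return num / den if den else 0.0

# Hypothetical rows of a pan matrix: gene counts per cluster, per organism.
org_a = [1, 2, 0, 1]
org_b = [1, 1, 1, 0]

d_abundance = bray_curtis(org_a, org_b)
d_presence = bray_curtis(org_a, org_b, binary=True)
```

With abundances, the paralogue in the second cluster contributes to the distance; after the binary transformation only presence or absence of each cluster matters.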
- Compute a Principal Component Analysis
Arguments:
center: a logical value indicating whether the variables should be shifted to be zero centered. Alternately, a vector of length equal to the number of columns of x can be supplied. The value is passed to scale.
scale.: a logical value indicating whether the variables should be scaled to have unit variance before the analysis takes place. The default is TRUE.
...: Other arguments. See prcomp.
Return:
Returns a list with class "prcomp". See prcomp for more information.
- Fit a Power Law Function for the Pangenome
- Fit an Exponential Decay Function for the Coregenome
Arguments:
raref: (Optional) A rarefaction matrix, as returned by rarefact().
pcounts: An integer number of pseudo-counts. This is used to better fit the function at small numbers, as the linearization method requires subtracting a constant C, the coregenome size, from y. As y approaches the coregenome size, this difference tends to 0 and its logarithm diverges toward negative infinity. By default pcounts=10.
...: Further arguments to be passed to rarefact(). If raref is missing, it will be computed with default arguments, or with the ones provided here.
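The reason pseudo-counts are needed can be seen in a small sketch. It assumes a decay model of the form y = A * exp(B * x) + C and a plain least-squares fit on log(y - C + pcounts); both the model form and the function name fit_exp_decay are assumptions made for illustration, written in Python, and are not taken from the pagoo source.

```python
import math

def fit_exp_decay(x, y, core_size, pcounts=10):
    """Fit log(y - core_size + pcounts) = log(A) + B * x by least squares.

    Returns (A, B). Hypothetical helper, not the package's fitting routine.
    """
    z = [math.log(yi - core_size + pcounts) for yi in y]
    n = len(x)
    mean_x = sum(x) / n
    mean_z = sum(z) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxz = sum((xi - mean_x) * (zi - mean_z) for xi, zi in zip(x, z))
    slope = sxz / sxx
    intercept = mean_z - slope * mean_x
    return math.exp(intercept), slope

# Noise-free toy curve: core genome size decaying toward C = 50.
x = [1, 2, 3, 4, 5, 6]
y = [100 * math.exp(-0.5 * xi) + 50 for xi in x]

# Known C and no pseudo-counts: the linearization is exact.
a_exact, b_exact = fit_exp_decay(x, y, core_size=50, pcounts=0)

# C estimated as the smallest observed value: y - C reaches 0 and log fails.
try:
    fit_exp_decay(x, y, core_size=min(y), pcounts=0)
    log_failed = False
except ValueError:
    log_failed = True

# The same estimate of C becomes usable once pseudo-counts are added.
a_pc, b_pc = fit_exp_decay(x, y, core_size=min(y), pcounts=10)
```

The pseudo-counts keep the logarithm finite, at the cost of biasing the fitted parameters, which is why the value is exposed as an argument.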
- Compute Genomic Fluidity
Description:
Computes the genomic fluidity, which is a measure of population diversity. See
fluidity
for more details.
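For intuition, genomic fluidity is commonly defined as the ratio of gene families unique to either genome over the total gene families of both, averaged over genome pairs. The Python sketch below computes that quantity over all pairs of a toy dataset. Whether the fluidity function referenced above uses every pair or a random sample of pairs is not stated here, so the exhaustive loop is an assumption, and genomic_fluidity is a hypothetical name.

```python
from itertools import combinations

def genomic_fluidity(genomes):
    """Average, over all genome pairs, of unique families / total families.

    genomes maps an organism name to the set of gene families it contains.
    """
    pairs = list(combinations(genomes.values(), 2))
    total = 0.0
    for fam_k, fam_l in pairs:
        unique = len(fam_k - fam_l) + len(fam_l - fam_k)
        total += unique / (len(fam_k) + len(fam_l))
    return total / len(pairs)

# Toy pangenome: three organisms and the clusters each one carries.
genomes = {
    "org1": {"a", "b", "c"},
    "org2": {"b", "c", "d"},
    "org3": {"a", "b", "c"},
}

phi = genomic_fluidity(genomes)
```

A value of 0 means every pair of genomes shares all of its gene families; values closer to 1 indicate a more diverse population.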
- Plot Accessory Frequency Plot
- Plot a Distance Heatmap
Arguments:
method: Distance method. One of "Jaccard" (default), or "Manhattan", see above.
...: More arguments to be passed to distManhattan.
Return:
A heatmap (ggplot2::geom_tile()), and a gg object (ggplot2 package) invisibly.
- Plot a Pangenome Binary Map
Description:
Plot a pangenome binary map representing the presence/absence of each gene within each organism.
Return:
A binary map (ggplot2::geom_raster()), and a gg object (ggplot2 package) invisibly.
- Plot a PCA
Arguments:
colour: The name of the column in the $organisms field from which points will take color (if provided). NULL (default) renders black points.
...: More arguments to be passed to ggplot2::autoplot().
Return:
A scatter plot (ggplot2::autoplot()), and a gg object (ggplot2 package) invisibly.
- Plot a Pie with Pangenome Categories
- Plot Pangenome Curves
- Run a Shiny App
Description:
Launch an interactive shiny app. It contains a sidebar with controls and switches to interact with the pagoo object. You can drop/recover organisms from the dataset, modify the core_level, visualize statistics, plots, and browse cluster and gene information. In the main body, it contains 2 tabs to switch between summary statistics plots and core genome information on one side, and accessory genome plots and information on the other.
The lower part of each tab contains two tables, side by side. On the "Summary" tab, the left one contains information about core clusters, with one cluster per row. When one of them is selected (click), the one on the right is updated to show information about its genes (if provided), one gene per row. On the "Accessory" tab, a similar configuration is shown, but in this case only accessory clusters/genes are displayed. There is a slider on the sidebar where one can select the accessory frequency range to display.
Give it a try!
Take into account that big pangenomes can slow down the performance of the app. More than 50-70 organisms often lead to a delay in the update of the plots/tables.
- Retrieve Core Genes for Phylogeny
Description:
A field for obtaining core gene sequences is available (see below), but when building a phylogeny from these sets it is useful to: 1) be able to extract just one sequence per organism in each cluster, in case paralogues are present, and 2) fill gaps with empty sequences in case the core_level was set below 100%, allowing more genes (some not present in 100% of organisms) to be incorporated into the phylogeny. That is the purpose of this special function.
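The two steps described above can be sketched as follows, in Python and outside of pagoo. The function name one_seq_per_org, its fill argument, and the rule of keeping the first sequence seen for each organism are all assumptions made for illustration; how the real method chooses among paralogues is not specified here.

```python
def one_seq_per_org(cluster, organisms, fill=True):
    """Pick a single sequence per organism for one core cluster.

    cluster is a list of (org, gene, sequence) tuples. When fill is True,
    organisms missing from the cluster get an empty sequence.
    """
    picked = {}
    for org, _gene, seq in cluster:
        # Sketch-only rule: keep the first sequence seen for each organism.
        picked.setdefault(org, seq)
    if fill:
        for org in organisms:
            picked.setdefault(org, "")
    # Return entries in the order given by organisms.
    return {org: picked[org] for org in organisms if org in picked}

organisms = ["orgA", "orgB", "orgC"]

# A cluster with a paralogue in orgA and no member in orgC.
cluster = [
    ("orgA", "g1", "ATG"),
    ("orgA", "g7", "ATGA"),
    ("orgB", "g1", "ATGC"),
]

filled = one_seq_per_org(cluster, organisms, fill=True)
unfilled = one_seq_per_org(cluster, organisms, fill=False)
```

Filling with empty sequences keeps every organism represented in every cluster, so the per-cluster sets can later be aligned and concatenated in a consistent order.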