A basic PgR6 class constructor. It contains basic fields
and subset functions to handle a pangenome. End users should use pagoo
instead, since it is easier to understand.
Active bindings
pan_matrix: The panmatrix. Rows are organisms, and columns are groups of orthologs. Cells indicate the presence (>=1) or absence (0) of a given gene in a given organism. Cells can have values greater than 1 if the cluster contains in-paralogs.
organisms: A DataFrame with available organism names, and organism number identifiers as rownames(). (Dropped organisms will not be displayed in this field, see $dropped below.) Additional metadata will be shown as additional columns, if provided.
genes: A SplitDataFrameList object with one entry per cluster. Each element contains a DataFrame with gene ids (<gid>) and additional metadata, if provided. gids are created by paste-ing organism and gene names, so duplicated gene names are avoided.
clusters: A DataFrame with the groups of orthologs (clusters). Each row corresponds to one cluster. Additional metadata will be shown as additional columns, if provided.
core_level: The percentage of organisms a gene must be in to be considered part of the core genome. core_level = 95 by default. It can't be set above 100, and setting it below 85 raises a warning.
core_genes: Like $genes, but only showing core genes.
core_clusters: Like $clusters, but only showing core clusters.
cloud_genes: Like $genes, but only showing cloud genes. These are defined as those clusters which contain a single gene (singletons), plus those which contain more than one gene but whose organisms are probably clonal due to identical general gene content. Colloquially defined as strain-specific genes.
cloud_clusters: Like $clusters, but only showing cloud clusters, as defined above.
shell_genes: Like $genes, but only showing shell genes. These are defined as those clusters that belong neither to the core genome nor to the cloud genome. Colloquially defined as genes that are present in some but not all strains, and that aren't strain-specific.
shell_clusters: Like $clusters, but only showing shell clusters, as defined above.
summary_stats: A DataFrame with information about the number of core, shell, and cloud clusters, as well as the total number of clusters.
random_seed: The last .Random.seed. Used for reproducibility purposes only.
dropped: A character vector with dropped organism names, and organism number identifiers as names().
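The relationship between the panmatrix, core_level, and the core/shell/cloud categories can be sketched in base R. This is a simplified illustration, not pagoo's implementation: the toy table and all variable names are made up, and "cloud" is reduced here to clusters present in a single organism (the clonality check described above is not attempted).

```r
# Toy gene table (hypothetical data): one row per gene.
genes <- data.frame(
  gene    = c("g1", "g2", "g1", "g2", "g3", "g1", "g2"),
  org     = c("orgA", "orgA", "orgB", "orgB", "orgB", "orgC", "orgC"),
  cluster = c("c1", "c2", "c1", "c2", "c3", "c1", "c1"),
  stringsAsFactors = FALSE
)

# Panmatrix: rows are organisms, columns are clusters, cells are gene counts.
# orgC has two genes in c1, so that cell is 2 (in-paralogs).
pan_matrix <- unclass(table(genes$org, genes$cluster))

# Percentage of organisms in which each cluster is present.
core_level   <- 95
n_present    <- colSums(pan_matrix >= 1)
presence_pct <- 100 * n_present / nrow(pan_matrix)

# Simplified classification (assumption: cloud = present in one organism only).
category <- ifelse(presence_pct >= core_level, "core",
                   ifelse(n_present == 1, "cloud", "shell"))

print(pan_matrix)
print(table(category))
```

Lowering core_level in this sketch moves clusters from "shell" to "core", which is the same trade-off the $core_level field controls.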
Methods
Method new()
A basic PgR6 class constructor. It contains basic fields
and subset functions to handle a pangenome.
Usage
PgR6$new(
  data,
  org_meta,
  cluster_meta,
  core_level = 95,
  sep = "__",
  verbose = TRUE,
  DF,
  group_meta
)
Arguments
data: A data.frame or DataFrame containing at least the following columns: gene (gene name), org (name of the organism the gene belongs to), and cluster (group of orthologs the gene belongs to). More columns can be added as metadata for each gene.
org_meta: (optional) A data.frame or DataFrame containing additional metadata for organisms. This data.frame must have a column named "org" with valid organism names (that is, they should match those provided in data, column org), and additional columns will be used as metadata. Each row should correspond to one organism.
cluster_meta: (optional) A data.frame or DataFrame containing additional metadata for clusters. This data.frame must have a column named "cluster" with valid cluster names (that is, they should match those provided in data, column cluster), and additional columns will be used as metadata. Each row should correspond to one cluster.
core_level: The initial core_level (that is, the percentage of organisms a cluster must be in to be considered part of the core genome). Must be a number between 85 and 100 (default: 95). You can change it later through the $core_level field once the object has been created.
sep: A separator. The default is "__" (two underscores). It will be used to create a unique gid (gene identifier) for each gene. gids are created by pasting org to gene, separated by sep.
verbose: logical. Whether to display progress messages when loading the class.
DF: Deprecated. Use data instead.
group_meta: Deprecated. Use cluster_meta instead.
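To make the expected shape of the data argument and the role of sep concrete, here is a minimal base R sketch. The toy table and variable names are hypothetical; only the column names (gene, org, cluster) and the paste-with-sep rule come from the description above.

```r
# Hypothetical input table with the three required columns.
data <- data.frame(
  gene    = c("g1", "g2", "g1", "g2"),
  org     = c("orgA", "orgA", "orgB", "orgB"),
  cluster = c("c1", "c2", "c1", "c2"),
  stringsAsFactors = FALSE
)

# Check that the required columns are present before building an object.
required     <- c("gene", "org", "cluster")
missing_cols <- setdiff(required, colnames(data))
if (length(missing_cols) > 0) {
  stop("Missing required columns: ", paste(missing_cols, collapse = ", "))
}

# Gene names repeat across organisms, but org + sep + gene is unique.
sep <- "__"
gid <- paste(data$org, data$gene, sep = sep)

print(gid)
```

Choosing a sep that does not occur in organism or gene names keeps each gid unambiguous when it is split back into its two parts.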
Method add_metadata()
Add metadata to the object. You can add metadata to each organism, to each
group of orthologs (cluster), or to each gene. Elements with missing data should be filled
with NA (the dimensions of the provided data.frame must be coherent with the object's
data).
Arguments
map: character identifying the metadata to map. Can be one of "org", "cluster", or "gid".
data: data.frame or DataFrame with the metadata to add. In each case, a column named after the value of map (that is, "org", "cluster", or "gid") must exist, containing the identifier of each element. When adding gene (gid) metadata, each gene should be referenced by the name of the organism and the name of the gene as provided in the "data" data.frame, separated by the "sep" argument.
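One way to prepare a metadata table that covers every element and carries NA where information is missing is a left join in base R. This is only a sketch of the preparation step: the organism names, the country and product columns, and the object p in the commented call are hypothetical.

```r
# All organisms in the (hypothetical) pangenome.
orgs <- c("orgA", "orgB", "orgC")

# Metadata is known for only two of them.
known <- data.frame(
  org     = c("orgA", "orgC"),
  country = c("UY", "AR"),
  stringsAsFactors = FALSE
)

# Left join: keep one row per organism, fill missing values with NA.
org_meta <- merge(data.frame(org = orgs, stringsAsFactors = FALSE),
                  known, by = "org", all.x = TRUE)
print(org_meta)

# Gene-level metadata is keyed by gid: organism and gene joined by sep.
gid_meta <- data.frame(
  gid     = paste("orgA", "g1", sep = "__"),
  product = "hypothetical protein",
  stringsAsFactors = FALSE
)

# With an existing object p (assumed), the tables would be passed as:
# p$add_metadata(map = "org", data = org_meta)
# p$add_metadata(map = "gid", data = gid_meta)
```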
Method drop()
Drop an organism from the dataset. This method hides an organism
from the real dataset, so that it is ignored in downstream analyses. All fields and
methods will behave as if it didn't exist. For instance, if you decide to drop
organism 1, the $pan_matrix field (see above) will not show it when
called.
Method recover()
Recover a previously dropped organism (see $drop() above). All fields
and methods will take this organism into account again.
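A short usage sketch of hiding and restoring an organism follows. It assumes the pagoo package is installed and that $drop() and $recover() accept an organism name; their arguments are not listed in this section, so treat the calls as illustrative. The toy table is hypothetical, and the code does nothing if pagoo is unavailable.

```r
if (requireNamespace("pagoo", quietly = TRUE)) {
  # Hypothetical toy input with the three required columns.
  toy <- data.frame(
    gene    = c("g1", "g2", "g1", "g2", "g3", "g1"),
    org     = c("orgA", "orgA", "orgB", "orgB", "orgB", "orgC"),
    cluster = c("c1", "c2", "c1", "c2", "c3", "c1"),
    stringsAsFactors = FALSE
  )
  p <- pagoo::PgR6$new(data = toy, verbose = FALSE)

  n_before <- nrow(p$pan_matrix)   # one row per organism

  p$drop("orgA")                   # assumed: organism referenced by name
  n_hidden <- nrow(p$pan_matrix)   # orgA no longer shown
  print(p$dropped)                 # lists the hidden organism

  p$recover("orgA")
  n_after <- nrow(p$pan_matrix)    # orgA is visible again
}
```

Because dropping only hides the organism, no data is deleted and the object can be returned to its full state at any time.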
Method write_pangenome()
Write the pangenome data as flat tables (text). This is not the recommended way
to save a pangenome, since you can lose information such as numeric precision,
column classes (factor, numeric, integer), and the state of the object itself
(e.g. dropped organisms, or core_level), which compromises reproducibility. Use
$save_pangenomeRDS for a more precise way of saving a pagoo object.
Still, it is useful if you want to work with the data outside R; just keep
the above in mind.
Arguments
dir: The non-existing directory name where to put the data files. Default is "pangenome".
force: logical. Whether to overwrite the directory if it already exists. Default: FALSE.
Returns
A directory with at least 3 files. "data.tsv" contains the basic
pangenome data as provided to the data argument in the
initialization method ($new(...)). "clusters.tsv" contains any metadata
associated with the clusters. "organisms.tsv" contains any metadata associated with
the organisms. The latter 2 files will contain a single column if no metadata
was provided.
Method save_pangenomeRDS()
Save a pagoo pangenome object. This function provides a method for saving a pagoo
object and its state into an "RDS" file. To load the pangenome, use the
load_pangenomeRDS function in this package. It *should* be compatible between
pagoo versions, so you could update pagoo and still recover the same pangenome. Even
sep and core_level are restored, unless the user provides those
arguments in load_pangenomeRDS. Dropped organisms are also kept hidden, as if
you were working with the original object.
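A save-and-restore round trip might look like the sketch below. It assumes pagoo is installed, that $save_pangenomeRDS() takes a file argument, and that load_pangenomeRDS() takes the file path as its first argument; neither signature is shown in this section, so check the package help before relying on them. The toy table is hypothetical.

```r
if (requireNamespace("pagoo", quietly = TRUE)) {
  # Hypothetical toy input with the three required columns.
  toy <- data.frame(
    gene    = c("g1", "g2", "g1", "g2", "g3", "g1"),
    org     = c("orgA", "orgA", "orgB", "orgB", "orgB", "orgC"),
    cluster = c("c1", "c2", "c1", "c2", "c3", "c1"),
    stringsAsFactors = FALSE
  )
  p <- pagoo::PgR6$new(data = toy, verbose = FALSE)

  # Change the object's state so there is something to restore.
  p$core_level <- 100

  # Save to a temporary file (assumed argument name: file).
  rds <- file.path(tempdir(), "pangenome.rds")
  p$save_pangenomeRDS(file = rds)

  # Reload; core_level is expected to come back as saved.
  p2 <- pagoo::load_pangenomeRDS(rds)
  print(p2$core_level)
}
```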