This is the main function to load a pagoo object. It's safer and more friendly than using pagoo's class constructors (PgR6, PgR6M, and PgR6MS). This function returns either a PgR6M class object, or a PgR6MS class object, depending on the parameters provided. If sequences are provided, it returns the latter. See below for more details.

pagoo(
data,
org_meta,
cluster_meta,
sequences,
core_level = 95,
sep = "__",
verbose = TRUE
)

## Arguments

### Arguments:

• map: character identifying the metadata to map. Can be one of "org", "group", or "gid".

• df: data.frame or DataFrame with the metadata to add. In each case, a column named after the value of "map" must exist, and it should contain an identifier for each element. When adding gene (gid) metadata, each gene should be referenced by the name of the organism and the name of the gene as provided in the "data" data.frame, separated by the "sep" argument.

### Return:

self invisibly, but with additional metadata.
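As a minimal sketch of the gene ("gid") case, the identifiers can be assembled in base R by pasting organism and gene names together with the separator. The toy table, its column names (org, gene, cluster), and the annotation column are hypothetical; only the "gid" column name and the "__" separator come from the text above.

```r
# Toy version of the "data" data.frame (column names are assumptions).
data <- data.frame(
  org     = c("orgA", "orgA", "orgB"),
  gene    = c("gene1", "gene2", "gene1"),
  cluster = c("cl1", "cl2", "cl1"),
  stringsAsFactors = FALSE
)
sep <- "__"

# Gene metadata: the identifier column is named after the value of `map` ("gid").
gene_meta <- data.frame(
  gid        = paste(data$org, data$gene, sep = sep),
  annotation = c("kinase", "transporter", "kinase"),  # made-up metadata
  stringsAsFactors = FALSE
)
print(gene_meta)

# With an existing pagoo object `p` (not created here), the table would be passed as:
# p$add_metadata(map = "gid", df = gene_meta)
```

The same pattern applies to "org" and "group" metadata, with the identifier column renamed accordingly.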

Drop an organism

### Return:

self invisibly, but with x recovered. It isn't necessary to assign the function call to a new object, nor to re-write it, as R6 objects are mutable.

Write a pangenome as flat (text) files.

### Description:

Write the pangenome data as flat tables (text). This is not the recommended way to save a pangenome, since you can lose information such as numeric precision, column classes (factor, numeric, integer), and the state of the object itself (i.e. dropped organisms, or core_level), losing reproducibility. Use save_pangenomeRDS for a more precise way of saving a pagoo object. Still, it is useful if you want to work with the data outside R; just keep the above in mind.

$write_pangenome(dir = "pangenome", force = FALSE)

### Arguments:

• dir: The name of the directory where the data files will be written. It must not already exist. Default is "pangenome".

• force: logical. Whether to overwrite the directory if it already exists. Default: FALSE.

### Return:

A directory with at least 3 files. "data.tsv" contains the basic pangenome data as it is provided to the data argument in the initialization method ($new(...)). "clusters.tsv" contains any metadata associated with the clusters. "organisms.tsv" contains any metadata associated with the organisms. The latter 2 files will contain a single column if no metadata was provided.
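The information loss mentioned in the description can be reproduced with base R alone. The sketch below round-trips a made-up organism metadata table through a tab-separated file and through an RDS file; it uses write.table() and saveRDS() directly and is not pagoo's own writing code. The table and its columns are hypothetical.

```r
# Made-up organism metadata with a factor and a non-terminating decimal.
org_meta <- data.frame(
  org  = c("orgA", "orgB", "orgC"),
  host = factor(c("cow", "human", "cow")),
  gc   = c(1 / 3, 0.5, 2 / 3),
  stringsAsFactors = FALSE
)

# Round trip through a tab-separated text file.
tsv <- tempfile(fileext = ".tsv")
write.table(org_meta, tsv, sep = "\t", quote = FALSE, row.names = FALSE)
from_text <- read.table(tsv, header = TRUE, sep = "\t", stringsAsFactors = FALSE)

# Round trip through an RDS (binary) file.
rds <- tempfile(fileext = ".rds")
saveRDS(org_meta, rds)
from_rds <- readRDS(rds)

print(class(from_text$host))                 # the factor is read back as character
print(identical(from_text$gc, org_meta$gc))  # exact comparison of the doubles
print(identical(from_rds, org_meta))         # same comparison for the RDS copy
```

Comparing the printed results shows which properties survive each format, which is the trade-off to weigh when the data must be read outside R.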

Save a pangenome as an RDS (binary) file.

### Description:

Save a pagoo pangenome object. This function provides a method for saving a pagoo object and its state into an "RDS" file. To load the pangenome, use the load_pangenomeRDS function in this package. It *should* be compatible between pagoo versions, so you could update pagoo and still recover the same pangenome. Even sep and core_level are restored unless the user provides those arguments in load_pangenomeRDS. Dropped organisms are also kept hidden, as if you were still working with the original object.

### Arguments:

• deep: Whether to make a deep clone.

Compute distances

### Description:

Compute the distance between all pairs of genomes. The default dist method is "bray" (Bray-Curtis distance). Another commonly used distance method is "jaccard", but you should set binary = FALSE (see below) to obtain a meaningful result. This is just a wrapper function; see vegdist for details.
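For intuition, the Bray-Curtis dissimilarity between two rows x and y of a count matrix is sum(|x - y|) / sum(x + y). The base-R sketch below computes it by hand on a made-up matrix of gene counts per organism and cluster. It is only an illustration of the formula, not how pagoo or vegdist compute it, and the helper name bray_curtis is hypothetical.

```r
# Made-up matrix: organisms in rows, clusters in columns, gene counts as values.
pm <- matrix(
  c(1, 0, 2,
    1, 1, 0,
    1, 1, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("orgA", "orgB", "orgC"), c("cl1", "cl2", "cl3"))
)

# Hypothetical helper implementing the Bray-Curtis formula for two vectors.
bray_curtis <- function(x, y) sum(abs(x - y)) / sum(x + y)

# Fill a square matrix with all pairwise dissimilarities.
n <- nrow(pm)
d <- matrix(0, n, n, dimnames = list(rownames(pm), rownames(pm)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    d[i, j] <- bray_curtis(pm[i, ], pm[j, ])
  }
}

# Convert to a "dist" object, the usual container for pairwise distances in R.
d <- as.dist(d)
print(d)
```

Here orgB and orgC have identical content, so their dissimilarity is 0, while orgA and orgB differ by 3 counts out of 5 in total.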

### Arguments:

• center: a logical value indicating whether the variables should be shifted to be zero centered. Alternately, a vector of length equal the number of columns of x can be supplied. The value is passed to scale.

• scale.: a logical value indicating whether the variables should be scaled to have unit variance before the analysis takes place. The default is TRUE.

• ...: Other arguments. See prcomp

### Return:

Returns a list with class "prcomp". See prcomp for more information.
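A short sketch of how the center and scale. arguments reach prcomp (from the stats package) is shown below. The matrix is made up and stands in for an organisms-by-clusters matrix; nothing here calls pagoo itself.

```r
# Made-up matrix: organisms in rows, clusters in columns.
pm <- matrix(
  c(1, 0, 1, 0,
    1, 1, 0, 0,
    0, 1, 1, 2,
    2, 0, 0, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("org", 1:4), paste0("cl", 1:4))
)

# Zero-center the columns but leave their variances untouched.
pca <- prcomp(pm, center = TRUE, scale. = FALSE)

print(class(pca))              # the returned list has class "prcomp"
print(pca$x[, 1:2])            # coordinates of each organism on PC1 and PC2
print(summary(pca)$importance) # variance explained by each component
```

One thing to keep in mind when choosing scale. = TRUE: prcomp cannot rescale a column that is constant across all rows to unit variance and stops with an error, so such columns would need to be removed first.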

Fit a Power Law Function for the Pangenome

### Description:

Fits a power law curve for the pangenome rarefaction simulation.

### Arguments:

• raref: (Optional) A rarefaction matrix, as returned by rarefact().

• pcounts: An integer number of pseudo-counts. It is used to better fit the function at small values, because the linearization method requires subtracting a constant C, which is the coregenome size, from y. As y approaches the coregenome size, this difference tends to 0 and its logarithm diverges towards -Inf. By default pcounts = 10.

• ...: Further arguments to be passed to rarefact(). If raref is missing, it will be computed with default arguments, or with the ones provided here.

### Return:

A list of two elements: $formula with a fitted function, and $params with fitted intercept and decay parameters.
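The effect of pcounts can be sketched in base R. The example assumes a decay curve of the form y = A * exp(B * x) + C and fits it by linear regression on log(y - C + pcounts); the curve, the parameter values, and the variable names are all made up for illustration, and this is not pagoo's actual fitting code.

```r
core_size <- 2000   # hypothetical coregenome size (the constant C)
pcounts   <- 10     # pseudo-counts, as in the default described above
n_genomes <- 1:30

# Synthetic curve that decays towards core_size as genomes are added.
y <- core_size + round(1500 * exp(-0.4 * n_genomes))

# Without pseudo-counts, points that already reached core_size give log(0).
log_raw <- log(y - core_size)

# With pseudo-counts, every value stays finite and can enter the linear fit.
log_pc <- log(y - core_size + pcounts)

fit <- lm(log_pc ~ n_genomes)
params <- c(
  intercept = unname(exp(coef(fit)[1])),  # back-transformed intercept
  decay     = unname(coef(fit)[2])        # slope of the linearized curve
)
print(params)
```

The pseudo-counts trade a small bias in the fitted parameters for a regression that does not break once the curve has flattened out.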

Compute Genomic Fluidity

### Description:

Computes the genomic fluidity, which is a measure of population diversity. See fluidity for more details.
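As a rough illustration, the sketch below assumes the usual definition of genomic fluidity: for every pair of genomes, the number of gene families found in only one of the two is divided by the total number of gene families in both, and the ratios are averaged over all pairs. The presence/absence matrix is made up, and this is a plain base-R calculation rather than the package's implementation.

```r
# Made-up presence/absence matrix: organisms in rows, clusters in columns.
pa <- matrix(
  c(1, 1, 1, 0,
    1, 1, 0, 1,
    1, 0, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("orgA", "orgB", "orgC"), paste0("cl", 1:4))
)

# All pairs of organisms, one pair per column.
pairs <- combn(nrow(pa), 2)

# For each pair: (unique to k + unique to l) / (families in k + families in l).
ratios <- apply(pairs, 2, function(ix) {
  k <- pa[ix[1], ] > 0
  l <- pa[ix[2], ] > 0
  (sum(k & !l) + sum(l & !k)) / (sum(k) + sum(l))
})

fluidity <- mean(ratios)
print(fluidity)
```

Values close to 0 indicate genomes that share most of their gene families, while larger values indicate a more diverse population.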

### Return:

A barplot, and a gg object (ggplot2 package) invisibly.

Plot a Distance Heatmap

### Description:

Plot a heatmap showing the computed distance between all pairs of organisms.

### Return:

A binary map (ggplot2::geom_raster()), and a gg object (ggplot2 package) invisibly.

Plot a PCA

### Description:

Plot a scatter plot of a Principal Components Analysis.

$gg_pca(colour = NULL, ...)

### Arguments:

• colour: The name of the column in $organisms field from which points will take color (if provided). NULL (default) renders black points.

• ...: More arguments to be passed to ggplot2::autoplot().

### Return:

A scatter plot (ggplot2::autoplot()), and a gg object (ggplot2 package) invisibly.

Plot a Pie with Pangenome Categories

### Description:

Plot a pie chart showing the number of clusters of each pangenome category: core, shell, or cloud.

### Arguments:

• what: "pangenome" and/or "coregenome".

• ...: ignored

### Return:

A scatter plot, and a gg object (ggplot2 package) invisibly.

Run a Shiny App

### Description:

Launch an interactive shiny app. It contains a sidebar with controls and switches to interact with the pagoo object. You can drop/recover organisms from the dataset, modify the core_level, visualize statistics, plots, and browse cluster and gene information. In the main body, it contains 2 tabs to switch between summary statistics plots and core genome information on one side, and accessory genome plots and information on the other.

The lower part of each tab contains two tables, side by side. On the "Summary" tab, the left one contains information about core clusters, with one cluster per row. When one of them is selected (click), the one on the right is updated to show information about its genes (if provided), one gene per row. On the "Accessory" tab, a similar configuration is shown, but in this case only accessory clusters/genes are displayed. There is a slider on the sidebar where one can select the accessory frequency range to display.

Give it a try!

Take into account that big pangenomes can slow down the performance of the app. More than 50-70 organisms often lead to a delay in the update of the plots/tables.

### Arguments:

• max_per_org: Maximum number of sequences of each organism to be taken from each cluster.

• fill: logical. Whether to fill the DNAStringSet with empty DNAString objects in cases where core_level is set below 100% and clusters with missing organisms are therefore also considered.

### Return:

A DNAStringSetList with core genes. Order of organisms on each cluster is conserved, so it is easier to concatenate them into a super-gene suitable for phylogenetic inference.