Package 'TDApplied' reference manual

Title:	Machine Learning and Inference for Topological Data Analysis
Description:	Topological data analysis is a powerful tool for finding non-linear global structure in whole datasets. The main tool of topological data analysis is persistent homology, which computes a topological shape descriptor of a dataset called a persistence diagram. 'TDApplied' provides useful and efficient methods for analyzing groups of persistence diagrams with machine learning and statistical inference, and these functions can also interface with other data science packages to form flexible and integrated topological data analysis pipelines.
Authors:	Shael Brown [aut, cre], Dr. Reza Farivar [aut, fnd]
Maintainer:	Shael Brown <[email protected]>
License:	GPL (>= 3)
Version:	3.0.4
Built:	2025-03-27 03:50:56 UTC
Source:	https://github.com/shaelebrown/tdapplied

Analyze the data point memberships of multiple representative (co)cycles.

Description

Multiple distance matrices with corresponding data points can contain the same topological features. Therefore we may wish to compare many representative (co)cycles across distance matrices to decide if their topological features are the same. The 'analyze_representatives' function returns a matrix of binary data point memberships in an input list of representatives across distance matrices. Optionally this matrix can be plotted as a heatmap with columns as data points and rows (i.e. representatives) reordered by similarity, and the contributions (i.e. percentage membership) of each point in the representatives can also be returned. The heatmap has dark red squares representing membership: location [i,j] is dark red if data point j is in representative i.
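As a rough illustration of the binary membership matrix described above, the following base-R sketch builds one from a few made-up representatives. The `reps` list, its labels and the edge-list layout (two columns of data point indices) are assumptions for illustration only and are not output of the package.

```r
# Hypothetical representatives: each is a two-column matrix of edges
# whose entries are data point indices (an assumed layout).
num_points <- 6
reps <- list(
  "D1[1]" = rbind(c(1, 2), c(2, 3), c(1, 3)),
  "D2[1]" = rbind(c(1, 2), c(2, 3), c(3, 1)),
  "D2[2]" = rbind(c(4, 5), c(5, 6), c(4, 6))
)

# One row per representative and one column per data point;
# entry [i, j] is 1 if data point j appears in representative i.
memberships <- t(sapply(reps, function(r) {
  as.integer(seq_len(num_points) %in% as.vector(r))
}))
colnames(memberships) <- seq_len(num_points)

print(memberships)
```

Here the rows for 'D1[1]' and 'D2[1]' are identical, which is the kind of pattern that would suggest a shared topological feature.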

Usage

analyze_representatives(
  diagrams,
  dim,
  num_points,
  plot_heatmap = TRUE,
  return_contributions = FALSE,
  boxed_reps = NULL,
  d = NULL,
  lwd = NULL,
  title = NULL,
  return_clust = FALSE
)

Arguments

`diagrams`	a list of persistence diagrams, either the output of persistent homology calculations like ripsDiag/`calculate_homology`/`PyH`, `diagram_to_df` or `bootstrap_persistence_thresholds`.
`dim`	the integer homological dimension of representatives to consider.
`num_points`	the integer number of data points in all the original datasets (from which the diagrams were calculated).
`plot_heatmap`	a boolean representing if a heatmap of data point membership similarity of the representatives should be plotted, default 'TRUE'. A dendrogram of hierarchical clustering is plotted, and rows (representatives) are sorted according to this clustering.
`return_contributions`	a boolean indicating whether or not to return the membership contributions (i.e. percentages) of the data points (1:'num_points') across all the representatives, default 'FALSE'.
`boxed_reps`	a data frame specifying specific rows of the output heatmap which should have a box drawn around them (for highlighting), default NULL. See the details section for more information.
`d`	either NULL (default) or a "dist" object representing a distance matrix for the representatives, which must have the same number of rows and columns as cycles in the dimension 'dim'.
`lwd`	a positive number giving the line width of the drawn boxes, used only if 'boxed_reps' is not NULL.
`title`	a character string title for the plotted heatmap, default NULL.
`return_clust`	a boolean determining whether or not to return the result of the 'stats::hclust()' call when a heatmap is plotted, default 'FALSE'.

Details

The clustering dendrogram can be used to determine if there are any similar groups of representatives (i.e. shared topological features across datasets) and if so how many. The row labels of the heatmap are of the form 'DX[Y]', meaning the Yth representative of diagram X, and the column labels are the data point numbers. If diagrams are the output of the `bootstrap_persistence_thresholds` function, then the subsetted_representatives (if present) will be analyzed. In that case a row label like 'DX[Y]' refers to the Yth subsetted representative of diagram X. If certain representatives should be highlighted in the heatmap (by drawing a box around their rows), a data frame 'boxed_reps' can be supplied with two integer columns, 'diagram' and 'rep'. For example, if we wish to draw a box for DX[Y] then we add the row (diagram = X, rep = Y) to 'boxed_reps'. If 'd' is supplied then it will be used to cluster the representatives, based on the distances in 'd'.
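A minimal sketch of the 'boxed_reps' data frame described above, assuming we want to highlight the (hypothetical) rows D1[2] and D3[1]. The `boxed_labels` vector is only a helper for illustration and is not a package argument.

```r
# One row per representative to highlight, with integer columns
# 'diagram' and 'rep' as described in the details.
boxed_reps <- data.frame(diagram = c(1L, 3L), rep = c(2L, 1L))

# The heatmap row labels these entries refer to, in 'DX[Y]' form.
boxed_labels <- paste0("D", boxed_reps$diagram, "[", boxed_reps$rep, "]")
print(boxed_labels)
```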

Value

Either a matrix of data point memberships in the representatives, or a list with elements "memberships" (the matrix) and some combination of elements "contributions" (a vector of membership percentages for each data point across representatives) and "clust" (the results of 'stats::hclust()' on the membership matrix).
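To make the returned elements more concrete, the sketch below derives contributions and a clustering from a small made-up membership matrix. Expressing contributions as the proportion of representatives containing each point, and clustering with a binary distance and the default linkage, are assumptions made here for illustration; the package may compute these differently.

```r
# Made-up membership matrix: rows are representatives, columns are data points.
memberships <- rbind(
  "D1[1]" = c(1, 1, 1, 0, 0, 0),
  "D2[1]" = c(1, 1, 1, 0, 0, 0),
  "D2[2]" = c(0, 0, 0, 1, 1, 1)
)
colnames(memberships) <- 1:6

# Assumed definition: share of representatives that contain each data point.
contributions <- colMeans(memberships)

# Hierarchical clustering of the rows (assumed distance and linkage).
clust <- stats::hclust(stats::dist(memberships, method = "binary"))

# Cutting the dendrogram into two groups separates the two distinct features.
groups <- stats::cutree(clust, k = 2)
print(groups)
```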

`X`	the input dataset, must either be a matrix or data frame.
`FUN_diag`	a string representing the persistent homology function to use for calculating the full persistence diagram, either 'calculate_homology' (the default), 'PyH' or 'ripsDiag'.
`FUN_boot`	a string representing the persistent homology function to use for calculating the bootstrapped persistence diagrams, either 'calculate_homology' (the default), 'PyH' or 'ripsDiag'.
`maxdim`	the integer maximum homological dimension for persistent homology, default 0.
`thresh`	the positive numeric maximum radius of the Vietoris-Rips filtration.
`distance_mat`	a boolean representing if 'X' is a distance matrix (TRUE) or not (FALSE, default). dimensions together (TRUE, the default) or if one threshold should be calculated for each dimension separately (FALSE).
`ripser`	the imported ripser module when 'FUN_diag' or 'FUN_boot' is 'PyH'.
`ignore_infinite_cluster`	a boolean indicating whether or not to ignore the infinitely lived cluster when 'FUN_diag' or 'FUN_boot' is 'PyH'.
`calculate_representatives`	a boolean representing whether to calculate representative (co)cycles, default FALSE. Note that representatives cannot be calculated when using the 'calculate_homology' function.
`num_samples`	the positive integer number of bootstrap samples, default 30.
`alpha`	the type-1 error threshold, default 0.05.
`return_subsetted`	a boolean representing whether or not to return the subsetted persistence diagram (with or without representatives), default FALSE.
`return_pvals`	a boolean representing whether or not to return p-values for features in the subsetted diagram, default FALSE.
`return_diag`	a boolean representing whether or not to return the calculated persistence diagram, default TRUE.
`num_workers`	the integer number of cores used for parallelizing (over bootstrap samples), default is one less than the number of cores on the machine.
`p_less_than_alpha`	a boolean representing whether or not to subset further and return only features whose p-values are strictly less than 'alpha', default 'FALSE'. Note that this is not part of the original bootstrap procedure.
`...`	additional parameters for internal methods.
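The 'num_workers' default (one less than the number of cores) recurs throughout this manual. A minimal sketch of how such a value could be computed with the base 'parallel' package is shown below; the package itself may determine the core count differently.

```r
# detectCores() can return NA on some platforms, so fall back to one worker.
n_cores <- parallel::detectCores()
num_workers <- if (is.na(n_cores)) 1L else max(1L, n_cores - 1L)
print(num_workers)
```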

`D1`	the first persistence diagram.
`D2`	the second persistence diagram.
`dim`	the non-negative integer homological dimension in which the distance is to be computed, default 0.
`p`	a number representing the Wasserstein power parameter, at least 1 and default 2.
`distance`	a string which determines which type of distance calculation to carry out, either "wasserstein" (default) or "fisher".
`sigma`	either NULL (default) or a positive number representing the bandwidth for the Fisher information metric.
`rho`	either NULL (default) or a positive number. If NULL then the exact calculation of the Fisher information metric is returned and otherwise a fast approximation, see details.
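To illustrate the role of the power parameter 'p', here is a brute-force sketch of a p-Wasserstein distance between two tiny diagrams, given as two-column (birth, death) matrices. It assumes the infinity-norm ground distance, allows points to be matched to the diagonal, and takes the p-th root of the total cost. The helper names `all_perms` and `wasserstein_brute` are hypothetical; this is not the package's implementation and only scales to a handful of points.

```r
# Enumerate all permutations of a vector (fine for very small inputs only).
all_perms <- function(v) {
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v)) {
    for (rest in all_perms(v[-i])) {
      out[[length(out) + 1]] <- c(v[i], rest)
    }
  }
  out
}

wasserstein_brute <- function(D1, D2, p = 2) {
  n <- nrow(D1)
  m <- nrow(D2)
  N <- n + m
  # Rows: n points of D1 then m diagonal slots.
  # Columns: m points of D2 then n diagonal slots.
  cost <- matrix(0, N, N)
  diag_dist1 <- (D1[, 2] - D1[, 1]) / 2  # infinity-norm distance to diagonal
  diag_dist2 <- (D2[, 2] - D2[, 1]) / 2
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost[i, j] <- max(abs(D1[i, ] - D2[j, ]))^p
    }
    cost[i, m + seq_len(n)] <- diag_dist1[i]^p
  }
  for (j in seq_len(m)) {
    cost[n + seq_len(m), j] <- diag_dist2[j]^p
  }
  # Diagonal-to-diagonal matches cost nothing (entries stay 0).
  best <- Inf
  for (perm in all_perms(seq_len(N))) {
    best <- min(best, sum(cost[cbind(seq_len(N), perm)]))
  }
  best^(1 / p)
}

D1 <- matrix(c(0, 1), ncol = 2)
D2 <- matrix(c(0, 1.5), ncol = 2)
w <- wasserstein_brute(D1, D2, p = 2)
print(w)
```

In this example matching the two points directly costs 0.5^2, which is cheaper than sending both to the diagonal, so the sketch returns 0.5.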

`diagrams`	a list of n>=2 persistence diagrams which are either the output of a persistent homology calculation like ripsDiag/`calculate_homology`/`PyH`, or the `diagram_to_df` function.
`K`	an optional precomputed Gram matrix of persistence diagrams, default NULL.
`centers`	number of clusters to initialize, no more than the number of diagrams although smaller values are recommended.
`dim`	the non-negative integer homological dimension in which the distance is to be computed, default 0.
`t`	a positive number representing the scale for the persistence Fisher kernel, default 1.
`sigma`	a positive number representing the bandwidth for the Fisher information metric, default 1.
`rho`	an optional positive number representing the heuristic for Fisher information metric approximation, see `diagram_distance`. Default NULL. If supplied, Gram matrix calculation is sequential.
`num_workers`	the number of cores used for parallel computation, default is one less than the number of cores on the machine.
`...`	additional parameters for the `kkmeans` kernlab function.

`diagrams`	a list of persistence diagrams which are either the output of a persistent homology calculation like ripsDiag/`calculate_homology`/`PyH`, or `diagram_to_df`.
`K`	an optional precomputed Gram matrix of the persistence diagrams in 'diagrams', default NULL.
`dim`	the non-negative integer homological dimension in which the distance is to be computed, default 0.
`t`	a positive number representing the scale for the persistence Fisher kernel, default 1.
`sigma`	a positive number representing the bandwidth for the Fisher information metric, default 1.
`rho`	an optional positive number representing the heuristic for Fisher information metric approximation, see `diagram_distance`. Default NULL. If supplied, Gram matrix calculation is sequential.
`features`	number of features (principal components) to return, default 1.
`num_workers`	the number of cores used for parallel computation, default is one less than the number of cores on the machine.
`th`	the threshold value under which principal components are ignored (default 0.0001).

`diagrams`	a list of persistence diagrams which are either the output of a persistent homology calculation like ripsDiag/`calculate_homology`/`PyH`, or `diagram_to_df`.
`cv`	a positive number at most the length of 'diagrams' which determines the number of cross validation splits to be performed (default 1, aka no cross-validation). If 'prob.model' is TRUE then cv is set to 1 since kernlab performs 3-fold CV internally in this case. When performing classification, classes are balanced within each cv fold.
`dim`	a non-negative integer vector of homological dimensions in which the model is to be fit.
`t`	either a vector of positive numbers representing the grid of values for the scale of the persistence Fisher kernel or NULL, default 1. If NULL then t is selected automatically, see details.
`sigma`	a vector of positive numbers representing the grid of values for the bandwidth of the Fisher information metric, default 1.
`rho`	an optional positive number representing the heuristic for Fisher information metric approximation, see `diagram_distance`. Default NULL. If supplied, distance matrix calculations are sequential.
`y`	a response vector with one label for each persistence diagram. Must be either numeric or factor, but doesn't need to be supplied when 'type' is "one-svc".
`type`	a string representing the type of task to be performed. Can be any one of "C-svc","nu-svc","one-svc","eps-svr","nu-svr" - default for regression is "eps-svr" and for classification is "C-svc". See `ksvm` for details.
`distance_matrices`	an optional list of precomputed Fisher distance matrices, corresponding to the rows in 'expand.grid(dim = dim,sigma = sigma)', default NULL.
`C`	a number representing the cost of constraints violation (default 1); this is the 'C'-constant of the regularization term in the Lagrange formulation.
`nu`	numeric parameter needed for nu-svc, one-svc and nu-svr. The 'nu' parameter sets the upper bound on the training error and the lower bound on the fraction of data points that become support vectors (default 0.2).
`epsilon`	epsilon in the insensitive-loss function used for eps-svr, nu-svr and eps-bsvm (default 0.1).
`prob.model`	if set to TRUE builds a model for calculating class probabilities or in case of regression, calculates the scaling parameter of the Laplacian distribution fitted on the residuals. Fitting is done on output data created by performing a 3-fold cross-validation on the training data. For details see references (default FALSE).
`class.weights`	a named vector of weights for the different classes, used for asymmetric class sizes. Not all factor levels have to be supplied (default weight: 1). All components have to be named.
`fit`	indicates whether the fitted values should be computed and included in the model or not (default TRUE).
`cache`	cache memory in MB (default 40).
`tol`	tolerance of termination criteria (default 0.001).
`shrinking`	option whether to use the shrinking-heuristics (default TRUE).
`num_workers`	the number of cores used for parallel computation, default is one less than the number of cores on the machine.
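The 'distance_matrices' argument above refers to the rows of 'expand.grid(dim = dim,sigma = sigma)'. The short sketch below, using made-up grid values, shows how that grid is enumerated so that a list of precomputed matrices can be put in the matching order.

```r
# Example grid values (assumptions for illustration).
dims_to_fit <- c(0, 1)
sigmas <- c(0.1, 1)

# expand.grid varies its first argument fastest.
grid <- expand.grid(dim = dims_to_fit, sigma = sigmas)
print(grid)
# A precomputed list would therefore need one matrix per row, in this order:
# (dim 0, sigma 0.1), (dim 1, sigma 0.1), (dim 0, sigma 1), (dim 1, sigma 1).
```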

`diagrams`	a list of n>=2 persistence diagrams which are either the output of a persistent homology calculation like ripsDiag/`calculate_homology`/`PyH`, or `diagram_to_df`. Only one of 'diagrams' and 'D' need to be supplied.
`D`	an optional precomputed distance matrix of persistence diagrams, default NULL. If not NULL then 'diagrams' parameter does not need to be supplied.
`k`	the dimension of the space which the data are to be represented in; must be in {1,2,...,n-1}.
`distance`	a string representing the desired distance metric to be used, either 'wasserstein' (default) or 'fisher'.
`dim`	the non-negative integer homological dimension in which the distance is to be computed, default 0.
`p`	a number representing the Wasserstein power, at least 1 (infinity for the bottleneck distance), default 2.
`sigma`	a positive number representing the bandwidth for the Fisher information metric, default NULL.
`rho`	an optional positive number representing the heuristic for Fisher information metric approximation, see `diagram_distance`. Default NULL. If supplied, distance matrix calculation is sequential.
`eig`	a boolean indicating whether the eigenvalues should be returned.
`add`	a boolean indicating if an additive constant c* should be computed, and added to the non-diagonal dissimilarities such that the modified dissimilarities are Euclidean.
`x.ret`	a boolean indicating whether the doubly centered symmetric distance matrix should be returned.
`list.`	a boolean indicating if a list should be returned or just the n*k matrix.
`num_workers`	the number of cores used for parallel computation, default is one less than the number of cores on the machine.
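The 'k', 'eig', 'add', 'x.ret' and 'list.' arguments above appear to mirror those of 'stats::cmdscale'. As a sketch of what a precomputed distance matrix 'D' leads to, the example below runs classical metric MDS directly on a made-up Euclidean distance matrix (not on persistence diagrams).

```r
# Four points in the plane stand in for four diagrams (an assumption);
# D plays the role of a precomputed distance matrix.
pts <- rbind(c(0, 0), c(1, 0), c(0, 1), c(1, 1))
D <- as.matrix(stats::dist(pts))

# k must be in {1, ..., n - 1}; here n = 4.
emb <- stats::cmdscale(D, k = 2, eig = TRUE, add = FALSE, x.ret = FALSE)

# With eig = TRUE a list is returned; 'points' is the n * k embedding.
print(emb$points)
print(emb$eig)
```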

`diagrams`	a list of persistence diagrams, either the output of persistent homology calculations like ripsDiag/`calculate_homology`/`PyH`, or `diagram_to_df`.
`other_diagrams`	either NULL (default) or another list of persistence diagrams to compute a cross-distance matrix.
`dim`	the non-negative integer homological dimension in which the distance is to be computed, default 0.
`distance`	a character determining which metric to use, either "wasserstein" (default) or "fisher".
`p`	a number representing the Wasserstein power parameter, at least 1 and default 2.
`sigma`	a positive number representing the bandwidth of the Fisher information metric, default NULL.
`rho`	an optional positive number representing the heuristic for Fisher information metric approximation, see `diagram_distance`. Default NULL. If not NULL then matrix is calculated sequentially, but functions in the "exec" directory of the package can be loaded to calculate distance matrices in parallel with approximation.
`num_workers`	the number of cores used for parallel computation, default is one less than the number of cores on the machine.

`X`	the input dataset, must either be a matrix or data frame.
`distance_mat`	whether or not 'X' is a distance matrix, default FALSE.

`diagrams`	a list of persistence diagrams, where each diagram is either the output of a persistent homology calculation like ripsDiag/`calculate_homology`/`PyH`, or `diagram_to_df`.
`other_diagrams`	either NULL (default) or another list of persistence diagrams to compute a cross-Gram matrix.
`dim`	the non-negative integer homological dimension in which the distance is to be computed, default 0.
`sigma`	a positive number representing the bandwidth for the Fisher information metric, default 1.
`t`	a positive number representing the scale for the kernel, default 1.
`rho`	an optional positive number representing the heuristic for Fisher information metric approximation, see `diagram_distance`. Default NULL. If supplied, code execution is sequential, but functions in the "exec" directory of the package can be loaded to calculate distance matrices in parallel with approximation.
`num_workers`	the number of cores used for parallel computation, default is one less than the number of cores on the machine.
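As a sketch of how the scale 't' enters a Gram matrix, the example below applies the persistence Fisher kernel form exp(-t * d) to a made-up matrix of Fisher information metric distances. The distance values and the variable names are assumptions for illustration.

```r
# Made-up symmetric matrix of Fisher information metric distances
# between three diagrams.
fisher_dists <- matrix(c(0,   0.2, 0.5,
                         0.2, 0,   0.4,
                         0.5, 0.4, 0), nrow = 3, byrow = TRUE)

t_scale <- 2  # plays the role of the 't' argument

# Kernel values: identical diagrams (distance 0) get value 1.
K <- exp(-t_scale * fisher_dists)
print(K)
```

Larger values of 't' make the kernel decay faster with distance, so diagrams must be closer to count as similar.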

`g1`	the first group of persistence diagrams, where each diagram was either the output from a persistent homology calculation like ripsDiag/`calculate_homology`/`PyH`, or `diagram_to_df`.
`g2`	the second group of persistence diagrams, where each diagram was either the output from a persistent homology calculation like ripsDiag/`calculate_homology`/`PyH`, or `diagram_to_df`.
`dims`	a non-negative integer vector of the homological dimensions in which the test is to be carried out, default c(0,1).
`sigma`	a positive number representing the bandwidth for the Fisher information metric, default 1.
`rho`	an optional positive number representing the heuristic for Fisher information metric approximation, see `diagram_distance`. Default NULL. If supplied, calculation of Gram matrices is sequential.
`t`	a positive number representing the scale for the persistence Fisher kernel, default 1.
`num_workers`	the number of cores used for parallel computation, default is one less than the number of cores on the machine.
`verbose`	a boolean flag for if the time duration of the function call should be printed, default FALSE.
`Ks`	an optional list of precomputed Gram matrices for the first group of diagrams, with one element for each dimension. If not NULL and 'Ls' is not NULL then 'g1' and 'g2' do not need to be supplied.
`Ls`	an optional list of precomputed Gram matrices for the second group of diagrams, with one element for each dimension. If not NULL and 'Ks' is not NULL then 'g1' and 'g2' do not need to be supplied.

`D1`	the first dataset (a data frame).
`D2`	the second dataset (a data frame).
`iterations`	the number of iterations for permuting group labels, default 20.
`num_samples`	the number of bootstrap iterations, default 30.
`dims`	a non-negative integer vector of the homological dimensions in which the test is to be carried out, default c(0,1).
`samp`	an optional list of row-number samples of 'D1', default NULL. See details and examples for more information. Ignored when 'paired' is FALSE.
`paired`	a boolean flag for if there is a second-order pairing between diagrams at the same index in different groups, default FALSE.
`num_workers`	the number of cores used for parallel computation, default is one less than the number of cores on the machine.
`verbose`	a boolean flag for if the time duration of the function call should be printed, default FALSE.
`FUN_boot`	a string representing the persistent homology function to use for calculating the bootstrapped persistence diagrams, either 'calculate_homology' (the default), 'PyH' or 'ripsDiag'.
`thresh`	the positive numeric maximum radius of the Vietoris-Rips filtration.
`distance_mat`	a boolean representing if 'X' is a distance matrix (TRUE) or not (FALSE, default). dimensions together (TRUE, the default) or if one threshold should be calculated for each dimension separately (FALSE).
`ripser`	the imported ripser module when 'FUN_boot' is 'PyH'.
`return_diagrams`	whether or not to return the two lists of bootstrapped persistence diagrams, default FALSE.

`...`	lists of persistence diagrams which are either the output of persistent homology calculations like ripsDiag/`calculate_homology`/`PyH`, or `diagram_to_df`. Each list must contain at least 2 diagrams.
`iterations`	the number of iterations for permuting group labels, default 20.
`p`	a number representing the Wasserstein power parameter, at least 1 (Inf for the bottleneck distance), default 2.
`q`	a finite number at least 1 for exponentiation in the Turner loss function, default 2.
`dims`	a non-negative integer vector of the homological dimensions in which the test is to be carried out, default c(0,1).
`dist_mats`	an optional list of precomputed distance matrices, one for each dimension, where the rows and columns would correspond to the unlisted groups of diagrams (in order), default NULL. If not NULL then no lists of diagrams need to be supplied.
`group_sizes`	a vector of group sizes, one for each group, when 'dist_mats' is not NULL.
`paired`	a boolean flag for if there is a second-order pairing between diagrams at the same index in different groups, default FALSE.
`distance`	a string which determines which type of distance calculation to carry out, either "wasserstein" (default) or "fisher".
`sigma`	the positive bandwidth for the Fisher information metric, default NULL.
`rho`	an optional positive number representing the heuristic for Fisher information metric approximation, see `diagram_distance`. Default NULL. If supplied, code execution is sequential.
`num_workers`	the number of cores used for parallel computation, default is one less than the number of cores on the machine.
`verbose`	a boolean flag for if the time duration of the function call should be printed, default FALSE.
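When 'dist_mats' is supplied, 'group_sizes' tells which rows and columns belong to which group, since the diagrams are taken in unlisted order. The sketch below, with made-up group sizes, shows one way to recover those row indices; the helper names are hypothetical.

```r
# Two groups with 3 and 2 diagrams respectively (an assumption).
group_sizes <- c(3, 2)

# Group label for each row/column of a precomputed distance matrix.
group_id <- rep(seq_along(group_sizes), times = group_sizes)

# Row (and column) indices belonging to each group, in unlisted order.
group_rows <- split(seq_len(sum(group_sizes)), group_id)
print(group_rows)
```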

`D`	a persistence diagram, either the output of a persistent homology calculation like ripsDiag/`calculate_homology`/`PyH` or of `diagram_to_df`, with maximum dimension at most 12.
`title`	the character string plot title, default NULL.
`max_radius`	the x and y limits of the plot are defined as 'c(0,max_radius)', and the default value of 'max_radius' is the maximum death value in 'D'.
`legend`	a logical indicating whether to include a legend of feature dimensions, default TRUE.
`thresholds`	either a numeric vector with one persistence threshold for each dimension in 'D' or the output of a `bootstrap_persistence_thresholds` function call, default NULL.

`graphs`	the output of a 'vr_graphs' function call.
`eps`	the numeric radius of the graph in 'graphs' to plot.
`cols`	an optional character vector of vertex colors, default 'NULL'.
`layout`	an optional 2D matrix of vertex coordinates, default 'NULL'. If row names are supplied they can be used to subset a graph by those vertex names.
`title`	an optional character string title for the plot, default 'NULL'.
`component_of`	a vertex name (integer or character), only the component of the graph containing that vertex will be plotted (useful for identifying representative (co)cycles in graphs). Default 'NULL' (plot the whole graph).
`plot_isolated_vertices`	a boolean representing whether or not to plot isolated vertices, default 'FALSE'.
`return_layout`	a boolean representing whether or not to return the plotting layout (x-y coordinates of each vertex) and the vertex labels, default 'FALSE'.
`vertex_labels`	a boolean representing whether or not to plot vertex labels, default 'TRUE'.

`new_diagrams`	a list of persistence diagrams which are either the output of a persistent homology calculation like ripsDiag/`calculate_homology`/`PyH`, or `diagram_to_df`. Only one of 'new_diagrams' and 'K' need to be supplied.
`K`	an optional precomputed cross Gram matrix of the new diagrams and the diagrams used in 'clustering', default NULL. If not NULL then 'new_diagrams' does not need to be supplied.
`clustering`	the output of a `diagram_kkmeans` function call, of class 'diagram_kkmeans'.
`num_workers`	the number of cores used for parallel computation, default is one less than the number of cores on the machine.

`new_diagrams`	a list of persistence diagrams which are either the output of a persistent homology calculation like ripsDiag/`calculate_homology`/`PyH`, or `diagram_to_df`. Only one of 'new_diagrams' and 'K' need to be supplied.
`K`	an optional precomputed cross-Gram matrix of the new diagrams and the ones used in 'embedding', default NULL. If not NULL then 'new_diagrams' does not need to be supplied.
`embedding`	the output of a `diagram_kpca` function call, of class 'diagram_kpca'.
`num_workers`	the number of cores used for parallel computation, default is one less than the number of cores on the machine.

`new_diagrams`	a list of persistence diagrams which are either the output of a persistent homology calculation like ripsDiag/`calculate_homology`/`PyH`, or `diagram_to_df`. Only one of 'new_diagrams' and 'K' need to be supplied.
`model`	the output of a `diagram_ksvm` function call, of class 'diagram_ksvm'.
`K`	an optional cross-Gram matrix of the new diagrams and the diagrams in 'model', default NULL. If not NULL then 'new_diagrams' does not need to be supplied.
`num_workers`	the number of cores used for parallel computation, default is one less than the number of cores on the machine.

`X`	either a matrix or data frame, representing either point cloud data or a distance matrix. In either case there must be at least two rows and one column.
`maxdim`	the non-negative integer maximum dimension for persistent homology, default 1.
`thresh`	the non-negative numeric radius threshold for the Vietoris-Rips filtration.
`distance_mat`	a boolean representing whether the input X is a distance matrix or not, default FALSE.
`ripser`	the ripser python module.
`ignore_infinite_cluster`	a boolean representing whether to remove clusters (0 dimensional cycles) which die at the threshold value. Default is TRUE as this is the default for TDAstats homology calculations, but can be set to FALSE which is the default for python ripser.
`calculate_representatives`	a boolean representing whether to return a list of representative cocycles for the topological features found in the persistence diagram, default FALSE.

`X`	either a point cloud data frame/matrix, or a distance matrix.
`distance_mat`	a boolean representing if the input 'X' is a distance matrix, default value is 'FALSE'.
`eps`	a numeric vector of the positive scales at which to compute the Vietoris-Rips complexes, i.e. all edges at most the specified values.
`return_clusters`	a boolean determining if the connected components (i.e. data clusters) of the complex should be explicitly returned, default is 'TRUE'.
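As a sketch of what a graph at a single scale 'eps' contains, the base-R example below keeps every edge whose length is at most 'eps' and then labels the connected components (the data clusters). The point cloud and the label-propagation loop are illustrative assumptions and not the package's implementation.

```r
# Made-up point cloud: a tight triangle and a separate pair.
X <- rbind(c(0, 0), c(1, 0), c(0, 1), c(5, 5), c(6, 5))
eps <- 1.5

# Edges of the graph at scale eps: pairs at distance at most eps.
D <- as.matrix(stats::dist(X))
adj <- D <= eps & row(D) != col(D)

# Connected components: repeatedly give each vertex the smallest
# label among itself and its neighbours until nothing changes.
comp <- seq_len(nrow(X))
repeat {
  new_comp <- sapply(seq_len(nrow(X)), function(i) {
    min(comp[i], comp[adj[i, ]])
  })
  if (identical(new_comp, comp)) break
  comp <- new_comp
}

print(sum(adj) / 2)  # number of edges
print(comp)          # component label of each vertex
```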

Help Index

Analyze the data point memberships of multiple representative (co)cycles.
Estimate persistence threshold(s) for topological features in a data set using bootstrapping.
Make sure that python has been configured correctly for persistent homology calculations.
Verify an imported ripser module.
Calculate distance between a pair of persistence diagrams.
Calculate persistence Fisher kernel value between a pair of persistence diagrams.
Cluster a group of persistence diagrams using kernel k-means.
Calculate the kernel PCA embedding of a group of persistence diagrams.
Fit a support vector machine model where each training set instance is a persistence diagram.
Dimension reduction of a group of persistence diagrams via metric multidimensional scaling.