Topological data analysis is a relatively new area of data science which can compare and contrast data sets via non-linear global structure. The main tool of topological data analysis, persistent homology (Edelsbrunner, Letscher, and Zomorodian 2000; Zomorodian and Carlsson 2005), builds on techniques from the field of algebraic topology to describe shape features present in a data set (stored in a “persistence diagram”). Persistent homology has been used in a number of applications, including
For a broad introduction to the mathematical background and main tools of topological data analysis with applications, see (Chazal and Michel 2017; G. Carlsson and Vejdemo-Johansson 2021).
Traditional data science pipelines in academia and industry are focused on machine learning and statistical inference of structured (tabular) data, being able to answer questions like:
While persistence diagrams have been found to be a useful summary of datasets in many domains, they are not structured data and therefore require special analysis methods. Some papers (for example (Robinson and Turner 2017; Le and Yamada 2018)) have described post-processing pipelines for analyzing persistence diagrams built on distance (Kerber, Morozov, and Nigmetov 2017) and kernel (Le and Yamada 2018) calculations, however these papers are lacking publicly available implementations in R (and Python), and many more data science methods are possible using such calculations (Murphy 2012; Scholkopf, Smola, and Muller 1998; Gretton et al. 2007; Cox and Cox 2008; Dhillon, Guan, and Kulis 2004).
TDApplied is the first R package which provides applied analysis implementations of published methods for analyzing persistence diagrams using machine learning and statistical inference. Its functions contain highly optimized and scalable code (see the package vignette “Benchmarking and Speedups”) and have been tested and validated (see the package vignette “Comparing Distance Calculations”). TDApplied can interface with other data science packages to perform powerful and flexible analyses (see the package vignette “Personalized Analyses with TDApplied”), and an example usage of TDApplied on real data has been demonstrated (see the package vignette “Human Connectome Project Analysis”).
This vignette documents the background of TDApplied functions and the usage of those functions on simulated data, by considering a typical data analysis workflow for topological data analysis:
To start we must load the TDApplied package:
Let’s get started!
PyH
FunctionThe main tool of topological data analysis is called persistent homology (Edelsbrunner, Letscher, and Zomorodian 2000; Zomorodian and Carlsson 2005). Persistent homology starts with either data points and a distance function, or a distance matrix storing distance values between data points. It assumes that these points arise from a dataset with some kind of “shape”. This “shape” has certain features that exist at various scales, but sampling induces noise in these features. Persistent homology aims to describe certain mathematical features of this underlying shape, by forming approximations to the shape at various distance scales. The mathematical features which are tracked include clusters (connected components), loops (ellipses) and voids (spheres), which are examples of cycles (i.e. different types of holes). The homological dimension of these features are 0, 1 and 2, respectively. What is interesting about these particular mathematical features is that they can tell us where our data is not, which is extremely important information which other data analysis methods cannot provide.
The persistent homology algorithm proceeds in the following manner: first, if the input is a dataset and distance metric, then the distance matrix, storing the distance metric value of each pair of points in the dataset, is computed. Next, a parameter ϵ ≥ 0 is grown starting at 0, and at each ϵ value we compute a shape approximation of the dataset Cϵ, called a simplicial complex or in this case a Rips complex. We construct Cϵ by connecting all pairs of points whose distance is at most ϵ. To encode higher-dimensional structure in these approximations, we also add a triangle between any triple of points which are all connected (note that no triangles are formally shaded on the above diagram, even though there are certainly triples of connected points), a tetrahedron between any quadruple of points which are all connected, etc. Note that this process of forming a sequence of skeletal approximations is called a Rips-Vietoris filtration, and other methods exist for forming the approximations.
At any given ϵ value, some topological features will exist in Cϵ. As ϵ grows, the Cϵ’s will contain each other, i.e. if ϵ1 < ϵ2, then every edge (triangle, tetrahedron etc.) in Cϵ1 will also be present in Cϵ2. Each topological feature of interest will be “born” at some ϵbirth value, and “die” at some some ϵdeath value – certainly each feature will die once the whole dataset is connected and has trivial shape structure. Consider the example of a loop – a loop will be “born” when the last connection around the circumference of the loop is connected (at the ϵ value which is the largest distance between consecutive points around the loop), and the loop will “die” when enough connections across the loop fill in its hole. Since the topological features are tracked across multiple scales, we can estimate their (static) location in the data, i.e. finding the points on these structures, by calculating what are called representative cycles.
The output of persistent homology, a persistence diagram, has one 2D point for each topological feature found in the filtration process in each desired homological dimension, where the x-value of the point is the birth ϵ-value and the y-value is the death ϵ-value. Hence every point in a persistence diagram lies above the diagonal – features die after they are born! The difference of a points y and x value, y − x, is called the “persistence” of the corresponding topological feature. Points which have high (large) persistence likely represent real topological features of the dataset, whereas points with low persistence likely represent topological noise.
For a more practical and scalable computation of persistence
diagrams, a method has been introduced called persistence
cohomology (Silva and Vejdemo-Johansson
2011b, 2011a) which calculates the exact same output as
persistent homology (with analogous “representative co-cycles” to
persistent homology’s representative cycles) just much faster.
Persistent cohomology is implemented in the c++ library
ripser (Bauer 2015),
which is wrapped in R in the TDAstats package (Wadhwa et al. 2019) and in Python in the
ripser module. However, it was observed in simulations
that the Python implementation of ripser seemed faster,
even when called in R via the reticulate package (Ushey, Allaire, and Tang 2022) (see the package
vignette “Benchmarking and Speedups”). Therefore, the
TDApplied PyH
function has been
implemented as a wrapper of the Python ripser module
for fast calculations of persistence diagrams.
There are three prerequisites that must be satisfied in order to use
the PyH
function:
reticulate::py_install("ripser")
. Some windows machines
have had issues with recent versions of the ripser
module, but version 0.6.1 has been tested and does work on Windows. So,
Windows users may try
reticulate::py_install("ripser==0.6.1")
.For a sample use of PyH
we will use the following
pre-loaded dataset called “circ” (which is stored as a data frame in
this vignette):
We would then calculate the persistence diagram as follows:
# import the ripser module
ripser <- import_ripser()
# calculate the persistence diagram
diag <- PyH(X = circ,maxdim = 1,thresh = 2,ripser = ripser)
# view last five rows of the diagram
diag[47:51,]
#> dimension birth death
#> 47 0 0.0000000 0.2545522
#> 48 0 0.0000000 0.2813237
#> 49 0 0.0000000 0.2887432
#> 50 0 0.0000000 2.0000000
#> 51 1 0.5579783 1.7385925
In the package vignette “Benchmarking and Speedups”, simulations are
used to demonstrate the practical advantages of using PyH
to calculate persistence diagrams compared to other alternatives.
Note that the installation status of Python for PyH
is
checked using the function reticulate::py_available()
,
which according to several online forums does not always behave as
expected. If error messages occur using TDApplied
functions regarding Python not being installed then we recommend
consulting online resources to ensure that the py_available
function returns TRUE
on your system. Due to the
complicated dependencies required to use the PyH
function,
it is only an optional function in the TDApplied
package and therefore the reticulate package is only suggested in the
TDApplied namespace.
enclosing_radius
FunctionOne of the most important parameters in a persistent (co)homology
calculation is the maximum connectivity radius of the filtration (which
is the maxdim
parameter in the PyH
function).
A small radius threshold may prematurely stop the filtration before we
can identify real large-scale features of our data, while an overly
large threshold could quickly cause the calculations to become
intractable. While there is no method that can determine an optimal
balance between these two extremes, a radius threshold exists, called
the “enclosing radius”, beyond which the topology of the dataset is
trivial (i.e. there are no topological features except for one connected
component). This radius is obtained by calculating, for each data point,
the maximum distance from that point to all other points, and taking the
minimum of these values.
The enclosing radius of a dataset can be computed using
TDApplied’s enclosing_radius
function.
Consider the topology of a dataset sampled from a circle and its
origin:
Without knowing anything about the structure of our new dataset we would not know which distance threshold to use for persistent homology calculations. We would compute the enclosing radius as follows:
This radius is the distance from the origin point to all the points on the circumference - at which point the whole graph is connected and the loop is gone:
We will see in a later section how to construct these types of graphs
using the vr_graphs
and plot_vr_graph
functions.
diagram_to_df
FunctionThe most typical data structure used in R for data science is a data
frame. The output of the PyH
function is a data frame
(unless representatives are calculated, in which case the output is a
list containing a data frame and other information), but the persistence
diagrams calculated from the popular R packages TDA
(B. T. Fasy et al. 2021) and
TDAstats (Wadhwa et al.
2019) are not stored in data frames, making subsequent machine
learning and inference analyses of these diagrams challenging. Since In
order to solve this problem the TDApplied function
diagram_to_df
can convert
TDA/TDAstats persistence diagrams into
data frames:
# convert TDA diagram into data frame
diag1 <- TDA::ripsDiag(circ,maxdimension = 1,maxscale = 2,library = "dionysus")
diag1_df <- diagram_to_df(diag1)
class(diag1_df)
#> [1] "data.frame"
# convert TDAstats diagram into data frame
diag2 <- TDAstats::calculate_homology(circ,dim = 1,threshold = 2)
diag2_df <- diagram_to_df(diag1)
class(diag2_df)
#> [1] "data.frame"
When a persistence diagram is calculated with either
PyH
, ripsDiag
or alphaComplexDiag
and contains representatives, diagram_to_df
only returns
the persistence diagram data frame (i.e. the representatives are
ignored).
diagram_distance
and diagram_kernel
FunctionsEarlier we mentioned that persistence diagrams do not form structured data, and now we will give an intuitive argument for why this is the case. A persistence diagram {(x1, y1), …, (xn, yn)} containing n topological features can be represented in a vector of length 2n, (x1, y1, x2, y2, …, xn, yn). However, we cannot easily combine vectors calculated in this way into a table with a fixed number of feature columns because
Fortunately, we can still apply many machine learning and inference techniques to persistence diagrams provided we can quantify how near (similar) or far (distant) they are from each other, and these calculations are possible with distance and kernel functions.
There are several ways to compute distances between persistence diagrams in the same homological dimension. The most common two are called the 2-wasserstein and bottleneck distances (Kerber, Morozov, and Nigmetov 2017; Edelsbrunner and Harer 2010). These techniques find an optimal matching of the 2D points in their input two diagrams, and compute a cost of that optimal matching. A point from one diagram is allowed either to be paired (matched) with a point in the other diagram or its diagonal projection, i.e. the nearest point on the diagonal line y = x (matching a point to its diagonal projection is essentially saying that feature is likely topological noise because it died very soon after it was born).
Allowing points to be paired with their diagonal projections both allows for matchings of persistence diagrams with different numbers of points (which is almost always the case in practice) and also formalizes the idea that some points in a persistence diagram represent noise. The “cost” value associated with a matching is given by either (i) the maximum of infinity-norm distances between paired points, or (ii) the square-root of the sum of squared infinity-norm between matched points. The cost of the optimal matching under loss (i) is the bottleneck distance of persistence diagrams, and the cost of the optimal matching of cost (ii) is called the 2-wasserstein metric of persistence diagrams. Both distance metrics have been used in a number of applications, but the 2-wasserstein metric is able to find more fine-scale differences in persistence diagrams compared to the bottleneck distance. The problem of finding an optimal matching can be solved with the Hungarian algorithm, which is implemented in the R package clue (Hornik 2005).
We will introduce three new simple persistence diagrams, D1, D2 and D3, for examples in this section (and future ones):
Here is a plot of the optimal matchings between D1 and D2, and between D1 and D3:
In the picture we can see that there is a “better” matching between D1 and D2 compared to D1 and D3, so the (wasserstein/bottleneck) distance value between D1 and D2 would be smaller than that of D1 and D3.
Another distance metric between persistence diagrams, which will be useful for kernel calculations, is called the Fisher information metric, dFIM(D1, D2, σ) (Le and Yamada 2018). The idea is to represent the two persistence diagrams as probability density functions, with a 2D-Gaussian point mass centered at each point in both diagrams (including the diagonal projections of the points in the opposite diagram), all of variance σ2 > 0, and calculate how much those distributions disagree on their pdf value at each point in the plane (called their Fisher information metric).
Points in the rightmost plot which are close to white in color have the most similar pdf values in the two distributions, and would not contribute to a large distance value; however, having more points with a red color would contribute to a larger distance value.
The wasserstein and bottleneck distances have been implemented in the
TDApplied function diagram_distance
. We
can confirm that the distance between D1 and D2 is smaller than D1 and
D3 for both distances:
# calculate 2-wasserstein distance between D1 and D2
diagram_distance(D1,D2,dim = 0,p = 2,distance = "wasserstein")
#> [1] 0.3905125
# calculate 2-wasserstein distance between D1 and D3
diagram_distance(D1,D3,dim = 0,p = 2,distance = "wasserstein")
#> [1] 0.559017
# calculate bottleneck distance between D1 and D2
diagram_distance(D1,D2,dim = 0,p = Inf,distance = "wasserstein")
#> [1] 0.3
# calculate bottleneck distance between D1 and D3
diagram_distance(D1,D3,dim = 0,p = Inf,distance = "wasserstein")
#> [1] 0.5
There is a generalization of the 2-wasserstein distance for any p ≥ 1, the p-wasserstein
distance, which can also be computed using the
diagram_distance
function by varying the parameter
p
, although p = 2
seems to be the most popular parameter choice.
The diagram_distance
function can also calculate the
Fisher information metric between persistence diagrams:
# Fisher information metric calculation between D1 and D2 for sigma = 1
diagram_distance(D1,D2,dim = 0,distance = "fisher",sigma = 1)
#> [1] 0.02354779
# Fisher information metric calculation between D1 and D3 for sigma = 1
diagram_distance(D1,D3,dim = 0,distance = "fisher",sigma = 1)
#> [1] 0.08821907
Again, D1 and D2 are less different than D1 and D3 using the Fisher information metric.
A fast approximation to the Fisher information metric was described
in (Le and Yamada 2018), and C++ code in
the GitHub repository (https://github.com/vmorariu/figtree) was used to
calculate this approximation in Matlab. Using the Rcpp package (Eddelbuettel and Francois 2011) this code is
included in TDApplied and the approximation can be
calculated by providing the positive rho
parameter:
# Fisher information metric calculation between D1 and D2 for sigma = 1
diagram_distance(D1,D2,dim = 0,distance = "fisher",sigma = 1)
#> [1] 0.02354779
# fast approximate Fisher information metric calculation between D1 and D3 for sigma = 1
diagram_distance(D1,D2,dim = 0,distance = "fisher",sigma = 1,rho = 0.001)
#> [1] 0.02354779
Now we will explore calculating similarity of persistence diagrams using kernel functions. A kernel function is a special (positive semi-definite) symmetric similarity measure between objects in some complicated space which can be used to project data into a space suitable for machine learning (Murphy 2012). Some examples of machine learning techniques which can be “kernelized” when dealing with complicated data are k-means (kernel k-means), principal components analysis (kernel PCA), and support vector machines (SVM) which are inherently based on kernel calculations.
There have been, to date, four main kernels proposed for persistence diagrams. In TDApplied the persistence Fisher kernel (Le and Yamada 2018) has been implemented because of its practical advantages over the other kernels – smaller cross-validation SVM error on a number of test data sets and a faster method for cross validation. For information on the other three kernels see (Kusano, Fukumizu, and Hiraoka 2018; Carriere, Cuturi, and Oudot 2017; Reininghaus et al. 2014).
The persistence Fisher kernel is computed directly from the Fisher information metric between two persistence diagrams: let σ > 0 be the parameter for dFIM, and let t > 0. Then the persistence Fisher kernel is defined as kPF(D1, D2) = exp(−t * dFIM(D1, D2, σ)).
Computing the persistence Fisher kernel can be achieved with the
diagram_kernel
function in TDApplied:
# calculate the kernel value between D1 and D2 with sigma = 2, t = 2
diagram_kernel(D1,D2,dim = 0,sigma = 2,t = 2)
#> [1] 0.9872455
# calculate the kernel value between D1 and D3 with sigma = 2, t = 2
diagram_kernel(D1,D3,dim = 0,sigma = 2,t = 2)
#> [1] 0.9707209
As before, D1 and D2 are more similar than D1 and D3, and if desired a fast approximation to the kernel value can be computed.
plot_diagram
Persistence diagrams can contain any number of points representing
different types of topological features from different homological
dimensions. However we can easily view this information in a
two-dimensional plot of the birth and death values of the topological
features, where each homological dimension has a unique point shape and
color, using TDApplied’s plot_diagram
function:
Now that we can visualize persistence diagrams we will describe four tools which can be used for their interpretation – filtering out noisy topological features with the universal null distribution and bootstrapping methods, representative (co)cycles and Vietoris-VR graphs (called VR graphs for short).
universal_null
FunctionNoise in datasets can drastically affect the results of inference and machine learning procedures. Therefore it is desirable to clean data before applying such procedures. Noise in persistence diagrams are features whose birth and death values are very similar. The question of determining which points in a persistence diagrams are “significant” and which are “noise” has been recently addressed via a parametric approach in (Bobrowski and Skraba 2023).
The idea is to compute a test statistic for each feature based on the ratio of death divided by birth and to compare these statistics against the left-skewed Gumbel distribution to compute p-values, and use a Bonferroni correction to ensure that the probability of getting a false positive significant feature is at most α. This is a very strict thresholding procedure, testing each p-value p for a topological feature in dimension i against α/ni where ni is the number of topological features of that dimension (so typical choices for α, like 0.05 or 0.01 etc., may be too strict for some analyses).
We would be tempted to use this procedure to locate the main loop in our circ dataset, but the results are surprising:
# determine significant features
circ_result <- universal_null(circ,
thresh = enclosing_radius(circ))
circ_result$subsetted_diag
#> [1] dimension birth death
#> <0 rows> (or 0-length row.names)
No loop was found! The reason for this is because since circ is a “topologically perfect” dataset, with no noise and therefore only one loop, the singular loop can never be significant without smaller loops to compare it to. The same phenomenon will occur in other datasets like a sampled unit sphere, etc., but does this mean that the procedure does not work?
Certainly not! All real world datasets will contain some measure of noise and therefore will likely have more than one feature in a given homological dimension. Let’s add some noise to our unit_circ dataset and rerun the procedure:
# create circle with noise dataset and plot
circ_with_noise <- circ
x_noise <- stats::rnorm(n = 50,sd = 0.1)
y_noise <- stats::rnorm(n = 50,sd = 0.1)
circ_with_noise$x <- circ_with_noise$x + x_noise
circ_with_noise$y <- circ_with_noise$y + y_noise
plot(circ_with_noise)
# rerun the inference procedure
library(TDA)
noisy_circ_result <- universal_null(circ_with_noise,
FUN_diag = "ripsDiag",
thresh = enclosing_radius(circ_with_noise),
return_pvals = TRUE)
noisy_circ_result$subsetted_diag
#> dimension birth death
#> 54 1 0.7802017 1.542033
noisy_circ_result$pvals
#> [1] 0.0002181983
We have successfully found the significant loop in the noisy dataset
(with p-value less than alpha
). Remember though that this
procedure is very strict - it is much more likely to disregard real
topological features than to retain noise features, so if the
universal_null
function says that your dataset has no
significant features when you are confident there are some you can try
increasing the alpha
parameter to carry out a more relaxed
thresholding.
While the enclosing_radius
function, used in our
examples, may be an appropriate connectivity threshold for most
applications, it may be too computationally intensive for others. When
lower thresholds are used, real topological features may be obscured
because the threshold is much smaller than its (unobserved) death
radius. Thankfully though, the universal_null
procedure can
handle these “infinite cycles”, by:
alpha
parameter to determine the minimum connectivity threshold for persistent
homology where a given infinite cycle feature might be
significant,Picking a very small threshold, resulting in many infinite cycles, will be very computationally demanding requiring the computation of many persistence diagrams, so care should be taken when picking the threshold when performing this infinite cycle inference.
Here’s an example of performing inference on the noisy circ dataset, with and without infinite cycle inference at a lower threshold value:
# inference without infinite cycle inference
res_non_inf_small_thresh <- universal_null(circ_with_noise,
FUN_diag = "ripsDiag",
thresh = 0.9)
res_non_inf_small_thresh$subsetted_diag
#> [1] dimension birth death
#> <0 rows> (or 0-length row.names)
# inference with infinite cycle inference
res_inf_small_thresh <- universal_null(circ_with_noise,
FUN_diag = "ripsDiag",
thresh = 0.9,
infinite_cycle_inference = TRUE)
res_inf_small_thresh$subsetted_diag
#> dimension birth death
#> 1 1 0.7802017 0.9
At this smaller threshold the procedure with infinite cycle inference was able to locate the main loop.
In a later section you will read about “representative (co)cycles”, which are subsets of the datapoints on a topological feature that can be used to help locate the feature in our dataset, but for now just know that if desired this procedure will subset the representatives for only the significant topological features, but infinite cycles will not have representatives.
bootstrap_persistence_thresholds
FunctionAnother approach in the literature for filtering significant topological features of datasets was based on bootstrapping ((B. Fasy et al. 2014)), although this method is far slower than the aforementioned universal null distribution approach because it requires the computation of many persistence diagrams and distances. On the other hand, for topologically simple datasets without noise (like a sampled unit circle) the universal null distribution will often fail to retrieve the real topological features.
The idea of the bootstrap procedure is as follows, where X is the input data set and α is the desired threshold for type 1 error:
These thresholds form a square-shaped “confidence interval” around each point in D. In particular, if t was the threshold found for dimension k then the confidence interval around a point (x, y) ∈ D (of dimension k) is the set of points {(x′, y′) : max(|x − x′|, |y − y′|) < t}. For example, if we calculated the bottleneck-threshold-based confidence interval around the single 1-dimensional point in the circ dataset’s persistence diagram, we would get something like this:
In this setup, “significant” points will be those whose confidence intervals do not touch the diagonal line, i.e. where birth and death is the same. Note that the persistence threshold is twice the bottleneck distance threshold calculated above: the (Euclidean) distance from the point to the bottom right corner of the box is $\sqrt{2}t$, which must be greater than or equal to the (Euclidean) distance of the point to its diagonal projection, which is $\frac{y-x}{\sqrt{2}}$. Therefore, for the point to be considered real (i.e. significant), $\sqrt{2}t \leq \frac{y-x}{\sqrt{2}}$, implying that the persistence of the point, y − x, must be no less than twice the bottleneck threshold t.
Like in (Robinson and Turner 2017) we can calculate the p-value for a feature as $p = \frac{Z + 1}{N + 1}$ where Z is the number of bootstrap iterations which, when doubled, are at least the persistence of the feature, and N is the number of bootstrap samples. In order to ensure that the p-value of any topological feature which survives the thresholding is less than α we transform α by $\alpha \leftarrow \frac{\mbox{max}\{\alpha(N + 1) - 1,0\}}{N + 1}$.
The function bootstrap_persistence_thresholds
can be
used to determine these persistence thresholds. Here is an example for
the circ dataset, and the results can be plotted with the
plot_diagram
function (with overlaid p-values using the
graphics text
function):
# calculate the bootstrapped persistence thresholds using 2 cores
# and 30 iterations. We'll use the distance matrix of circ to
# make representative cycles more comprehensible
library("TDA")
thresh <- bootstrap_persistence_thresholds(X = as.matrix(dist(circ)),
FUN_diag = 'ripsDiag',
FUN_boot = 'ripsDiag',
distance_mat = T,
maxdim = 1,thresh = 2,num_workers = 2,
alpha = 0.05,num_samples = 30,
return_subsetted = T,return_pvals = T,
calculate_representatives = T)
diag <- thresh$diag
# plot original diagram and thresholded diagram side-by-side, including
# p-values. These p-values are the smallest possible (1/31) when there
# are 30 bootstrap iterations
par(mfrow = c(1,2))
plot_diagram(diag,title = "Circ diagram")
plot_diagram(diag,title = "Circ diagram with thresholds",
thresholds = thresh$thresholds)
text(x = c(0.2,0.5),y = c(2,1.8),
paste("p = ",round(thresh$pvals,digits = 3)),
cex = 0.5)
To see why we needed to load the TDA package prior
to the function call please see the
bootstrap_persistence_thresholds
function documentation.
The bootstrap procedure can be done in parallel or sequentially,
depending on which function is specified to calculate the persistence
diagrams. There are three possible such functions –
TDAstats’ calculate_homology
,
TDA’s ripsDiag
and
TDApplied’s PyH
. The functions
calculate_homology
and ripsDiag
can be run in
parallel. However, PyH
is the fastest function followed by
calculate_homology
based on our simulations. Both
ripsDiag
and PyH
allow for the calculation of
representative (co)cycles (i.e. the approximate location in the data of
each topological feature), whereas calculate_homology
does
not. Therefore, our recommendation is to pick the function according to
the following criteria: if a user can use the PyH
function,
then it should be used in all cases except for when the input data set
is small, the machine has many available cores and the number of desired
bootstrap iterations is large. Otherwise, use
calculate_homology
for speed or ripsDiag
for
calculating representatives.
One of the advantages of the R package TDA over
TDAstats is its ability to calculate representative
cycles in the data, i.e. locating the persistence diagram topological
features in the input data set. Having access to representative cycles
can permit deep analyses of the original data set by finding particular
types of features spanned by certain subsets of data points. For
example, a representative cycle of a 1-dimensional topological feature
would be a set of edges between data points which lie along that feature
(a loop). The PyH
function can also find representative
cocycles (i.e. analogues of representative cycles for persistent
cohomology) in its input data, which are returned if the
calculate_representatives
parameter is set to
TRUE
. In that case, the PyH
function returns a
list with a data frame called “diagram”, containing the persistence
diagram, and a list called “representatives”. The “representatives” list
has one element for each homological dimension in the persistence
diagram, with one matrix/array for each point in the persistence diagram
of that dimension (except for dimension 0, where the list is always
empty). The matrix for a d-dimensional feature (1 for loops,
2 for voids, etc.) has d + 1
columns, where row i contains
the row indices in the data set of the data points in the i-th substructure in the
representative (a substructure of a loop would be an edge, a
substructure of a void would be a triangle, etc.). Here is an example
where we calculate the representative cocycles of our circ dataset:
# ripser has already been imported, so calculate diagram with representatives
diag_rep <- PyH(circ,maxdim = 1,thresh = 2,ripser = ripser,calculate_representatives = T)
# identify the loops in the diagram
diag_rep$diagram[which(diag_rep$diagram$dimension == 1),]
#> dimension birth death
#> 50 1 0.5579783 1.738593
# show the representative for the loop, just the first five rows
diag_rep$representatives[[2]][[1]][1:5,]
#> [,1] [,2]
#> [1,] 50 42
#> [2,] 46 42
#> [3,] 50 4
#> [4,] 42 29
#> [5,] 42 16
The representative of the one loop contains the edges found to be present in the loop. We could iterate over the representative for the loop to find all the data points in that representative:
#> [1] 50 46 42 29 16 7 48 4 11 25 40 37 22 43 15 20 34 32 27 8 44 36 39 5 1
#> [26] 21 24
Since the circ dataset is two-dimensional, we could actually plot the loop representative according to the following process:
Since the death radius of the main cocycle is 1.738894, we can use the following code to plot the cocycle at thresholds value 1.7:
plot(x = circ$x,y = circ$y,xlab = "x",ylab = "y",main = "circ with representative")
for(i in 1:nrow(circ))
{
for(j in 1:nrow(circ))
{
pt1 <- circ[i,]
pt2 <- circ[j,]
if(sqrt((pt1[[1]] - pt2[[1]])^2 + (pt1[[2]] - pt2[[2]])^2) <= 1.7)
{
graphics::lines(x = c(pt1[[1]],pt2[[1]]),y = c(pt1[[2]],pt2[[2]]))
}
}
}
for(i in 1:nrow(diag_rep$representatives[[2]][[1]]))
{
pt1 <- circ[diag_rep$representatives[[2]][[1]][i,1],]
pt2 <- circ[diag_rep$representatives[[2]][[1]][i,2],]
if(sqrt((pt1[[1]] - pt2[[1]])^2 + (pt1[[2]] - pt2[[2]])^2) <= 1.7)
{
graphics::lines(x = c(pt1[[1]],pt2[[1]]),y = c(pt1[[2]],pt2[[2]]),col = "red")
}
}
This plot shows the main loop that was found via persistent
cohomology, and the representative is a set of edges (in this 2D case)
whose removal would destroy the loop. A more intuitive notion of a
“representative loop” can be found with persistent homology, for
instance using the TDA ripsDiag
function
with the location
parameter set to TRUE. Another example of
the representative cocycle can be found using another threshold value,
for instance the (rounded up) birth value of 0.6009:
Since we know that the data points in the representatives help form a loop in the original data set, we could perform a further exploratory analysis in circ to explore the periodic nature of the feature. While in this example we know that a loop will be present, this type of analysis could help find hidden latent structure in data sets.
vr_graphs
and plot_vr_graph
FunctionsIntegral to the plots in the previous section were that the circ dataset is 2D, but most datasets are not 2D. In order to investigate topological features of high-dimensional datasets we can use the Rips complexes which are calculated as an intermediate step in persistent homology (as described earlier). Recall that the Rips complex of a dataset is a skeletal representations of the dataset’s structure at a particular scale ϵ formed by connecting data points of distance at most ϵ and adding higher-dimensional structures (like triangles between triples of connected points). The connections between data points in a Rips complex form a “VR graph” (Zomorodian 2010), and this graph can be visualized regardless of the dimension of the underlying dataset.
We can use the function vr_graphs
to compute VR graphs
at a variety of scales, and can visualize the resulting graphs with the
igraph package (Csardi and
Nepusz 2006) and the plot_vr_graph
function. To
demonstrate this functionality, we will pick two ϵ scales to study the dataset
structure of circ, only based on its persistence diagram - the first
scale will be half the birth radius of the loop and the second scale
will be the average of the loop’s birth and death radius:
# get half of loop's birth radius
eps_1 <- diag[nrow(diag),2L]/2
# get mean of loop's birth and death radii
eps_2 <- (diag[nrow(diag),3L] + diag[nrow(diag),2L])/2
# compute two VR graphs
gs <- vr_graphs(X = circ,eps = c(eps_1,eps_2))
Next we can plot both VR graphs:
layout <- apply(layout,MARGIN = 2,FUN = function(X){
return(-1 + 2*(X - min(X))/(max(X) - min(X)))
})
The first graph shows that at a scale before the loop is born, the dominant loop does not exist in the data (and that the space is more fragmented), while the second graph was able to retrieve the dominant loop. Visualizing multiple VR graphs of a dataset, choosing particular scales of interest from the persistence diagram of the dataset, can be an effective tool for interpreting topological features.
The input parameters of plot_vr_graph
can help customize
the graph visualizations. For example, if we want to investigate the
loop we can also customize the plot to highlight the loop
representative, and to only plot the component of the graph which
contains the loop vertices:
We can customize one level further by removing the vertex labels and using the graph layout (i.e. x-y coordinates of all vertices) to plot other things on the graph:
# plot only component containing the loop stimuli with vertex colors
plot_vr_graph(gs,eps_2,cols = colors,
component_of = stimuli_in_loop[[1]],
vertex_labels = FALSE,
layout = layout)
# get indices of vertices in loop
# not necessary in this case but necessary when we have
# removed some vertices from the graph
vertex_inds <- match(stimuli_in_loop,as.numeric(rownames(layout)))
# add volcano image over loop nodes
# image could be anything like rasters read from
# png files! this is just an example..
utils::data("volcano")
volcano <- (volcano - min(volcano)) / diff(range(volcano))
for(i in vertex_inds)
{
graphics::rasterImage(volcano,xleft = layout[i,1L] - 0.05,
xright = layout[i,1L] + 0.05,
ybottom = layout[i,2L] - 0.05,
ytop = layout[i,2L] + 0.05)
}
These visualizations can help determine which points are part of a representative and what the dataset structure is at various scales.
Having visualized and interpreted our data of interest (persistence diagrams), a common next step in data analysis pipelines is to ask statistical questions in the form of hypothesis testing (Casella, Berger, and Company 2002). Three such questions that we will focus on are:
In order to answer these questions we need to generate groups of persistence diagrams. For the third question we will generate diagrams from simulated circle datasets. For the first two questions we will generate diagrams which are random deviations of D1, D2 and D3 diagrams from earlier:
The helper function generate_TDApplied_vignette_data
(seen below) adds Gaussian noise with a small variance to the birth and
death values of the points in D1, D2 and D3, making “noisy copies” of
the three diagrams:
Here is an example D1 and two noisy copies:
permutation_test
FunctionOne of the most important inference procedures in classical statistics is the analysis of variance (ANOVA), which can find differences in the means of groups of normally-distributed measurements (Casella, Berger, and Company 2002). Distributions of persistence diagrams and their means can be complicated (see (Turner 2013) and (Turner et al. 2014)). Therefore, a non-parametric permutation test has been proposed which can find differences in groups of persistence diagrams. Such a test was first proposed in (Robinson and Turner 2017), and some variations have been suggested in later publications. In (Robinson and Turner 2017), two groups of persistence diagrams would be compared. The null hypothesis, H0, is that the diagrams from the two groups are generated from shapes with the same type and scale of topological features, i.e. they “come” from the same “shape”. The alternative hypothesis, HA, is that the underlying type or scale of the features are different between the two groups. In each dimension a p-value is computed, finding evidence against H0 in that dimension. A measure of within-group distances (a “loss function”) is calculated for the two groups, and that measure is compared to a null distribution for when the group labels are permuted.
This inference procedure is implemented in the
permutation_test
function, with several speedups and
additional functionalities. Firstly, the loss function is computed in
parallel for scalability since distance computations can be expensive.
Secondly, we store distance calculations as we compute them because
these calculations are often repeated. Additional functionality includes
allowing for any number groups (not just two) and allowing for a pairing
between groups of the same size as described in (Abdallah (2021)). When a natural pairing exists
between the groups (such as if the groups represent persistence diagrams
from the same subject of a study in different conditions) we can
simulate a more realistic null distribution by restricting the way in
which we permute group labels, achieving higher statistical power.
In order to demonstrate the permutation test we will detect differences between noisy copies of D1, D2, D3:
# permutation test between three diagrams
g1 <- generate_TDApplied_vignette_data(3,0,0)
g2 <- generate_TDApplied_vignette_data(0,3,0)
g3 <- generate_TDApplied_vignette_data(0,0,3)
perm_test <- permutation_test(g1,g2,g3,
num_workers = 2,
dims = c(0))
perm_test$p_values
#> 0
#> 0.04761905
As expected, a difference was found (at the α = 0.05 significance level) between the three groups.
The package TDAstats also has a function called
permutation_test
which is based on the same test procedure.
However, it uses an unpublished distance metric between persistence
diagrams (see the package vignette “Comparing Distance Calculations”)
and does not use parallelization for scalability. As such, care must be
taken when both TDApplied and TDAstats
are attached in an R script to use the particular
permutation_test
function desired.
independence_test
FunctionAn important question when presented with two groups of paired data is determining if the pairings are independent or not. A procedure was described in (Gretton et al. 2007) which can be used to answer this question using kernel computations, and importantly uses a parametric null distribution. The null hypothesis for this test is that the groups are independent, and the alternative hypothesis is that the groups are not independent. A test statistic called the Hilbert-Schmidt independence criteria (Gretton et al. 2007) is calculated, and its value is compared to a gamma distribution with certain parameters which are estimated from the data.
This inference procedure has been implemented in the
independence_test
function, and returns the p-value of the
test in each desired dimension of the diagrams (among other additional
information). We would expect to find no dependence between noisy copies
of D1, D2 and D3, since each copy is generated randomly:
# create 10 noisy copies of D1 and D2
g1 <- generate_TDApplied_vignette_data(10,0,0)
g2 <- generate_TDApplied_vignette_data(0,10,0)
# do independence test with sigma = t = 1
indep_test <- independence_test(g1,g2,dims = c(0),num_workers = 2)
indep_test$p_values
#> 0
#> 0.3045212
The p-value of this test would not be significant at any typical significance threshold, reflecting the fact that there is no real (i.e. non-spurious) dependence between the two groups, as expected.
permutation_model_inference
FunctionThe permutation test procedure allows us to test whether two groups of persistence diagrams are different, and the bootstrap procedure allows us to estimate the sampling distribution of the persistence diagram of a single dataset. Putting these two concepts together we can test whether two datasets are topologically similar or not, i.e. if each dataset is a good topological “model” of the other.
The procedure we will employ is to
A small p-value in a given homological dimension of this test
indicates that the two datasets had a significant topological difference
in that dimension. Let’s take an example of two datasets generated by a
circle, our original circ
and a new dataset
circ2
:
When we run the model inference procedure we find that there is no topological difference (at the 0.05 significance level) in dimensions 0 or 1:
mod_comp <- permutation_model_inference(circ, circ2, iterations = 20,
thresh = 2, num_samples = 3,
num_workers = 2L)
mod_comp$p_values
#> 0 1
#> 0.6190476 1.0000000
Now suppose we knew there was a correspondence between the rows of
the two datasets, for instance row i in both datasets corresponded to
the same subject in two studies. For a practical example consider the
two datasets circ
and circ
with a small
variance Gaussian noise source:
We can carry out a paired permutation test to obtain increased statistical power as follows:
mod_comp_paired <- permutation_model_inference(circ, circ_noise, iterations = 20,
thresh = 2, num_samples = 3,
paired = TRUE, num_workers = 2L)
mod_comp_paired$p_values
#> 0 1
#> 0.7619048 0.6666667
Once again the p-values are not significant. In general a high p-value only “suggests” a better model fit, but in this case we know that the two datasets are highly related.
If we wanted to carry out multiple paired tests between datasets (for
instance data from the same subjects across multiple studies) we can
ensure that the same bootstrap samples are used in all comparisons by
explicitly defining those samples and supplying them to the procedure.
If we had defined another dataset, circ3
, we could carry
out the paired tests as follows:
# in this case we are creating 3 samples of the rows of circ
# (and equivalently the other two datasets)
boot_samples <- lapply(X = 1:3,
FUN = function(X){return(unique(sample(1:nrow(circ),size = nrow(circ),replace = TRUE)))})
# carry out three model comparisons between circ, circ2 and circ3
mod_comp_1_2 <- permutation_model_inference(circ, circ2, iterations = 20,
thresh = 2, num_samples = 3,
paired = TRUE, num_workers = 2L,
samp = boot_samples)
mod_comp_1_3 <- permutation_model_inference(circ, circ3, iterations = 20,
thresh = 2, num_samples = 3,
paired = TRUE, num_workers = 2L,
samp = boot_samples)
mod_comp_2_3 <- permutation_model_inference(circ2, circ3, iterations = 20,
thresh = 2, num_samples = 3,
paired = TRUE, num_workers = 2L,
samp = boot_samples)
Patterns may exist in a collection of persistence diagrams. For example, if the collection contained diagrams from three distinct shapes then we would like to be able to assign three distinct (discrete) labels to the correct subsets of diagrams. On the other hand, maybe the diagrams vary along several dimensions, like the mean persistence of their loops and the mean persistence of their components. In this case we would like to be able to retrieve these continuous features for visualizing all diagrams in the same (2D in this example) space. These two types of analyses are called clustering and dimension reduction respectively, and are two of the most common and popular machine learning applications. We will explore three techniques from these areas - kernel k-means from clustering, and multidimensional scaling and kernel principal components analysis from dimension reduction.
diagram_kkmeans
FunctionKernel k-means (Dhillon, Guan, and Kulis
2004) is a method which can find hidden groups in complex data,
extending regular k-means clustering (Murphy
2012) via a kernel. A “kernel distance” is calculated between a
persistence diagram and a cluster center using only the kernel function,
and the algorithm converges like regular k-means. This algorithm is
implemented in the function diagram_kkmeans
as a wrapper of
the kernlab function kkmeans
. Moreover, a prediction
function predict_diagram_kkmeans
can be used to find the
nearest cluster labels for a new set of diagrams. Here is an example
clustering three groups of noisy copies from D1, D2 and D3:
# create noisy copies of D1, D2 and D3
g <- generate_TDApplied_vignette_data(3,3,3)
# calculate kmeans clusters with centers = 3, and sigma = t = 0.1
clust <- diagram_kkmeans(diagrams = g,centers = 3,dim = 0,t = 0.1,sigma = 0.1,num_workers = 2)
# display cluster labels
clust$clustering@.Data
#> [1] 3 3 3 2 2 2 1 1 1
As we can see, the diagram_kkmeans
function was able to
correctly separate the three generating diagrams D1, D2 and D3 (the
cluster labels are arbitrary and therefore may not be 1,1,1,2,2,2,3,3,3;
however, they induce the correct partition).
If we wish to predict the cluster label for new persistence diagrams
(computed via the largest kernel value to any cluster center), we can
use the predict_diagram_kkmeans
function as follows:
# create nine new diagrams
g_new <- generate_TDApplied_vignette_data(3,3,3)
# predict cluster labels
predict_diagram_kkmeans(new_diagrams = g_new,clustering = clust,num_workers = 2)
#> [1] 3 3 3 2 2 2 1 1 1
This function correctly predicted the cluster for each new diagram (assigning each diagram to the cluster label by D1, D2 or D3, depending on which diagram it was generated from).
diagram_mds
FunctionDimension reduction is a task in machine learning which is commonly
used for data visualization, removing noise in data, and decreasing the
number of covariates in a model (which can be helpful in reducing
overfitting). One common dimension reduction technique in machine
learning is called multidimensional scaling (MDS) (Cox and Cox 2008). MDS takes as input an n by n distance (or dissimilarity) matrix
D, computed from n points in a dataset, and outputs
an embedding of those points into a Euclidean space of chosen dimension
k which best preserves the
inter-point distances. MDS is often used for visualizing data in
exploratory analyses, and can be particularly useful when the input data
points do not live in a shared Euclidean space (as is the case for
persistence diagrams). Using the R function cmdscale
from
the package stats (R Core Team (2021)) we
can compute the optimal embedding of a set of persistence diagrams using
any of the three distance metrics with the function
diagram_mds
. Here is an example of the
diagram_mds
function projecting nine persistence diagrams,
three noisy copies sampled from each of D1, D2 and D3, into 2D
space:
# create 9 diagrams based on D1, D2 and D3
g <- generate_TDApplied_vignette_data(3,3,3)
# calculate their 2D MDS embedding in dimension 0 with the bottleneck distance
mds <- diagram_mds(diagrams = g,dim = 0,p = Inf,k = 2,num_workers = 2)
# plot
par(mar=c(5.1, 4.1, 4.1, 8.1), xpd=TRUE)
plot(mds[,1],mds[,2],xlab = "Embedding coordinate 1",ylab = "Embedding coordinate 2",
main = "MDS plot",col = as.factor(rep(c("D1","D2","D3"),each = 3)),bty = "L")
legend("topright", inset=c(-0.2,0),
legend=levels(as.factor(c("D1","D2","D3"))),
pch=16, col=unique(as.factor(c("D1","D2","D3"))))
The MDS plot shows the clear separation between the three generating diagrams (D1, D2 and D3), and the embedded coordinates could be used for further downstream analyses.
diagram_kpca
FunctionPCA is another dimension reduction technique in machine learning, but
can be preferable compared to MDS in certain situations because it
allows for the projection of new data points onto an old embedding model
(Murphy 2012). For example, this can be
important if PCA is used as a pre-processing step in model fitting.
Kernel PCA (kPCA) (Scholkopf, Smola, and Muller
1998) is an extension of regular PCA which uses a kernel to
project complex data into a high-dimensional Euclidean space and then
uses PCA to project that data into a low-dimensional space. The
diagram_kpca
method computes the kPCA embedding of a set of
persistence diagrams, and the predict_diagram_kpca
function
can be used to project new diagrams using a pre-trained kPCA model. Here
is an example using a group of noisy copies of D1, D2 and D3:
# create noisy copies of D1, D2 and D3
g <- generate_TDApplied_vignette_data(3,3,3)
# calculate their 2D PCA embedding with sigma = t = 2
pca <- diagram_kpca(diagrams = g,dim = 0,t = 2,sigma = 2,features = 2,num_workers = 2)
# plot
par(mar=c(5.1, 4.1, 4.1, 8.1), xpd=TRUE)
plot(pca$pca@rotated[,1],pca$pca@rotated[,2],xlab = "Embedding coordinate 1",
ylab = "Embedding coordinate 2",main = "PCA plot",
col = as.factor(rep(c("D1","D2","D3"),each = 3)))
legend("topright",inset = c(-0.2,0),
legend=levels(as.factor(c("D1","D2","D3"))), pch=16,
col=unique(as.factor(c("D1","D2","D3"))))
The function was able to recognize the three groups, and the
embedding coordinates can be used for further downstream analysis.
However, an important advantage of kPCA over MDS is that in kPCA we can
project new points onto an old embedding using the
predict_diagram_kpca
function:
# create nine new diagrams
g_new <- generate_TDApplied_vignette_data(3,3,3)
# project new diagrams onto old model
new_pca <- predict_diagram_kpca(new_diagrams = g_new,embedding = pca,num_workers = 2)
# plot
par(mar=c(5.1, 4.1, 4.1, 8.1), xpd=TRUE)
plot(new_pca[,1],new_pca[,2],xlab = "Embedding coordinate 1",
ylab = "Embedding coordinate 2",main = "PCA prediction plot",
col = as.factor(rep(c("D1","D2","D3"),each = 3)))
legend("topright",inset = c(-0.2,0),
legend=levels(as.factor(c("D1","D2","D3"))), pch=16,
col=unique(as.factor(c("D1","D2","D3"))))
As we can see, the original three groups, and their approximate location in 2D space, is preserved during prediction.
One of the most valuable problems that machine learning is used to solve is that of prediction - can a variable Y be predicted from other variables X. Kernel functions can be used to predict a variable Y from a persistence diagram D using a class of models called support vector machines (SVMs) (Murphy 2012).
diagram_ksvm
FunctionSVMs are one of the most popular machine learning techniques for regression and classification tasks. SVMs use a kernel function to project complex data into a high-dimensional space and then find a sparse set of training examples, called “support vectors”, which maximally linearly separate the outcome variable classes (or yield the highest explained variance in the case of regression).
Unlike in the case of dimension reduction or clustering, it is possible that our persistence diagrams may each have a label, either discrete or continuous, which gives us more information about the underlying data represented by the diagram. For instance, if our persistence diagrams represented periodic behavior in stock market trends during a particular temporal window then a useful (discrete) label could be whether the overall market went up or down during that period. Being able to predict such labels from the persistence diagrams themselves is a way to link persistence diagrams to external data through modeling, i.e. classification and regression. Support vector machines
SVMs have been implemented in TDApplied by the
function diagram_ksvm
, tailored for input datasets which
contain pairs of persistence diagrams and their outcome variable labels.
A prediction method is supplied called predict_diagram_ksvm
which can be used to predict the label value of a set of new persistence
diagrams given a pre-trained model. A parallelized implementation of
cross-validation model-fitting is used based on the remarks in (Le and Yamada 2018) for scalability (which
avoids needlessly recomputing persistence Fisher information metric
values). Here is an example of fitting an SVM model on a list of
persistence diagrams for a classification task (guessing whether the
diagram comes from D1, D2 or D3):
# create thirty noisy copies of D1, D2 and D3
g <- generate_TDApplied_vignette_data(10,10,10)
# create response vector
y <- as.factor(rep(c("D1","D2","D3"),each = 10))
# fit model with cross validation
model_svm <- diagram_ksvm(diagrams = g,cv = 2,dim = c(0),
y = y,sigma = c(1,0.1),t = c(1,2),
num_workers = 2)
We can use the function predict_diagram_ksvm
to predict
new diagrams like so:
# create nine new diagrams
g_new <- generate_TDApplied_vignette_data(3,3,3)
# predict
predict_diagram_ksvm(new_diagrams = g_new,model = model_svm,num_workers = 2)
#> [1] D1 D1 D1 D2 D2 D2 D3 D3 D3
#> Levels: D1 D2 D3
As we can see the best SVM model was able to separate the three
diagrams We can gain more information about the best model found during
model fitting and the CV results by accessing different list elements of
model_svm
.
There is one main limitation of TDApplied which should be discussed for its own future improvements – TDApplied functions cannot analyze numeric and factor features with persistence diagrams. This may be too inflexible for some applications, where the data may include several persistence diagrams, or a mix of persistence diagrams, numeric and categorical variables. The package vignette “Personalized Analyses with TDApplied” demonstrates how one can circumvent this issue using extra code; however, a future update to TDApplied might construct such functionality directly into its functions.
Current topological data analysis packages in R (and Python) do not provide the ability to carry out statistics and machine learning with persistence diagrams, leading to a high barrier to adoption of topological data analysis in academia and industry. By filling in this gap, the TDApplied package aims to bridge topological data analysis with researchers and data practitioners in the R community. Topological data analysis is an exciting and powerful new field of data analysis, and with TDApplied anyone can access its power for meaningful and creative analyses of data.