--- title: "Comparing Distance Calculations" author: "Shael Brown and Dr. Reza Farivar" output: rmarkdown::html_document: fig_caption: yes vignette: > %\VignetteIndexEntry{Comparing Distance Calculations} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} bibliography: REFERENCES.bib --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) # get original graphic parameters to be able to # revert back at the end of the vignette original_mfrow <- par()$mfrow original_xpd <- par()$xpd original_mar <- par()$mar original_scipen <- options()$scipen if(exists(".Random.seed", .GlobalEnv) == F) { runif(1) } oldseed <- get(".Random.seed", .GlobalEnv) oldRNGkind <- RNGkind() # set some new parameters for viewing and reproducibility options(scipen = 999) set.seed(123) ``` # Introduction A number of R packages exist for computing distances between pairs of persistence diagrams, including **TDA** [@R-TDA], **rgudhi** [@rgudhi] and **TDApplied**. Comparing the speed of these calculations was performed in the "Benchmarking and Speed" package vignette, but here we treat the more basic question of "are these distance calculations the same across packages?" Through examples we show that the answer is unfortunately no, but through exploration we attempt to reconcile these differences and provide guidelines for using the different packages. Moreover, we include a proof of algorithm correctness for **TDApplied**'s distance function and in the following section we describe why we do not compare **TDApplied**'s distance function against that of the **TDAstats** package [@R-TDAstats]. # **TDAstats**' `phom.dist` Function The **TDAstats** package has a (wasserstein) distance function, `phom.dist`, which is described in its package documentation as being "not meaningful without a null distribution". But what does this mean? We examined the source code for **TDAstats**, i.e. its R/inference.R file, and found that the internal function `wass_workhorse` on lines 77-103 is the actual distance calculation. In this function, the persistence values (i.e. death - birth) for topological features of two diagrams (and their diagonal projections appended to the opposite diagram) are ordered from largest to smallest. The persistence values are then paired between the two diagrams according to their orderings, and the absolute differences of these ordered persistence values are exponentiated and summed. This algorithm is not guaranteed to produce an optimal matching of the topological features (in the sense of the usual wasserstein cost function), and is missing the final exponentiation (in the case of the 2-wasserstein metric we take the square root of the sum of squared distances). These differences to the standard wasserstein formulation were clear to the authors of **TDAstats**, and while their `phom.dist` function is not exactly a wasserstein metric it is still a comparison statistic of two persistence diagrams which can be studied with inference techniques. Nevertheless, these differences are the reason why in **TDApplied** documentation we refer to **TDAstats**' distance calculation as being "non-standard" and why comparisons (of calculations and benchmarking) are not made against it. # Examples In order to compare the distance calculations of **TDA**, **rgudhi** and **TDApplied**, we will specify three simple diagrams (the same D1, D2 and D3 from the package vignette "TDApplied Theory and Practice") and consider distances between each of the three possible pairs. The three diagrams are as follows. ```{r,echo = F,include = F} library(TDApplied) ``` ```{r,echo = F,fig.height = 3,fig.width = 7,fig.align = 'center'} D1 = data.frame(dimension = c(0),birth = c(2),death = c(3)) D2 = data.frame(dimension = c(0),birth = c(2,0),death = c(3.3,0.5)) D3 = data.frame(dimension = c(0),birth = c(0),death = c(0.5)) par(mfrow = c(1,3)) plot_diagram(D1,title = "D1",max_radius = 4,legend = F) plot_diagram(D2,title = "D2",max_radius = 4,legend = F) plot_diagram(D3,title = "D3",max_radius = 4,legend = F) ``` ```{r,echo = F} par(mfrow = c(1,1)) ``` D1's point is $(2,3)$, D2's points are $\{(2,3.3),(0,0.5)\}$, and D3's point is $(0,0.5)$. In order to compare the distance calculations of the packages we need to have the correct values of the distances. Using the infinity-norm distance for calculating matchings in the bottleneck and wasserstein distances as in [@distance_calc;@comp_geom], we get the following optimal matchings and distance values: 1. Between D1 and D2 we match D1's $(2,3)$ with D2's $(2,3.3)$, and D2's $(0,0.5)$ with its diagonal projection $(0.25,0.25)$. Therefore, the bottleneck distance is 0.3 and the wasserstein distance is $\sqrt{0.3^2+0.25^2}=\sqrt{0.1525}\approx 0.3905125$. 2. Between D1 and D3 we match D1's $(2,3)$ with its diagonal projection $(2.5,2.5)$, and D3's $(0,0.5)$ with its diagonal projection $(0.25,0.25)$. Therefore, the bottleneck distance is 0.5 and the wasserstein distance is $\sqrt{0.5^2+0.25^2}=\sqrt{0.3125}\approx 0.559017$. 3. Between D2 and D3 we match D2's $(0,0.5)$ with D3's $(0,0.5)$, and D2's $(2,3.3)$ with its diagonal projection $(2.65,2.65)$. Therefore, the bottleneck distance is 0.65 and the wasserstein distance is $\sqrt{0.65^2}=0.65$. For the Fisher information metric [@persistence_fisher] we will use the parameter $\sigma = 1$ for simplicity. We must first calculate vectors $\rho_1$ and $\rho_2$, normalize them by dividing by the sum of their respective elements, then compute $\mbox{cos}^{-1}(\sqrt{\rho_1} \cdot \sqrt{\rho_2})$ (where $\mbox{cos}^{-1}$ is the arccos function). See [@persistence_fisher] for computational details. 1. Between D1 and D2 we augment D1 to contain D2's projection points and vice versa, obtaining diagrams $D'_1 = \{(2,3),(2.65,2.65),(0.25,0.25)\}$ and $D'_2 = \{(2,3.3),(0,0.5),(2.5,2.5)\}$. We compute $\rho_1$ and $\rho_2$ as $$\begin{align} \rho_1 = & \{\mbox{exp}(0) + \mbox{exp}(-0.545/2) + \mbox{exp}(-10.625/2),\\ & \mbox{exp}(-0.545/2) + \mbox{exp}(0) + \mbox{exp}(-11.52/2),\\ & \mbox{exp}(-10.625/2) + \mbox{exp}(-11.52/2) + \mbox{exp}(0),\\ & \mbox{exp}(-0.09/2) + \mbox{exp}(-0.845/2) + \mbox{exp}(-12.365/2),\\ & \mbox{exp}(-10.25/2) + \mbox{exp}(-11.645/2) + \mbox{exp}(-0.125/2),\\ & \mbox{exp}(-0.5/2) + \mbox{exp}(-0.045/2) + \mbox{exp}(-10.125/2)\}/(2\pi)\end{align}$$ $$\begin{align} \rho_2 = & \{\mbox{exp}(-0.09/2) + \mbox{exp}(-10.25/2) + \mbox{exp}(-0.5/2),\\ & \mbox{exp}(-0.845/2) + \mbox{exp}(-11.645/2) + \mbox{exp}(-0.045/2),\\ & \mbox{exp}(-12.365/2) + \mbox{exp}(-0.125/2) + \mbox{exp}(-10.125/2),\\ & \mbox{exp}(0) + \mbox{exp}(-11.84/2) + \mbox{exp}(-0.89/2),\\ & \mbox{exp}(-11.84/2) + \mbox{exp}(0) + \mbox{exp}(-10.25/2),\\ & \mbox{exp}(-0.89/2) + \mbox{exp}(-10.25/2) + \mbox{exp}(0)\}/(2\pi)\end{align}$$           Therefore the arccos of the dot product of the square root of the sum-normalized vectors is approximately 0.02354624. 2. Between D1 and D3 we augment D1 to contain D3's projection point and vice versa, obtaining diagrams $D'_1 = \{(2,3),(0.25,0.25)\}$ and $D'_3 = \{(0,0.5),(2.5,2.5)\}$. We compute $\rho_1$ and $\rho_2$ as $$\begin{align} \rho_1 = & \{\mbox{exp}(0) + \mbox{exp}(-10.625/2),\\ & \mbox{exp}(-10.625/2) + \mbox{exp}(0),\\ & \mbox{exp}(-10.25/2) + \mbox{exp}(-0.125/2),\\ & \mbox{exp}(-0.5/2) + \mbox{exp}(-10.125/2)\}/(2\pi)\end{align}$$ $$\begin{align} \rho_2 = & \{\mbox{exp}(-10.25/2) + \mbox{exp}(-0.5/2),\\ & \mbox{exp}(-0.125/2) + \mbox{exp}(-10.125/2),\\ & \mbox{exp}(0) + \mbox{exp}(-10.25/2),\\ & \mbox{exp}(-10.25/2) + \mbox{exp}(0)\}/(2\pi)\end{align}$$           Therefore the arccos of the dot product of the square root of the sum-normalized vectors is approximately 0.08821907. 3. Between D2 and D3 we augment D2 to contain D3's projection point and vice versa, obtaining diagrams $D'_2 = \{(2,3.3),(0,0.5),(0.25,0.25)\}$ and $D'_3 = \{(0,0.5),(2.65,2.65),(0.25,0.25)\}$. We compute $\rho_1$ and $\rho_2$ as $$\begin{align} \rho_1 = & \{\mbox{exp}(0) + \mbox{exp}(-11.84/2) + \mbox{exp}(-12.365/2),\\ & \mbox{exp}(-11.84/2) + \mbox{exp}(0) + \mbox{exp}(-0.125/2),\\ & \mbox{exp}(-12.365/2) + \mbox{exp}(-0.125/2) + \mbox{exp}(0),\\ & \mbox{exp}(-11.84/2) + \mbox{exp}(0) + \mbox{exp}(-0.125/2),\\ & \mbox{exp}(-0.845/2) + \mbox{exp}(-11.645/2) + \mbox{exp}(-11.52/2),\\ & \mbox{exp}(-12.365/2) + \mbox{exp}(-0.125/2) + \mbox{exp}(0)\}/\sqrt{2\pi}\end{align}$$ $$\begin{align} \rho_2 = & \{\mbox{exp}(-11.84/2) + \mbox{exp}(-0.845/2) + \mbox{exp}(-12.365/2),\\ & \mbox{exp}(0) + \mbox{exp}(-11.645/2) + \mbox{exp}(-0.125/2),\\ & \mbox{exp}(-0.125/2) + \mbox{exp}(-11.52/2) + \mbox{exp}(0),\\ & \mbox{exp}(0) + \mbox{exp}(-11.645/2) + \mbox{exp}(-0.125/2),\\ & \mbox{exp}(-11.645/2) + \mbox{exp}(0) + \mbox{exp}(-11.52/2),\\ & \mbox{exp}(-0.125/2) + \mbox{exp}(-11.52/2) + \mbox{exp}(0)\}/\sqrt{2\pi}\end{align}$$           Therefore the arccos of the dot product of the square root of the sum-normalized vectors is approximately 0.08741134. # Comparisons While all three of **TDA**, **rgudhi** and **TDApplied** provide functions for the bottleneck and wasserstein calculations, only **TDApplied** and **rgudhi** have the functionality to calculate the Fisher information metric. We calculated the results of each distance calculation, for each pair of $D_i$ and $D_j$ ($i \neq j$) and each package, and stored the results in tables according to the three distance metrics under comparison: ```{r,eval = F,include = F} D1_TDA <- as.matrix(D1) colnames(D1_TDA) <- NULL D2_TDA <- as.matrix(D2) colnames(D2_TDA) <- NULL D3_TDA <- as.matrix(D3) colnames(D3_TDA) <- NULL dis_bot <- rgudhi::BottleneckDistance$new() dis_wass <- rgudhi::WassersteinDistance$new() dis_fish <- rgudhi::PersistenceFisherDistance$new() # default sigma is 1 bottleneck_comparison <- data.frame(pair = c("D1 and D2","D1 and D3","D2 and D3"),ground_truth = c(0.3,0.5,0.65),TDApplied = c(diagram_distance(D1,D2,p = Inf),diagram_distance(D1,D3,p = Inf),diagram_distance(D2,D3,p = Inf)),TDA = c(TDA::bottleneck(D1_TDA,D2_TDA,dimension = 0),TDA::bottleneck(D1_TDA,D3_TDA,dimension = 0),TDA::bottleneck(D2_TDA,D3_TDA,dimension = 0)),rgudhi = c(dis_bot$apply(D1[,2:3],D2[,2:3]),dis_bot$apply(D1[,2:3],D3[,2:3]),dis_bot$apply(D2[,2:3],D3[,2:3]))) wasserstein_comparison <- data.frame(pair = c("D1 and D2","D1 and D3","D2 and D3"),ground_truth = c(0.3905125, 0.559017, 0.65),TDApplied = c(diagram_distance(D1,D2),diagram_distance(D1,D3),diagram_distance(D2,D3)),TDA = c(TDA::wasserstein(D1_TDA,D2_TDA,dimension = 0),TDA::wasserstein(D1_TDA,D3_TDA,dimension = 0),TDA::wasserstein(D2_TDA,D3_TDA,dimension = 0)),rgudhi = c(dis_wass$apply(D1[,2:3],D2[,2:3]),dis_wass$apply(D1[,2:3],D3[,2:3]),dis_wass$apply(D2[,2:3],D3[,2:3]))) fisher_comparison <- data.frame(pair = c("D1 and D2","D1 and D3","D2 and D3"),ground_truth = c(0.02354624,0.08821907,0.1139891),TDApplied = c(diagram_distance(D1,D2,distance = "fisher",sigma = 1),diagram_distance(D1,D3,distance = "fisher",sigma = 1),diagram_distance(D2,D3,distance = "fisher",sigma = 1)),rgudhi = c(dis_fish$apply(D1[,2:3],D2[,2:3]),dis_fish$apply(D1[,2:3],D3[,2:3]),dis_fish$apply(D2[,2:3],D3[,2:3]))) ``` ```{r,echo = F} bottleneck_comparison <- data.frame(pair = c("D1 and D2","D1 and D3","D2 and D3"),ground_truth = c(0.3,0.5,0.65),TDApplied = c(0.3,0.5,0.65),TDA = c(0.3,0.5,0.65),rgudhi = c(0.3,0.5,0.65)) wasserstein_comparison <- data.frame(pair = c("D1 and D2","D1 and D3","D2 and D3"),ground_truth = c(0.3905125 ,0.559017 ,0.65),TDApplied = c(0.3905125,0.559017,0.65),TDA = c(0.55,0.75,0.65),rgudhi = c(0.55,0.75,0.65)) fisher_comparison <- data.frame(pair = c("D1 and D2","D1 and D3","D2 and D3"),ground_truth = c(0.02354624 ,0.08821907 ,0.08741134),TDApplied = c(0.02354624 ,0.08821907 ,0.08741134),rgudhi = c(0.02354624 ,0.08821907 ,0.08741134)) ``` Bottleneck comparison: ```{r,echo = F} bottleneck_comparison ``` Wasserstein comparison: ```{r,echo = F} wasserstein_comparison ``` Fisher information metric comparison: ```{r,echo = F} fisher_comparison ``` All three packages agreed on the value of the three bottleneck calculations, and both **TDApplied** and **rgudhi** agreed on all Fisher information metric calculations. However, while **TDA** and **TDAstats** agreed on the three wasserstein calculations, two of these differed from **TDApplied**'s output. This occurred because **TDA** and **rgudhi** use a slightly different formula for computing wasserstein distances -- where the distance between matched pairs of persistence diagram points is Euclidean rather than an infinity-norm distance. This is a perfectly suitable distance metric and matches the formula in [@Robinson_Turner]. However, it is different from the published formulas in important works like [@distance_calc] and [@comp_geom] (which are the formulas that **TDApplied** implements). We state a quick last note of comparison between the kernel calculations in **TDApplied** and **rgudhi**. Even though their Fisher information metric calculations appear to be the same (perhaps up to small differences in algorithm precision), it turns out that **rgudhi** and **TDApplied** return drastically different kernel values. For example, the Fisher information metric between D2 and D3 was (correctly) stated as 0.08741134 for both packages. It follows that when $t = 2$ the persistence Fisher kernel value should be $\mbox{exp}(-2*0.08741134) \approx 0.8396059$, which is the exact value of the **TDApplied** calculation `diagram_kernel(D2,D3,dim = 0,t = 2)`. However, the code ```{r,eval = F} gudhi_kern <- rgudhi::PersistenceFisherKernel$new(bandwidth = 0.5) gudhi_kern$apply(D2[,2:3],D3[,2:3]) ``` returns the value 0.7550683 (note that the `bandwidth` parameter is $1/t$). Even more perplexing is that the following code ```{r,eval = F} gudhi_kern <- rgudhi::PersistenceFisherKernel$new(bandwidth_fisher = 0.5) gudhi_kern$apply(D2[,2:3],D3[,2:3]) ``` returns the value 1.8326, which should not be possible for the function $\mbox{exp}(-t*d_{FIM})$ as $t$ and $d_{FIM}$ are always positive and non-negative respectively (so the maximum value should be 1). Unfortunately we were not able to identify the source of this confusion by examining the source code of **rgudhi** (and GUDHI), but it is still possible to calculate correct kernel values as follows: ```{r,eval = F} d <- rgudhi::PersistenceFisherDistance$new() # sigma = 1 t <- 2 # or whatever desired parameter exp(-t*d$apply(D2[,2:3],D3[,2:3])) ``` Since distance calculations are much faster with **rgudhi** than with **TDApplied** (see the package vignette "Benchmarking and Speedups"), and because distance and Gram matrices can be precomputed and reused across multiple analyses (again see "Benchmarking and Speedups") it would be desirable to have a (correct) **rgudhi** Gram matrix function. An example of such a function is the following: ```{r,eval = F} # create rgudhi distance object # sigma = 0.01 gudhi_dist <- rgudhi::PersistenceFisherDistance$new(bandwidth = 0.01,n_jobs = 1L) # create list of diagrams, only birth and death values in dimension 1 g <- lapply(X = 1:10,FUN = function(X){ df <- diagram_to_df(TDA::ripsDiag(X = TDA::circleUnif(n = 50), maxdimension = 1,maxscale = 2)) return(df[which(df[,1] == 1),2:3]) }) # distance matrix function # diagrams is a list of diagrams (only birth and death columns) # gudhi_dist is the rgudhi distance object gudhi_distance_matrix <- function(diagrams,gudhi_dist){ # get number of rows of each diagram since rgudhi can't calculate # distances with empty diagrams rows <- unlist(lapply(diagrams,FUN = nrow)) inds <- which(rows > 0) # if inds is empty then return 0 matrix if(length(inds) == 0) { return(matrix(data = 0,nrow = length(diagrams),ncol = length(diagrams))) } # calculate distance matrix for non-zero-row diagrams d_non_zero <- gudhi_dist$fit_transform(diagrams[inds]) # fix diagonal which can sometimes have non-zero entries diag(d_non_zero) <- rep(0,nrow(d_non_zero)) # symmetrize (necessary due to numeric rounding issues) d_non_zero[which(upper.tri(d_non_zero),arr.ind = T)[,c("col","row")]] <- d_non_zero[upper.tri(d_non_zero)] # if all diagrams had at least one row, return if(length(inds) == length(diagrams)) { return(d_non_zero) } # create empty distance matrix d d <- matrix(data = 0,nrow = length(diagrams),ncol = length(diagrams)) # update entries of d e <- as.matrix(expand.grid(inds,inds)) e <- e[which(e[,1] < e[,2]),] if(!is.matrix(e)) { e <- t(as.matrix(e)) } d[e] <- d_non_zero[which(upper.tri(d_non_zero),arr.ind = T)] e <- e[,2:1] if(!is.matrix(e)) { e <- t(as.matrix(e)) } d[e] <- d_non_zero[which(upper.tri(d_non_zero),arr.ind = T)] return(d) } # Gram matrix function # diagrams is a list of diagrams (only birth and death columns) # t is the t parameter like in diagram_kernel # gudhi_dist is the rgudhi distance object gudhi_gram_matrix <- function(diagrams,t,gudhi_dist){ # calculate distance matrix D <- gudhi_distance_matrix(diagrams = diagrams,gudhi_dist = gudhi_dist) return(exp(-t*D)) } # calculate the Gram matrix G <- gudhi_gram_matrix(diagrams = g,t = 1,gudhi_dist = gudhi_dist) ``` Another issue with **rgudhi** calculations can be reproduced as follows: ```{r,eval = F} # create data frame D = data.frame(dimension = c(0,0),birth = c(0,0),death = c(1.089866,1.098640)) # create rgudhi distance object gudhi_dist <- rgudhi::PersistenceFisherDistance$new(bandwidth = 0.01,n_jobs = 1L) # compute distance gudhi_dist$apply(D[,2:3],D[,2:3]) ``` ```{r,eval = T,echo = F} 0.00000001490116 ``` This calculation returns a non-zero number for the distance value of D with itself. This is likely due to numerical rounding issues, and its discovery in **rgudhi** has led to an update in **TDApplied**'s `diagram_distance` function which now returns 0 for this calculation: ```{r,echo = F} D = data.frame(dimension = c(0,0),birth = c(0,0),death = c(1.089866,1.098640)) ``` ```{r} diagram_distance(D,D,distance = "fisher",sigma = 0.001) ``` # Proof of Correctness for **TDApplied**'s diagram_distance Function Even though the Hungarian algorithm can be used to solve the linear sum assignment problem (LSAP) [@R-clue], finding a minimal cost matching of two sets of points, some work needs to be done to properly apply the algorithm to calculate wasserstein or bottleneck distances. For an example we will consider the bottleneck distance, although the argument still holds with a simple change for the wasserstein distance (squaring matrix entries). Let Diag1 and Diag2 be two diagrams, with $n_1$ and $n_2$ points respectively, whose projections onto the diagonal are denoted by $\pi(\mbox{Diag1})$ and $\pi(\mbox{Diag2})$ respectively. Then take $M$ to be the following $(n_1 + n_2) \times (n_1 + n_2)$ matrix: $$M = \left[ \begin{array}{c|c} d_{\infty}(\mbox{Diag1},\mbox{Diag2}) & d_{\infty}(\mbox{Diag1},\pi(\mbox{Diag2})) \\ \hline d_{\infty}(\pi(\mbox{Diag1}),\mbox{Diag2}) & 0 \end{array} \right]$$ Each row corresponds to the $n_1$ points in Diag1 followed by the $n_2$ projections $\pi(\mbox{Diag2})$, and vice versa for the columns. Then we claim that the solution of the LSAP problem on $M$ has the same cost as the bottleneck distance value between Diag1 and Diag2. Firstly, we claim that a solution to the LSAP problem on $M$ has a cost which is no less than the distance value. Let the distance value be $s$. Now suppose, to reach a contradiction, that there existed a lower-cost matching for the LSAP problem for $M$, $m$,of cost $s' < s$. Since projection points are matched together with cost 0 in $M$, let $m'$ contain all the matches in $m$ which are not between two projection points. Then $m'$ would be matching for the distance calculation which has lower cost than $s$, contradicting the minimality of $s$. Therefore, a solution to the LSAP problem on $M$ has a cost which is no less than the distance value. Next, we claim that a solution to the LSAP problem on $M$ has a cost which is no greater than the distance value. Now suppose, to reach a contradiction, that the solution to the LSAP on problem $M$ had cost $s$, which was larger than the real distance value, $s'$. Let $m'$ be a matching for the distance calculation of $s'$. Then since each point in either diagram is either paired with a point in the other diagram or its own diagonal projection, there must be an equal number of unpaired points in both diagrams in $m'$. Therefore, we can augment $m'$ to a matching $m$ on $M$ in which the unpaired diagonal points are arbitrarily paired up with cost 0. Thus, $m$ has cost $s' < s$, contradicting the minimality of $s$. Therefore, a solution to the LSAP problem on $M$ has a cost which is no greater than the distance value. Therefore, a solution to the LSAP problem for $M$ has a cost which is both greater than and less than the bottleneck distance value, and hence the two values must be equal. ```{r,echo = F} # reset parameters par(mfrow = original_mfrow,xpd = original_xpd,mar = original_mar) options(scipen = original_scipen) do.call("RNGkind",as.list(oldRNGkind)) assign(".Random.seed", oldseed, .GlobalEnv) ``` # Conclusion ## References