Structural Bioinformatics Library
Template C++ / Python API for developping structural bioinformatics applications.

Authors: A. Lheritier and F. Cazals
Comparing two sets of multivariate samples is a central problem in data analysis. From a statistical standpoint, one way to perform such a comparison is to resort to a nonparametric twosample test (TST), which checks whether the two sets can be seen as i.i.d. samples of an identical unknown distribution (the null hypothesis, denoted H0).
Accepting or rejecting H0 requires comparing the value obtained for the test statistic used against a predefined threshold, usually denoted . In rejecting H0, further analysis are called for, and in particular:
Effect size calculation: one wishes to assess the magnitude of the difference.
Effect size and feedback analysis are easy for onedimensional data, as one may use the effect size associated with a MannWhitney Utest, or may perform a density estimation and compute some kind of distance (say the Mallows distance) between the obtained distributions. However, such analysis are more complex for multivariate data.
This package fills this gap, proposing a method based upon a combination of statistical learning (regression) and computational topology methods [24] . More formally, consider two populations, each given as a point cloud in . Our analysis proceeds in three steps:
Step 1: a discrepancy measure is estimated at each data point.
Step 2: a clustering of samples is performed. A cluster reported involves samples with significant discrepancy (threshold userdefined), which in addition are in spatial proximity.
(A)  (B) 
(C)  (D) 
(E)  (F) 
(G)  
Method: input and main output. (A) Point clouds–here uniformly sampling two ellipsis. (B) Local Discrepancy on a scale 01 computed at each sample point. (C) Persistence Diagram associated with the sublevel sets of the estimated discrepancy (see text for details). (D) Simplified Disconnectivity Forest illustrating the merge events between sublevel sets. Note that four clusters stand out (the four legs of the tree). (E) Clustering: clusters identified on the disconnectivity forest, mapped back into the ambient space. Note that the green and blue clusters are associated with the red population, while the remaining two come from the blue population. (F) Effect Size Decomposition. The area under the dashed line represents the total effect size. The area under the continuous line is the maximum effect size possible. The area of each bar represents the contribution of each cluster to the effect size. (G) Discrepancy Cumulative Distribution. 
We aim at modeling the discrepancy between two datasets and , in some fixed dimension Euclidean space . We denote the total number of samples. We view these data as coming from two unknown densities and , et denote the average density .
In the first step, we assign a label to each set and we compute, for each sample point, a discrepancy measure based on comparing an estimate of the conditional probability distribution of the label given a position versus the global unconditional label distribution.
Divergence measures between distributions. To define our discrepancy measure, recall that the KullbackLeibler divergence (KLD) between two probability densities and defined as
with the conventions and .
For two discrete distributions and over a finite set , the the KullbackLeibler divergence is defined by:
The JensenShannon divergence (JSD) [79] allows one to symmetrize and smooth the KLD by taking the average KLD of and to the average density , that is:
Discrepancy measure. In taking the average in Eq. (eqJSD), two random variables are implicity defined: a position variable with density and a binary label that indicates from which original density (i.e. or ) an instance of is obtained. Formally, considering the alphabet and , the following pair of random variables is defined:
In the sequel, we will consider the conditional and unconditional mass functions and respectively, as well as the joint probability density . We will also use the notation to denote (resp.~ ) when (resp.~ ).
The connexion between the JS divergence and the previous (conditional) mass functions is given by [24] :
The previous lemma shows that the JSD can be seen as the average, over , of the KLD between the conditional and unconditional distribution of labels. More formally, we define:
The following comments are in order:
ranges between 0 and 1 and is 0 iff .
Taking for granted such an estimator (see section Discrepancy estimation for its effective calculation), we finally obtain an estimator for :
In practice, this estimate is computed using a regressor based on nearest neighbors, see section Discrepancy estimation . The number of neighbors used for each samples is set to , with a user defined parameter.
Joint distribution compatible sampling. Practically, the sizes and of the samples processed may be unbalanced, which would not be compatible with Eq. (eqpairLZ).
To ensure that balanced data sets are used in the estimation, we resort to random multiplexer generating samples , using a Bernoulli distribution to pick from the two data sets (Fig. figrandomgadget ; see also details in [24] ).
The multiplexer halts when the data on one of the two data sets has been exhausted, and the resulting pairs are used to perform the estimation. Note that selected samples from the non exhausted data set may not be used, yielding some information loss. Therefore, we may repeat the process times, so as to eventually use all samples from the larger set. Upon performing these estimates, the discrepancy estimate for a given sample is taken as the median of the estimates.
To describe the second step in simple terms–see [24] for the details, we refer to the socalled watershed transform used in image processing or topographical analysis, see Watershed image processing and Drainage basin. In a nutshell, given a height function, called the landscape for short in the sequel, the watershed transform partitions its into its catchment basins, with one basin for each local minimum.
To transpose these notions in our setting, we proceed in two steps. First, we build a nearest neighbor graph connecting the samples—e.g. we connect each sample to its nearest neighbors. Second, we lift this graph into by adding a new coordinate to each sample, namely its estimated discrepancy. This lifted graph provides a discrete representation for the aforementioned landscape. We now describe our method using this landscape (Fig. figsamplestoheightfunction ).
On the landscape, we focus on samples whose estimated discrepancy is less than , with a user defined threshold. The region of the landscape matching this specification is called the sublevel set associated with . We define a cluster as a connected component of this sublevel set. Note that each such cluster may be associated with the lowest local minimum of this connected component. However, in case of noisy estimates, many small cluster may be reported. To ease the interpretation of results, we therefore smooth the landscape, and focus on c.c. of the smoothed landscape. Smoothing is carried out using topological persistence [54] , which consists of canceling local minima associated with non significant basins. More precisely, recall that the persistence of a local minimum is the height difference to the lowest saddle connecting this local minimum to a deeper local minimum. We retain local minima whose persistence is above a threshold .
To recap, our analysis aims at:
Identifying samples with significant discrepancy.
Method: algorithm illustrated on a toy 1D example (A, discrepancy estimation) A discrepancy is estimated at each sample (see text for details). (B, clusters as connected components) Upon reverting the sign of the discrepancy, clusters are defined as connected components (c.c.) of the region of the previous curve below a user defined threshold . On this example, four such components are observed, two large ones and two small ones. Note that each cluster can be charged to one local minimum of the function . (C, smoothed clusters upon applying topological persistence) To get rid of small clusters, the curve is smoothed using topological persistence. This simplification yields two clusters. Note that the blue sample squeezed between and does not appear in the red cluster since its discrepancy is below (and likewise for the red point with respect to and ). 
This step consists of adding up the discrepancies on a per sample basis in a cluster :
Putting everything together, the program performs the following plots:
Persistence diagram: A plot showing for each minima a red cross with coordinates corresponding to its birth and death dates respectively, while analyzing the landscape whose elevation is the negative estimated discrepancy.
Discrepancy cumulative distribution function: The cumulative distribution function of the estimated discrepancies.
Disconnectivity forest: a hierarchical representation of the merge events associated with sublevel sets of the estimated discrepancy.
Clusters: A plot similar to the raw data plot, with one color per cluster. The points not belonging to any cluster are colored in gray.
For high dimensional date, we also provide the following 2D embeddings:
Raw data embedding: For samples embedded in 2D or 3D space, a plot of the points with a color to indicate the label (blue: ; red: ). For samples embedded in a higher dimensional space, a 2D embedding of these samples obtained using multidimensional scaling (MDS). In any case, the goal of this plot is to intuitively visualize the distributions of the two populations.
Random multiplexer generating pairs (label, position). 
In the sequel, we specify the main steps and the options offered, see also Fig. figworkflowddbc .
The first step is carried out by the python program . Its main input are two files listing points. A basic calculation is launched as follows:
> sblddbcstep1discrepancy.py f data/ellipse_w_2_h_10_n_1000__X.dat g data/ellipse_w_2_h_10_n_1000__Y.dat theta0 0.5 kofnpower 0.66 B 1 directory results
File Name  Description 
Points text file  A list of points in a given dimension. 
Points text file  A list of points in a given dimension. 
Optional step. This step aims to provide plots that may be used in the third step. It is carried out by the python program . Its main input is the prefix of the output files of . A basic calculation is launched as follows:
> sblddbcdiscbasedcoloredembedding.py f results/ellipse_w_2_h_10_n_1000__X_ellipse_w_2_h_10_n_1000__Y
See Table tableddbcoutputstep1 to see the input files of .
The second step is carried out using the python program . It uses the executable , from the SBL, and its main input is the set of points with their discrepancy, output of . A basic calculation is launched as follows:
> sblddbcstep2clustering.py p results/ellipse_w_2_h_10_n_1000__X_ellipse_w_2_h_10_n_1000__Y.pointsWdim w results/ellipse_w_2_h_10_n_1000__X_ellipse_w_2_h_10_n_1000__Y.heightDiv n 6 P rho 0.1 deltamax 0.1 d results o ellipse_w_2_h_10_n_1000__X_ellipse_w_2_h_10_n_1000__Y v
Note that if the executable is not in the path, the path to may be specified using the options e. See Table tableddbcoutputstep1 to see the input files of .
The third step is carried out using the python program . Its main input is the clustering made by , with the output points of the program . A basic calculation is launched as follows:
> sblddbcstep3clusterplots.py f results/ellipse_w_2_h_10_n_1000__X_ellipse_w_2_h_10_n_1000__Y.vertices c results/ellipse_w_2_h_10_n_1000__X_ellipse_w_2_h_10_n_1000__Y__sublevelsets_connected_components.xml p results/ellipse_w_2_h_10_n_1000__X_ellipse_w_2_h_10_n_1000__Y.2Dpoints
See Tables tableddbcoutputstep1 , tableddbcoutputoptionalstep and tableddbcoutputstep2 to see the input files of .
Step 1.
Preview  File Name  Description 
Points text file  A list of discrepancy, one per point.  
Weights text file  A list of discrepancy, one per point.  
Text file  Classification of points... 
Optional step.
Preview  File Name  Description 
Points text file  A list of 2D points from the input points.  
Populations image  2D representation of the populations.  
Discrepancy image  2D representation of the discrepancy of the points. 
Step 2.
Preview  File Name  Description 
Disconnectivity Forest as EPS file  Image of the disconnectivity forest.  
Persistence Diagram as PDF file  Image of the persistence diagram.  
Clusters as XML file  Clusters of points.  
NNG as XML file  Nearest Neighbors Graph. 
Note that there are other files printed by the program that are not of interest for this application. We refer the reader to the package Morse_theory_based_analyzer for more details about these files.
Step 3.
Preview  File Name  Description 
Clustering image  2D representation of the clusters.  
Barplots image  Barplot representation of the clustering.  
Discrepancy Cumulative Distribution image  Curve representing the discrepancy cumulative distribution function. 
An example illustrating the processing of a mixture of Gaussians is provided on Fig. figgaussiansinout .
In this example – see [24] for the detailed specification, the first population is drawn according to a mixture of four spherical Gaussians with equal probability, while the second one is drawn according to a mixture of four nonspherical Gaussians with equal probability
The different steps are run as following:
(A)  (B) 
(C)  (D) 
(E)  (F) 
(G)  
Processing a mixture of Gaussians. The panels are identical to those in Fig. figellipsisinout . The following values were used: persistence simplification at threshold ; clustering with . 
An example illustrating the processing of real data in high dimension, we process a mixture of handwritten digits, each represented as an grayscale image – see the MNIST dataset [75] .
The two populations see also figdigitsinout , were chosen as follows:
The different steps are run as following:
(A)  (B) 
(C)  (D) 
(E)  (F) 
(G)  
Processing a mixture of handwritten digits represented as images (digits: 3, 6, 8). The panels are identical to those in Fig. figellipsisinout . The following values were used: persistence simplification at threshold ; clustering with . 
In this section, we explain how to compute the estimator of Eq. (eqdeltaz), using the framework of regression [63] , and more precisely regressors based on nearest neighbors. We define:
In the regression problem, the goal is to build an estimator of using a set of i.i.d.~realizations of .
In order to apply this framework to our problem, note that the correspondence yields
Then, we can use the following estimator for
Note that it is required that , since we aim at estimating a conditional probability, and that it is satisfied by the nearest neighbor regressor.
Using Eq. (eqestimpmflz), we finally obtain an estimator for :
Precisely defining persistence is beyond the scope of this user manual, and the reader is referred to [54] , as well as to [24] for all details.
We comment on three issues, though:
Practically, our algorithm operates in a discrete setting wince the input consists of samples. The samples are used to build a nearest neighbor graph (NNG), by connecting each sample to a number of nearest neighbors (for a suitable distance metric), from which local minima and saddles can be identified [33] .
Equipped with a NNG, reporting the connected components (c.c.) of a sublevel set is a mere application of the socalled unionfind algorithm, whose complexity is essentially linear in the number of samples processed. See [33] for details.
For large datasets though, in exploratory data analysis, the previous unionfind algorithm may be called for varying values . In that case, a more suitable strategy consists of working on the critical points of the height function (the estimated discrepancy), once all samples which are regular points have been assigned to their socalled stable manifolds. The persistence can then be run on the graph of critical points only (the MorseSmaleWitten complex), as explained in [22] .
Method: workflow. In blue: the parameters. 
The following two external dependencies are used in this applications: