Density_difference_based_clustering

Authors: A. Lheritier and F. Cazals

Goals

Comparing two sets of multivariate samples is a central problem in data analysis. From a statistical standpoint, one way to perform such a comparison is to resort to a non-parametric two-sample test (TST), which checks whether the two sets can be seen as i.i.d. samples of an identical unknown distribution (the null hypothesis, denoted H0).

Accepting or rejecting H0 requires comparing the value obtained for the test statistic used against a pre-defined threshold, usually denoted $\alpha$ . In rejecting H0, further analysis are called for, and in particular:

Effect size calculation: one wishes to assess the magnitude of the difference.
Feedback analysis: one wishes to identify regions in parameter space accounting for this difference.

Effect size and feedback analysis are easy for one-dimensional data, as one may use the effect size associated with a Mann-Whitney U-test, or may perform a density estimation and compute some kind of distance (say the Mallows distance) between the obtained distributions. However, such analysis are more complex for multivariate data.

This package fills this gap, proposing a method based upon a combination of statistical learning (regression) and computational topology methods [37] . More formally, consider two populations, each given as a point cloud in $\Rnt^d$ . Our analysis proceeds in three steps:

Step 1: a discrepancy measure is estimated at each data point.
Step 2: a clustering of samples is performed. A cluster reported involves samples with significant discrepancy (threshold user-defined), which in addition are in spatial proximity.
Step 3: the effect size is decomposed in a sum of contributions per cluster.


(A)	(B)

(C)	(D)

(E)	(F)

(G)
Method: input and main output. (A) Point clouds–here uniformly sampling two ellipsis. (B) Local Discrepancy on a scale 0-1 computed at each sample point. (C) Persistence Diagram associated with the sublevel sets of the estimated discrepancy (see text for details). (D) Simplified Disconnectivity Forest illustrating the merge events between sublevel sets. Note that four clusters stand out (the four legs of the tree). (E) Clustering: clusters identified on the disconnectivity forest, mapped back into the ambient space. Note that the green and blue clusters are associated with the red population, while the remaining two come from the blue population. (F) Effect Size Decomposition. The area under the dashed line represents the total effect size. The area under the continuous line is the maximum effect size possible. The area of each bar represents the contribution of each cluster to the effect size. (G) Discrepancy Cumulative Distribution.

Using the programs

Pre-requisites

We aim at modeling the discrepancy between two datasets $\seqLS{x}{\nbsamplesX} \equiv {x_1,\dots,x_{\nbsamplesX} }$ and $\seqLS{y}{\nbsamplesY} \equiv {y_1,\dots,y_{\nbsamplesY} }$ , in some fixed dimension Euclidean space $\Rnt^d$ . We denote $n=\nbsamplesX+\nbsamplesY$ the total number of samples. We view these data as coming from two unknown densities $\densd{X}$ and $\densd{Y}$ , et denote $\dens$ the average density $\dens \equiv \nicefrac{(\densdX+\densdY)}{2}$ .

For the sake of conciseness, the X and Y populations are respectively assigned the labels 0 and 1, and are respectively represented in blue and red.

Step 1

In the first step, we assign a label to each set and we compute, for each sample point, a discrepancy measure based on comparing an estimate of the conditional probability distribution of the label given a position versus the global unconditional label distribution.

Divergence measures between distributions. To define our discrepancy measure, recall that the Kullback-Leibler divergence (KLD) between two probability densities and defined as

$\divKL{f}{g} \equiv \int_{-\infty}^\infty f(x) \log \frac{f(x)}{g(x)} dx$

with the conventions $0 \log 0 = 0$ and $0 \log \frac{0}{0}=0$ .

For two discrete distributions and over a finite set $\alphabet$ , the the Kullback-Leibler divergence is defined by:

$\divKL{P}{Q} \equiv \sum_{\rvl\in\alphabet} P(\rvl) \log \frac{P(\rvl)}{Q(\rvl)}$

The Jensen-Shannon divergence (JSD) [119] allows one to symmetrize and smooth the KLD by taking the average KLD of $\densdX$ and $\densdY$ to the average density $\dens$ , that is:

$\JSD{\densdX}{\densdY} \equiv \frac{1}{2}\left(\divKL{\densdX}{\dens} + \divKL{\densdY}{\dens}\right)$

Discrepancy measure. In taking the average in Eq. (eq-JSD), two random variables are implicity defined: a position variable $\rvP$ with density $\densdZ \equiv \dens$ and a binary label $\rvL$ that indicates from which original density (i.e. $\densdX$ or $\densdY$ ) an instance of $\rvP$ is obtained. Formally, considering the alphabet $\mathcal{A} = {\labelX,\labelY}$ and $X\sim F_X,Y\sim F_Y$ , the following pair of random variables is defined:

$(\rvL,\rvP)= \begin{cases} (\labelX,X) & \text{ with prob. } \frac{1}{2} \\ (\labelY,Y) & \text{ with prob. } \frac{1}{2} \end{cases}$

In the sequel, we will consider the conditional and unconditional mass functions $\pmflz=\probaCond{\rvL=\rvl}{\rvP=\rvp}$ and $\pmfl=\proba{\rvL=\rvl}=\frac{1}{2}$ respectively, as well as the joint probability density $\densdLZ$ . We will also use the notation $\densd{\rvl}$ to denote $\densdX$ (resp.~ $\densdY$ ) when $\rvl=\labelX$ (resp.~ $\rvl=\labelY$ ).

The connexion between the JS divergence and the previous (conditional) mass functions is given by [37] :

One has: $\JSD{\densdX}{\densdY} = \int_{\Rnt^d} \densdZ{z} \divKL{\cpmf[\cdot][\rvp]}{\cpmf[\cdot]} dz$

The previous lemma shows that the JSD can be seen as the average, over $z\in\Rnt^d$ , of the KLD between the conditional and unconditional distribution of labels. More formally, we define:

The discrepancy at location

is defined as the KL divergence: $\deltaz \equiv \divKL{\cpmf[\cdot][\rvp]}{\cpmf[\cdot]}.$

The following comments are in order:

$\deltaz$ ranges between 0 and 1 and is 0 iff $\densdX{\rvp}=\densdY{\rvp}$ .
Since $\pmfl$ is known but $\pmflz$ is not, the problem we consider now is the one of estimating $\pmflz$ at each given location $\rvp$ . In the sequel, such an estimator is denoted $\cestimi{n}{\rvl}{\rvp}$ .

Taking for granted such an estimator (see section Discrepancy estimation for its effective calculation), we finally obtain an estimator for $\deltaz$ :

$\deltazestim{n} \equiv \divKL{\cestimi{n}{\cdot}{\rvp}}{\cpmf[\cdot]}$

In practice, this estimate is computed using a regressor based on nearest neighbors, see section Discrepancy estimation . The number of neighbors used for each samples is set to $n^{\kofnpower}$ , with $\kofnpower$ a user defined parameter.

Joint distribution compatible sampling. Practically, the sizes and of the samples processed may be unbalanced, which would not be compatible with Eq. (eq-pair-LZ).

To ensure that balanced data sets are used in the estimation, we resort to random multiplexer generating samples , using a Bernoulli distribution to pick from the two data sets (Fig. fig-random-gadget ; see also details in [37] ).

The multiplexer halts when the data on one of the two data sets has been exhausted, and the resulting pairs are used to perform the estimation. Note that selected samples from the non exhausted data set may not be used, yielding some information loss. Therefore, we may repeat the process times, so as to eventually use all samples from the larger set. Upon performing these estimates, the discrepancy estimate for a given sample is taken as the median of the estimates.

For a given estimation fold, we denote $n' \leq n_0+n_1$ the number of samples selected by the random multiplexer. The regressor used to estimate the discrepancy uses these

samples only. On the other hand, using this model, the discrepancy is estimated on all

samples. See also Fig. fig-workflow-ddbc for the articulation between the three steps.

Step 2

To describe the second step in simple terms–see [37] for the details, we refer to the so-called watershed transform used in image processing or topographical analysis, see Watershed image processing and Drainage basin. In a nutshell, given a height function, called the landscape for short in the sequel, the watershed transform partitions its into its catchment basins, with one basin for each local minimum.

To transpose these notions in our setting, we proceed in two steps. First, we build a nearest neighbor graph connecting the samples—e.g. we connect each sample to its nearest neighbors. Second, we lift this graph into $\mathbb{R}^{d+1}$ by adding a new coordinate to each sample, namely its estimated discrepancy. This lifted graph provides a discrete representation for the aforementioned landscape. We now describe our method using this landscape (Fig. fig-samples-to-height-function ).

On the landscape, we focus on samples whose estimated discrepancy is less than $-\deltaMax$ , with $\deltaMax$ a user defined threshold. The region of the landscape matching this specification is called the sublevel set associated with $-\deltaMax$ . We define a cluster as a connected component of this sublevel set. Note that each such cluster may be associated with the lowest local minimum of this connected component. However, in case of noisy estimates, many small cluster may be reported. To ease the interpretation of results, we therefore smooth the landscape, and focus on c.c. of the smoothed landscape. Smoothing is carried out using topological persistence [79] , which consists of canceling local minima associated with non significant basins. More precisely, recall that the persistence of a local minimum is the height difference to the lowest saddle connecting this local minimum to a deeper local minimum. We retain local minima whose persistence is above a threshold $\rho$ .

To recap, our analysis aims at:

Identifying samples with significant discrepancy.
Grouping such samples into clusters, the samples in one cluster being associated with significant local maxima of the estimated discrepancy.

Method: algorithm illustrated on a toy 1D example
(A, discrepancy estimation) A discrepancy is estimated $\hat{\delta}(z_i)$ at each sample

(see text for details). (B, clusters as connected components) Upon reverting the sign of the discrepancy, clusters are defined as connected components (c.c.) of the region of the previous curve below a user defined threshold $-\deltaMax$ . On this example, four such components are observed, two large ones and two small ones. Note that each cluster can be charged to one local minimum of the function $-\hat{\delta}$ . (C, smoothed clusters upon applying topological persistence) To get rid of small clusters, the curve $-\hat{\delta}$ is smoothed using topological persistence. This simplification yields two clusters. Note that the blue sample squeezed between

and

does not appear in the red cluster since its discrepancy is below $\deltaMax$ (and likewise for the red point with respect to

and

).

Step 3

This step consists of adding up the discrepancies on a per sample basis in a cluster :

$\JSDi{C}{\densdX}{\densdY} \equiv \frac{1}{n}\sum_{z \in \seqLS{z}{n} \cap C}\deltazestim{n}$ $\JSDi{C}{\densdX}{\densdY} \equiv \frac{1}{\nbsamplesX+\nbsamplesY}\sum_{z \in (\seqLS{x}{\nbsamplesX} \cup \seqLS{y}{\nbsamplesY}) \cap C} \hat{\delta}(\rvp).$

Plots

Putting everything together, the program $\text{\feedback}$ performs the following plots:

Persistence diagram: A plot showing for each minima a red cross with coordinates corresponding to its birth and death dates respectively, while analyzing the landscape whose elevation is the negative estimated discrepancy.
Discrepancy cumulative distribution function: The cumulative distribution function of the estimated discrepancies.
Disconnectivity forest: a hierarchical representation of the merge events associated with sublevel sets of the estimated discrepancy.
Clusters: A plot similar to the raw data plot, with one color per cluster. The points not belonging to any cluster are colored in gray.
JSD decomposition plot: A 1D plot presenting a synthetic view without relying on the MDS embedding. The coordinate represents the sample space and always ranges from 0 to
1. The total estimated JSD is represented by the area under the dashed line. And the maximum possible JSD which is always 1 is represented by the area under the continuous line. One bar is depicted for each cluster plus another one (the last one) for the points not belonging to any cluster. The area of each bar represents the contribution of the corresponding cluster to the total JSD and its color corresponds to the proportion of samples $\labelX$ in the cluster.

For high dimensional date, we also provide the following 2D embeddings:

Raw data embedding: For samples embedded in 2D or 3D space, a plot of the points with a color to indicate the label (blue: $\labelX$ ; red: $\labelY$ ). For samples embedded in a higher dimensional space, a 2D embedding of these samples obtained using multi-dimensional scaling (MDS). In any case, the goal of this plot is to intuitively visualize the distributions of the two populations.
Discrepancy shaded data embedding — aka discrepancy plot: A plot similar to the raw data plot, except that each sample is color coded using a heat palette from fully transparent white to red across yellow, as a function of the value of the estimated discrepancy $\deltazestim{n}$ . That is, a point with $\deltazestim{n}$ equal to zero (resp.~one) is colored fully transparent white (resp.~non transparent red).

Random multiplexer generating pairs (label, position).

Specifications and File Types

In the sequel, we specify the main steps and the options offered, see also Fig. fig-workflow-ddbc .

Step 1: using sbl-ddbc-step-1-discrepancy.py and sbl-ddbc-2D-colored-embedding.py

The first step is carried out by the python program $\text{\feedback}$ . The script requires:

the two input data points, here 2D points,
the value theta0
the exponent used to define the knn regressor
the number of iterations B to define the discrepancy estimate

The script generates:

A file containing the coordinates, Point_d format, of the merged population (.pointd)
A file containing one line for each point in the merge population, and two values: first a label (0 or 1) indicating to which population the point belongs; and second, the estimated discrepancy (.vertices)
A file containing the negative estimated discrepancy – used for the persistence based analysis (.heightDiv)

Main options:

The main options of the program $\text{\feedback}$ are:
-f string: First input file in matrix format (one row = one observation/sample). NB: these samples are assigned the label 0.
-g string: Second input file in matrix format, as above. NB: these samples are assigned the label 1.
–directory string(= "."): Directory for the output files
–theta0 float(= 0.5): theta0 parameter for the random multiplexer
–regressor string: Regressor executable used to perform k NN regression
–kofn-power float(= 0.66): Power p in kofn=k^p, to set the number of neighbors used by the regressor
-B float(= 1): Number of iterations of the resampling process

Example:

> sbl-ddbc-step1-discrepancy.py -f data/ellipse_w_2_h_10_n_1000__X.dat  -g data/ellipse_w_2_h_10_n_1000__Y.dat  --theta0 0.5  --kofn-power  0.66 -B 1 --directory results

Optional step: 2D embedding

This step aims to provide a 2D embedding of the points, for visualization purposes. The embedding is obtained from Multi-dimensional scaling . While the default option consists of using the Euclidean distance to obtain the embedding, it is also possible to pass a more general distance matrix. Relevant such distances in the context of molecular modeling are the least Root Mean Square Deviation, as well as distances on the flat torus when dealing with internal coordinates. See the package Molecular_distances .

The corresponding plots may be used in the third step. It is carried out by the python program $\text{\ddbcPlot}$ . Its main input is the prefix of the output files of $\text{\feedback}$ . The script requires a prefix from which the required files are inferred.

The script generates 3 files:

points embedded into 2D i.e. xy coordinates, for visualization purposes
a colored image representing the two population embedded into 2D (.png)
a colored image representing the estimated discrepancy at each point (.png)

Main options:

The main options of the program $\text{\ddbcPlot}$ are:
-f string: File prefix for the merged population
-d string: Distance used to compute the equivalent of the Gram matrix from which the embedding is derived

Example:

> sbl-ddbc-2D-colored-embedding.py -f results/ellipse_w_2_h_10_n_1000__X_ellipse_w_2_h_10_n_1000__Y

Step 2: using sbl-ddbc-step2-clustering.py

The second step is carried out using the python program $\text{\ddbcClustering}$ . It uses the executable $\text{\sblMTBAnngeuclid}$ , from the SBL, and its main input is the set of points with their discrepancy, output of $\text{\feedback}$ . The script requires:

The file containing the coordinates of points in the merged population (.pointd)
The file containing the negative estimated discrepancy (.heightDiv, see step one)
The parameters required to run the persistence based analysis, see paper

The script generates a number of files, the key ones being:

The persistence diagram (pdf format)
The disconnectivity forest indicating the merge event yielding the clusters returned (eps format)
The clusters of points (xml file)

Main options:

The main options of the program $\text{\ddbcClustering}$ are:
-p string: Height function: samples i.e. observations, in point d format (.pointsWdim file format, see step 1)
-w string: Height function: elevation i.e. discrepancy computed at step 1 (.heightDiv file format, see step 1)
-d string(= "."): Directory for output files (default is working directory)
-o string: Prefix for output files
-n int: Number of neighbors used to build the Nearest Neighbors Graph
-P: Order preserving perturbation of heights to break ties between neighboring samples
–rho float: 'Persistence threshold rho (default: no persistence)
–deltamax float: Sublevelset threshold deltaMax (default: no sublevelset)
-v: Verbose mode

Example:

> sbl-ddbc-step2-clustering.py -p results/ellipse_w_2_h_10_n_1000__X_ellipse_w_2_h_10_n_1000__Y.pointsWdim -w results/ellipse_w_2_h_10_n_1000__X_ellipse_w_2_h_10_n_1000__Y.heightDiv -n 6 -P --rho 0.1 --deltamax 0.1 -d results -o ellipse_w_2_h_10_n_1000__X_ellipse_w_2_h_10_n_1000__Y -v

Note that if the executable $\text{\sblMTBAnngeuclid}$ is not in the path, the path to $\text{\sblMTBAnngeuclid}$ may be specified using the options -e.

Note that there are other files printed by the program $\text{\ddbcClustering}$ that are not of interest for this application. We refer the reader to the package Morse_theory_based_analyzer for more details about these files.

Step 3: using sbl-ddbc-step3-cluster-plots.py

The third step is carried out using the python program $\text{\ddbcPlotClustering}$ . Its main input is the clustering made by $\text{\ddbcClustering}$ , with the output points of the program $\text{\feedback}$ . The script requires:

The file providing the estimated discrepancy per point (.heighDiv, see step one)
The clusters (.xml, see step 2)
The optional 2D embedding computed at the optional step, for visualization purposes

The script generates three images:

A 2D embedding of the clusters (optional)
The discrepancy bars
The discrepancy cumulative distribution

Main options:

The main options of the program $\text{\ddbcPlotClustering}$ are:
-f string: Vertices filename generated in step 1 (.vertices file format, see step 1)
-c string: Connected components filename generated in step 2 (*__sublevelsets_connected_components.xml file format, see step 2)
-p string: Optional 2D embedding generated by sbl-ddbc-2D-colored-embedding.py (.2Dpoints file format)

Example:

> sbl-ddbc-step3-cluster-plots.py -f results/ellipse_w_2_h_10_n_1000__X_ellipse_w_2_h_10_n_1000__Y.vertices -c results/ellipse_w_2_h_10_n_1000__X_ellipse_w_2_h_10_n_1000__Y__sublevelsets_connected_components.xml -p results/ellipse_w_2_h_10_n_1000__X_ellipse_w_2_h_10_n_1000__Y.2Dpoints

Algorithms and Methods

Discrepancy estimation

In this section, we explain how to compute the estimator of Eq. (eq-deltaz), using the framework of regression [92] , and more precisely regressors based on nearest neighbors. We define:

Given a random vector

, where $Z\in\Rd$ and the response variable $R\in\Rnt$ , the $\text{\emph{regression function}}$ is defined as $m(z)=\condexpecd{R}{Z=z}.$

In the regression problem, the goal is to build an estimator of using a set of i.i.d.~realizations of .

In order to apply this framework to our problem, note that the correspondence yields

$m(\rvp)=\cpmf[1][\rvp].$

Then, we can use the following estimator for $\pmflz$

$\cestimi{n}{\rvl}{\rvp} \equiv | 1 - \rvl - m_n(\rvp) |.$

Note that it is required that $0\leq m_n(\rvp) \leq 1$ , since we aim at estimating a conditional probability, and that it is satisfied by the -nearest neighbor regressor.

Using Eq. (eq-estim-pmflz), we finally obtain an estimator for $\deltaz$ :

$\deltazestim{n} \equiv \divKL{\cestimi{n}{\cdot}{\rvp}}{\cpmf[\cdot]}$

Two options are provided to compute these estimates using k nearest neighbor regression:

Option 1: using scikitlearn, see sklearn.neighbors.KNeighborsRegressor
Option 2: using a k nearest neighbor regressor based on the spatial search engines Spatial_search from the SBL.

Pros and cons are as follows:

The scikitlearn implementation is a standard one, which may face difficulties (swap issues) for large datasets (for say databases with more than 1 million of points).
Regressors based on the Spatial_search use sparser data structures, and are also generic, since both the representation and the distance used can be specified. For example the following regressor uses points on the flat torus and the associated distance: $\text{\codecx{sbl-knn-regressor-angular.exe}}$ .

Persistence analysis

Precisely defining persistence is beyond the scope of this user manual, and the reader is referred to [79] , as well as to [37] for all details.

We comment on three issues, though:

Practically, our algorithm operates in a discrete setting wince the input consists of samples. The samples are used to build a nearest neighbor graph (NNG), by connecting each sample to a number of nearest neighbors (for a suitable distance metric), from which local minima and saddles can be identified [52] .
Equipped with a NNG, reporting the connected components (c.c.) of a sublevel set is a mere application of the so-called union-find algorithm, whose complexity is essentially linear in the number of samples processed. See [52] for details.
For large datasets though, in exploratory data analysis, the previous union-find algorithm may be called for varying values $-\deltaMax$ . In that case, a more suitable strategy consists of working on the critical points of the height function (the estimated discrepancy), once all samples which are regular points have been assigned to their so-called stable manifolds. The persistence can then be run on the graph of critical points only (the Morse-Smale-Witten complex), as explained in [34] .

Programmer's Workflow

Method: workflow.
In blue: the parameters.

External dependencies

The following two external dependencies are used in this applications:

Python Data Analysis Library, pandas, see http://pandas.pydata.org/ (BSD) : data frames manipulation functionalities.

scikit-learn, sklearn, see http://scikit-learn.org/stable/ (BSD) : used to perform dimensionality reduction (MDS) and regression.

Jupyter demo

See the following jupyter demos:

Jupyter notebook file
Density_difference_based_clustering
Density_difference_based_clustering¶

Python functions to execute the three steps¶
Step 1 : computing the discrepancy measure -- and a 2D embedding¶
Options¶
The options of the step1 method in the next cell are:

points1: First input file in matrix format

points2: Second input file in matrix format

theta0: theta0 parameter for the random multiplexer

kofnPower: Power p in kofn=k^p, to set the number of neighbors used by the regressor

resampling: Number of iterations of the resampling process
In [1]:

from SBL import SBL_pytools from SBL_pytools import SBL_pytools as sblpyt #help(sblpyt) import re #regular expressions import sys #misc system import os import pdb import shutil # python 3 only def step1(odir, points1, points2, theta0 = 0.5, kofnPower = 0.66, resampling = 1): if os.path.exists(odir): os.system("rm -rf %s" % odir) os.system( ("mkdir %s" % odir) ) # check executable exists and is visible exe = shutil.which("sbl-ddbc-step1-discrepancy.py") exe_opt = shutil.which("sbl-ddbc-2D-colored-embedding.py") if exe and exe_opt: print(("Using executable %s\n" % exe)) cmd = "sbl-ddbc-step1-discrepancy.py -f %s -g %s --directory %s" % (points1, points2, odir) cmd += " --theta0 %f --kofn-power %f -B %d > log.txt" % (theta0, kofnPower, resampling) print(("Running %s" % cmd)) os.system(cmd) #sblpyt.show_log_file(odir) cmd = "ls %s" % odir ofnames = os.popen(cmd).readlines() prefix = os.path.splitext(ofnames[0])[0] cmd = "%s -f %s/%s" % (exe_opt, odir, prefix) print(("Running %s" % cmd)) os.system(cmd) print("All output files:") print(os.popen(cmd).readlines()) images = [] images.append( sblpyt.find_and_convert("2Dembedding-discrepancy.png", odir, 100) ) images.append( sblpyt.find_and_convert("2Dembedding-populations.png", odir, 100) ) print(images) sblpyt.show_row_n_images(images, 100) else: print("Executable not found")
Step 2 : Clustering samples via persistence based analysis¶
Options¶
The options of the step1 method in the next cell are:

points: Samples in point d format

weights: Elevation file

n: Number of neighbors used to build the Nearest Neighbors Graph

P: Order preserving perturbation of heights to break ties between neighboring samples

rho: Persistence threshold rho

deltamax: Sublevelset threshold deltaMax
In [2]:

def step2(odir, ifn_points, ifn_weights, n = 6, P = True, rho = 0.1, deltamax = 0.1): # check executable exists and is visible exe = shutil.which("sbl-ddbc-step2-clustering.py") if exe: print(("Using executable %s\n" % exe)) cmd = "sbl-ddbc-step2-clustering.py -p %s/%s -w %s/%s -d %s -v" % (odir, ifn_points, odir, ifn_weights, odir) cmd += " -n %d --rho %f --deltamax %f" % (n, rho, deltamax) if P: cmd += " -P" cmd += " > log.txt" print(("Running %s" % cmd)) os.system(cmd) cmd = "ls %s" % odir ofnames = os.popen(cmd).readlines() print("All output files:",ofnames) prefix = os.path.splitext(ofnames[0])[0] images = [] images.append( sblpyt.find_and_convert("disconnectivity_forest.eps", odir, 100) ) images.append( sblpyt.find_and_convert("persistence_diagram.pdf", odir, 100) ) print(images) sblpyt.show_row_n_images(images, 100) else: print("Executable not found")
Step 3 : Cluster decomposition¶
Options¶
The options of the step3 method in the next cell are:

vertices: Vertices filename generated in step 1

components: Connected components filename generated in step 2

points: Optional 2D embedding generated by sbl-ddbc-2D-colored-embedding.py
In [3]:

def step3(odir, ifn_vertices, ifn_clusters, ifn_points): # check executable exists and is visible exe = shutil.which("sbl-ddbc-step3-cluster-plots.py") if exe: print(("Using executable %s\n" % exe)) cmd = "sbl-ddbc-step3-cluster-plots.py -f %s/%s -c %s/%s -p %s/%s > log.txt" % \ (odir,ifn_vertices, odir, ifn_clusters, odir, ifn_points) print(("Running %s" % cmd)) os.system(cmd) #log = open("log.txt").readlines() #for line in log: print(line.rstrip()) images = [] images.append( sblpyt.find_and_convert("step3-2Dembedding-clusters.png", odir, 100) ) images.append( sblpyt.find_and_convert("step3-bars.png", odir, 100) ) images.append( sblpyt.find_and_convert("step3-discrepancy-cumulative-distribution.png", odir, 100) ) sblpyt.show_row_n_images(images, 100) print(images) else: print("Executable not found")
Illustration 1: ellipsis in 2D¶

Step 1: computing the discrepancy¶
In [6]:

print("Marker : Calculation Started") odir_ellipsis = "exple-ellipsis" step1(odir_ellipsis, "data/ellipse_w_2_h_10_n_1000__X.dat", "data/ellipse_w_2_h_10_n_1000__Y.dat") print("Marker : Calculation Ended")
Marker : Calculation Started Using executable /user/fcazals/home/projects/proj-soft/sbl-install/bin/sbl-ddbc-step1-discrepancy.py Running sbl-ddbc-step1-discrepancy.py -f data/ellipse_w_2_h_10_n_1000__X.dat -g data/ellipse_w_2_h_10_n_1000__Y.dat --directory exple-ellipsis --theta0 0.500000 --kofn-power 0.660000 -B 1 > log.txt Running /user/fcazals/home/projects/proj-soft/sbl-install/bin/sbl-ddbc-2D-colored-embedding.py -f exple-ellipsis/ellipse_w_2_h_10_n_1000__X_ellipse_w_2_h_10_n_1000__Y All output files: [] ['exple-ellipsis/ellipse_w_2_h_10_n_1000__X_ellipse_w_2_h_10_n_1000__Y-step2-2Dembedding-discrepancy.png', 'exple-ellipsis/ellipse_w_2_h_10_n_1000__X_ellipse_w_2_h_10_n_1000__Y-step2-2Dembedding-populations.png']

Figs displayed Marker : Calculation Ended
Step 2: persistence based analysis¶
In [8]:

print("Marker : Calculation Started") step2(odir_ellipsis, "ellipse_w_2_h_10_n_1000__X_ellipse_w_2_h_10_n_1000__Y.pointd", "ellipse_w_2_h_10_n_1000__X_ellipse_w_2_h_10_n_1000__Y.heightDiv") print("Marker : Calculation Ended")
Marker : Calculation Started Using executable /user/fcazals/home/projects/proj-soft/sbl-install/bin/sbl-ddbc-step2-clustering.py Running sbl-ddbc-step2-clustering.py -p exple-ellipsis/ellipse_w_2_h_10_n_1000__X_ellipse_w_2_h_10_n_1000__Y.pointd -w exple-ellipsis/ellipse_w_2_h_10_n_1000__X_ellipse_w_2_h_10_n_1000__Y.heightDiv -d exple-ellipsis -v -n 6 --rho 0.100000 --deltamax 0.100000 -P > log.txt All output files: ['ellipse_w_2_h_10_n_1000__X_ellipse_w_2_h_10_n_1000__Y.2Dpoints\n', 'ellipse_w_2_h_10_n_1000__X_ellipse_w_2_h_10_n_1000__Y.heightDiv\n', 'ellipse_w_2_h_10_n_1000__X_ellipse_w_2_h_10_n_1000__Y.pointd\n', 'ellipse_w_2_h_10_n_1000__X_ellipse_w_2_h_10_n_1000__Y-step2-2Dembedding-discrepancy.png\n', 'ellipse_w_2_h_10_n_1000__X_ellipse_w_2_h_10_n_1000__Y-step2-2Dembedding-populations.png\n', 'ellipse_w_2_h_10_n_1000__X_ellipse_w_2_h_10_n_1000__Y.vertices\n', 'sbl-Morse-theory-based-analyzer-nng-euclid__disconnectivity_forest.eps\n', 'sbl-Morse-theory-based-analyzer-nng-euclid__MSW_chain_complex.xml\n', 'sbl-Morse-theory-based-analyzer-nng-euclid__nng.xml\n', 'sbl-Morse-theory-based-analyzer-nng-euclid__persistence_diagram.pdf\n', 'sbl-Morse-theory-based-analyzer-nng-euclid__persistence_diagram.plot\n', 'sbl-Morse-theory-based-analyzer-nng-euclid__persistence_diagram.txt\n', 'sbl-Morse-theory-based-analyzer-nng-euclid__persistence_histogram.pdf\n', 'sbl-Morse-theory-based-analyzer-nng-euclid__persistence_histogram.r\n', 'sbl-Morse-theory-based-analyzer-nng-euclid__persistence_histogram.txt\n', 'sbl-Morse-theory-based-analyzer-nng-euclid__stable_manifold_clusters.txt\n', 'sbl-Morse-theory-based-analyzer-nng-euclid__stable_manifold_partition.xml\n', 'sbl-Morse-theory-based-analyzer-nng-euclid__sublevelsets_connected_components.xml\n'] ['exple-ellipsis/sbl-Morse-theory-based-analyzer-nng-euclid__disconnectivity_forest.png', 'exple-ellipsis/sbl-Morse-theory-based-analyzer-nng-euclid__persistence_diagram.png']

Figs displayed Marker : Calculation Ended
Step 3: visualization of clusters¶
In [9]:

print("Marker : Calculation Started") step3(odir_ellipsis, "ellipse_w_2_h_10_n_1000__X_ellipse_w_2_h_10_n_1000__Y.vertices", "sbl-Morse-theory-based-analyzer-nng-euclid__sublevelsets_connected_components.xml", "ellipse_w_2_h_10_n_1000__X_ellipse_w_2_h_10_n_1000__Y.2Dpoints") print("Marker : Calculation Ended")
Marker : Calculation Started Using executable /user/fcazals/home/projects/proj-soft/sbl-install/bin/sbl-ddbc-step3-cluster-plots.py Running sbl-ddbc-step3-cluster-plots.py -f exple-ellipsis/ellipse_w_2_h_10_n_1000__X_ellipse_w_2_h_10_n_1000__Y.vertices -c exple-ellipsis/sbl-Morse-theory-based-analyzer-nng-euclid__sublevelsets_connected_components.xml -p exple-ellipsis/ellipse_w_2_h_10_n_1000__X_ellipse_w_2_h_10_n_1000__Y.2Dpoints > log.txt

Figs displayed ['exple-ellipsis/ellipse_w_2_h_10_n_1000__X_ellipse_w_2_h_10_n_1000__Y-step3-2Dembedding-clusters.png', 'exple-ellipsis/ellipse_w_2_h_10_n_1000__X_ellipse_w_2_h_10_n_1000__Y-step3-bars.png', 'exple-ellipsis/ellipse_w_2_h_10_n_1000__X_ellipse_w_2_h_10_n_1000__Y-step3-discrepancy-cumulative-distribution.png'] Marker : Calculation Ended
Illustration 2: 2D Gaussian¶
Another 2D example, with two populations of 2000 points, each drawn from a Gaussian mixture

the first population: 2000 points drawn according to a mixture of four spherical Gaussians with equal probability.

the second population: drawn according to a mixture of four non-spherical Gaussians with equal probability.
In [10]:

print("Marker : Calculation Started") odir_gaussian = "exple-gaussian" step1(odir_gaussian, "data/gaussianMixture6__d_2_n_2000_X.dat", "data/gaussianMixture6__d_2_n_2000_Y.dat") print("Marker : Calculation Ended")
Marker : Calculation Started Using executable /user/fcazals/home/projects/proj-soft/sbl-install/bin/sbl-ddbc-step1-discrepancy.py Running sbl-ddbc-step1-discrepancy.py -f data/gaussianMixture6__d_2_n_2000_X.dat -g data/gaussianMixture6__d_2_n_2000_Y.dat --directory exple-gaussian --theta0 0.500000 --kofn-power 0.660000 -B 1 > log.txt Running /user/fcazals/home/projects/proj-soft/sbl-install/bin/sbl-ddbc-2D-colored-embedding.py -f exple-gaussian/gaussianMixture6__d_2_n_2000_X_gaussianMixture6__d_2_n_2000_Y All output files: [] ['exple-gaussian/gaussianMixture6__d_2_n_2000_X_gaussianMixture6__d_2_n_2000_Y-step2-2Dembedding-discrepancy.png', 'exple-gaussian/gaussianMixture6__d_2_n_2000_X_gaussianMixture6__d_2_n_2000_Y-step2-2Dembedding-populations.png']

Figs displayed Marker : Calculation Ended
In [11]:

print("Marker : Calculation Started") step2(odir_gaussian, "gaussianMixture6__d_2_n_2000_X_gaussianMixture6__d_2_n_2000_Y.pointd", "gaussianMixture6__d_2_n_2000_X_gaussianMixture6__d_2_n_2000_Y.heightDiv", deltamax=0.25) print("Marker : Calculation Ended")
Marker : Calculation Started Using executable /user/fcazals/home/projects/proj-soft/sbl-install/bin/sbl-ddbc-step2-clustering.py Running sbl-ddbc-step2-clustering.py -p exple-gaussian/gaussianMixture6__d_2_n_2000_X_gaussianMixture6__d_2_n_2000_Y.pointd -w exple-gaussian/gaussianMixture6__d_2_n_2000_X_gaussianMixture6__d_2_n_2000_Y.heightDiv -d exple-gaussian -v -n 6 --rho 0.100000 --deltamax 0.250000 -P > log.txt All output files: ['gaussianMixture6__d_2_n_2000_X_gaussianMixture6__d_2_n_2000_Y.2Dpoints\n', 'gaussianMixture6__d_2_n_2000_X_gaussianMixture6__d_2_n_2000_Y.heightDiv\n', 'gaussianMixture6__d_2_n_2000_X_gaussianMixture6__d_2_n_2000_Y.pointd\n', 'gaussianMixture6__d_2_n_2000_X_gaussianMixture6__d_2_n_2000_Y-step2-2Dembedding-discrepancy.png\n', 'gaussianMixture6__d_2_n_2000_X_gaussianMixture6__d_2_n_2000_Y-step2-2Dembedding-populations.png\n', 'gaussianMixture6__d_2_n_2000_X_gaussianMixture6__d_2_n_2000_Y.vertices\n', 'sbl-Morse-theory-based-analyzer-nng-euclid__disconnectivity_forest.eps\n', 'sbl-Morse-theory-based-analyzer-nng-euclid__MSW_chain_complex.xml\n', 'sbl-Morse-theory-based-analyzer-nng-euclid__nng.xml\n', 'sbl-Morse-theory-based-analyzer-nng-euclid__persistence_diagram.pdf\n', 'sbl-Morse-theory-based-analyzer-nng-euclid__persistence_diagram.plot\n', 'sbl-Morse-theory-based-analyzer-nng-euclid__persistence_diagram.txt\n', 'sbl-Morse-theory-based-analyzer-nng-euclid__persistence_histogram.pdf\n', 'sbl-Morse-theory-based-analyzer-nng-euclid__persistence_histogram.r\n', 'sbl-Morse-theory-based-analyzer-nng-euclid__persistence_histogram.txt\n', 'sbl-Morse-theory-based-analyzer-nng-euclid__stable_manifold_clusters.txt\n', 'sbl-Morse-theory-based-analyzer-nng-euclid__stable_manifold_partition.xml\n', 'sbl-Morse-theory-based-analyzer-nng-euclid__sublevelsets_connected_components.xml\n'] ['exple-gaussian/sbl-Morse-theory-based-analyzer-nng-euclid__disconnectivity_forest.png', 'exple-gaussian/sbl-Morse-theory-based-analyzer-nng-euclid__persistence_diagram.png']

Figs displayed Marker : Calculation Ended
In [12]:

print("Marker : Calculation Started") step3(odir_gaussian, "gaussianMixture6__d_2_n_2000_X_gaussianMixture6__d_2_n_2000_Y.vertices", "sbl-Morse-theory-based-analyzer-nng-euclid__sublevelsets_connected_components.xml", "gaussianMixture6__d_2_n_2000_X_gaussianMixture6__d_2_n_2000_Y.2Dpoints") print("Marker : Calculation Ended")
Marker : Calculation Started Using executable /user/fcazals/home/projects/proj-soft/sbl-install/bin/sbl-ddbc-step3-cluster-plots.py Running sbl-ddbc-step3-cluster-plots.py -f exple-gaussian/gaussianMixture6__d_2_n_2000_X_gaussianMixture6__d_2_n_2000_Y.vertices -c exple-gaussian/sbl-Morse-theory-based-analyzer-nng-euclid__sublevelsets_connected_components.xml -p exple-gaussian/gaussianMixture6__d_2_n_2000_X_gaussianMixture6__d_2_n_2000_Y.2Dpoints > log.txt

Figs displayed ['exple-gaussian/gaussianMixture6__d_2_n_2000_X_gaussianMixture6__d_2_n_2000_Y-step3-2Dembedding-clusters.png', 'exple-gaussian/gaussianMixture6__d_2_n_2000_X_gaussianMixture6__d_2_n_2000_Y-step3-bars.png', 'exple-gaussian/gaussianMixture6__d_2_n_2000_X_gaussianMixture6__d_2_n_2000_Y-step3-discrepancy-cumulative-distribution.png'] Marker : Calculation Ended
Illustration 3: high dim data¶
In the following, we process 2 x 1600 points en dimension 784, from the MNIST dataset

An example illustrating the processing of real data in high dimension, we process a mixture of handwritten digits, each represented as an 28 x 28 gray-scale image -- see the MNIST dataset \cite lecun1998mnist .

The two populations see also \ref fig-digits-in-out , were chosen as follows:

blue population, 1600 digits: 100 digits 3, 500 digits 6, and 1000 digits 8.

red population, 1600 digits: 1000 digits 3, 500 digits 6, ans 100 digits 8.
In [13]:

print("Marker : Calculation Started") odir_mnist = "exple-mnist" step1(odir_mnist, "data/mnist368_100_500_1000_X.dat", "data/mnist368_1000_500_100_Y.dat") print("Marker : Calculation Ended")
Marker : Calculation Started Using executable /user/fcazals/home/projects/proj-soft/sbl-install/bin/sbl-ddbc-step1-discrepancy.py Running sbl-ddbc-step1-discrepancy.py -f data/mnist368_100_500_1000_X.dat -g data/mnist368_1000_500_100_Y.dat --directory exple-mnist --theta0 0.500000 --kofn-power 0.660000 -B 1 > log.txt Running /user/fcazals/home/projects/proj-soft/sbl-install/bin/sbl-ddbc-2D-colored-embedding.py -f exple-mnist/mnist368_100_500_1000_X_mnist368_1000_500_100_Y All output files: [] ['exple-mnist/mnist368_100_500_1000_X_mnist368_1000_500_100_Y-step2-2Dembedding-discrepancy.png', 'exple-mnist/mnist368_100_500_1000_X_mnist368_1000_500_100_Y-step2-2Dembedding-populations.png']

Figs displayed Marker : Calculation Ended
In [14]:

print("Marker : Calculation Started") step2(odir_mnist, "mnist368_100_500_1000_X_mnist368_1000_500_100_Y.pointd",\ "mnist368_100_500_1000_X_mnist368_1000_500_100_Y.heightDiv", n=30, deltamax=0.34) print("Marker : Calculation Ended")
Marker : Calculation Started Using executable /user/fcazals/home/projects/proj-soft/sbl-install/bin/sbl-ddbc-step2-clustering.py Running sbl-ddbc-step2-clustering.py -p exple-mnist/mnist368_100_500_1000_X_mnist368_1000_500_100_Y.pointd -w exple-mnist/mnist368_100_500_1000_X_mnist368_1000_500_100_Y.heightDiv -d exple-mnist -v -n 30 --rho 0.100000 --deltamax 0.340000 -P > log.txt All output files: ['mnist368_100_500_1000_X_mnist368_1000_500_100_Y.2Dpoints\n', 'mnist368_100_500_1000_X_mnist368_1000_500_100_Y.heightDiv\n', 'mnist368_100_500_1000_X_mnist368_1000_500_100_Y.pointd\n', 'mnist368_100_500_1000_X_mnist368_1000_500_100_Y-step2-2Dembedding-discrepancy.png\n', 'mnist368_100_500_1000_X_mnist368_1000_500_100_Y-step2-2Dembedding-populations.png\n', 'mnist368_100_500_1000_X_mnist368_1000_500_100_Y.vertices\n', 'sbl-Morse-theory-based-analyzer-nng-euclid__disconnectivity_forest.eps\n', 'sbl-Morse-theory-based-analyzer-nng-euclid__MSW_chain_complex.xml\n', 'sbl-Morse-theory-based-analyzer-nng-euclid__nng.xml\n', 'sbl-Morse-theory-based-analyzer-nng-euclid__persistence_diagram.pdf\n', 'sbl-Morse-theory-based-analyzer-nng-euclid__persistence_diagram.plot\n', 'sbl-Morse-theory-based-analyzer-nng-euclid__persistence_diagram.txt\n', 'sbl-Morse-theory-based-analyzer-nng-euclid__persistence_histogram.pdf\n', 'sbl-Morse-theory-based-analyzer-nng-euclid__persistence_histogram.r\n', 'sbl-Morse-theory-based-analyzer-nng-euclid__persistence_histogram.txt\n', 'sbl-Morse-theory-based-analyzer-nng-euclid__stable_manifold_clusters.txt\n', 'sbl-Morse-theory-based-analyzer-nng-euclid__stable_manifold_partition.xml\n', 'sbl-Morse-theory-based-analyzer-nng-euclid__sublevelsets_connected_components.xml\n'] ['exple-mnist/sbl-Morse-theory-based-analyzer-nng-euclid__disconnectivity_forest.png', 'exple-mnist/sbl-Morse-theory-based-analyzer-nng-euclid__persistence_diagram.png']

Figs displayed Marker : Calculation Ended
In [15]:

print("Marker : Calculation Started") step3(odir_mnist, "mnist368_100_500_1000_X_mnist368_1000_500_100_Y.vertices", "sbl-Morse-theory-based-analyzer-nng-euclid__sublevelsets_connected_components.xml", "mnist368_100_500_1000_X_mnist368_1000_500_100_Y.2Dpoints") print("Marker : Calculation Ended")
Marker : Calculation Started Using executable /user/fcazals/home/projects/proj-soft/sbl-install/bin/sbl-ddbc-step3-cluster-plots.py Running sbl-ddbc-step3-cluster-plots.py -f exple-mnist/mnist368_100_500_1000_X_mnist368_1000_500_100_Y.vertices -c exple-mnist/sbl-Morse-theory-based-analyzer-nng-euclid__sublevelsets_connected_components.xml -p exple-mnist/mnist368_100_500_1000_X_mnist368_1000_500_100_Y.2Dpoints > log.txt

Figs displayed ['exple-mnist/mnist368_100_500_1000_X_mnist368_1000_500_100_Y-step3-2Dembedding-clusters.png', 'exple-mnist/mnist368_100_500_1000_X_mnist368_1000_500_100_Y-step3-bars.png', 'exple-mnist/mnist368_100_500_1000_X_mnist368_1000_500_100_Y-step3-discrepancy-cumulative-distribution.png'] Marker : Calculation Ended

Table of Contents

Density_difference_based_clustering

Goals

Using the programs

Pre-requisites

Step 1

Step 2

Step 3

Plots

Specifications and File Types

Step 1: using sbl-ddbc-step-1-discrepancy.py and sbl-ddbc-2D-colored-embedding.py

Optional step: 2D embedding

Step 2: using sbl-ddbc-step2-clustering.py

Step 3: using sbl-ddbc-step3-cluster-plots.py

Algorithms and Methods

Discrepancy estimation

Persistence analysis

Programmer's Workflow

External dependencies

Jupyter demo

Density_difference_based_clustering¶

Python functions to execute the three steps¶

Step 1 : computing the discrepancy measure -- and a 2D embedding¶

Options¶

Step 2 : Clustering samples via persistence based analysis¶

Options¶

Step 3 : Cluster decomposition¶

Options¶

Illustration 1: ellipsis in 2D¶

Step 1: computing the discrepancy¶

Step 2: persistence based analysis¶

Step 3: visualization of clusters¶

Illustration 2: 2D Gaussian¶

Illustration 3: high dim data¶