Structural Bioinformatics Library
Template C++ / Python API for developping structural bioinformatics applications.
User Manual

Cluster_engines

Authors: F. Cazals and T. Dreyfus

Introduction

We provide a generic framework for clustering algorithms, together with specific implementations corresponding to various algorithms, namely

  • k-means. Given a pre-defined number of centers k, k-means partitions the input datasets into clusters – one per center, so as to minimize the sum of squared distances of samples to their nearest center. We provide in particular the smart seeding procedure from [12] .
  • Tomato. Tomato is a clustering algorithm defining clusters from local maxima of an estimated density. The method uses topological persistence to identify the most significant clusters, as detailed in [52] .

Implementation

Terminology

Consider a dataset $P$ consisting of $n$ observations (data points or points for short). A clustering is a partition of the input dataset into disjoint subsets called clusters.


We represent clusters as follows:

We use the following cluster representation: the clustering of $n$ points into $m$ clusters is specified by mapping each point to an integer in the range $0..m-1$.


A partial clustering is a collection of clusters whose union is strictly contained in the input dataset (e.g. some outliers have been removed).


Design

The package is designed around the module class SBL::Modules::T_Cluster_engine_module : while each engine is implemented in its own base file, the need of a common design for each engine is necessary for using a general cluster module. The two main engines are:

  • K-means, implemented by SBL::GT::T_Cluster_engine_k_means< PointType , VectorType , Distance > ; The parameter PointType is a representation of a geometric point, while the parameter VectorType is a representation of a geometric vector supporting operations like addition and division; the parameter Distance is a functor computing the distance between two points; the CGAL library naturally offers all these types, e.g using the class CGAL::Cartesian_d;

  • Morse theory based algorithm (see package Morse_theory_based_analyzer for details), implemented by SBL::GT::T_Cluster_engine_Morse_theory_based< PointType , Distance , TNNQuery , TGetDensity > ; The parameters PointType and Distance have the exact same sense as previously mentionned; the parameter TNNQuery is the spatial search engine used for building the nearest neighbors graph (NNG); the parameter TNNQuery is a template itself parametrized by a distance functor computing distances between vertices of NNG (this distance data structure is internal to the clustering engine); finally, the parameter TGetDensity is a functor computing the density of a vertex in the NNG; it is also a template parametrized by the type of NNG (which is defined internally to the clustering engine);

    Using these classes is especially easy, via the template class SBL::GT::T_Cluster_engine_workflow.

Functionnality

  1. Generic input: a file containing points, complying with the Point_d concept (a line list the dimension and then the number of coordinates).

  2. Generic output: a text file giving for each point, the index of its clusters (0..n-1 convention).
Visualization: we also provide a script displaying the clusters.


Examples

The following examples show how to use the K-means algorithm with the different selectors, and how to use the provided module.

Using k-means

This example loads an input set of points and computes 4 clusters using k-means algorithm using the random strategy for selecting the initial centers of mass.

Using Morse theory based strategy

This example loads an input set of points and computes 4 clusters using k-means algorithm using the random strategy for selecting the initial centers of mass.

Using the module

This example loads an input set of points and instantiates the K-means module for computing an input number of clusters using the default (random) strategy for selecting the initial centers of mass.

Applications

Programs

This package offers programs for computing a clustering of a set of points:

  • k-means : sbl-cluster-k-means-euclid.exe sbl-cluster-k-means-lrmsd.exe
  • the Morse theory based strategy: sbl-cluster-MTB-euclid.exe

Note that a given executable is qualified by (i) a space and the associated distance, and (ii) a method (k-means or Morse theory based).

Using the aforementioned workflow mechanism, adding new variations is especially easy.

k-means: main specifications

sbl-cluster-k-means-euclid.exe : a program comparing two sets of D-dimensional points using the Euclidean metric and k-means algorithm. Note that the input is a text file listing the points in the Point_d format (dimension followed by coordinates).

Main options. The main options are:

–points-file string: Input points with the d-dimensional point format
-k-means-k string: Number of clusters
–k-means-selector string: k-means strategy i.e. / random (default) / plusplus / minimax
–k-means-itermax string: Maximum number of iterations


Input. The main input consist of the data points and the number of clusters.

Main output. The main files reported are:

  • Log file, txt format, suffix __log.txt : main log file
  • Centers file, point_d format, suffix __centers.txt : coordinates in point_d format of the centers
  • Clusters file, txt format, suffix __clusters.txt : an ordered text file with on each line the index of the cluster the corresponding points belongs to.

Comments.

k-means minimizes, within each cluster, the sum of squared distances. In doing so, one uses the centroid of each cluster. Note that the centroid is distinct from the point minimizing the sum of distances to sample points, known as the Fermat–Weber point or geometric median, a problem still under scrutiny [59] . The variation of k-means using the median instead of the centroid is known as k-medians.


Tomato: main specifications

sbl-cluster-MTB-euclid.exe : a program comparing two sets of D-dimensional points using Euclidean metric and the package Morse_theory_based_analyzer.

Main options. The main options are as follows:

–points-file string: Input points with the d-dimensional point format
–num-neighbors string: Number of neighbors used to define the nearest neighbor graph
–persistence-threshold string: Threshold used to define local maxima whence clusters


Input. The main input is the point set in Point_d format (dimension followed by coordinates).

Main output.

First, one finds the same files as for k-means:

  • Log file, txt format, suffix __log.txt : main log file
  • Centers file, point_d format, suffix __centers.txt : coordinates in point_d format of the centers
  • Clusters file, txt format, suffix __clusters.txt : an ordered text file with on each line the index of the cluster the corresponding points belongs to.

Second, one finds the persistence diagram i.e. the coordinates of critical points:

  • Persistence diagram, txt format, suffix__persistence_diagram.txt: coordinates of points defining the persistence diagram.

Finally, one finds the bipartite graph connecting local maximma and associated saddle points – this graph is known as the Morse-Smale-Witten chain complex:

Optional output.

Comments.

Tomato performs an analysis of the density estimate by sweeping the height function provided by the density estimate, from top top bottom. A local maximum dies when it merges with another local maximum, upon encountering a pass (an index d-1 saddle) connecting them. Consequently, the death date of a local maximum is less than its birth date. This explains points on the persistence diagram are located below the diagonal y=x. We note that Tomato is a simplified version of programs provided by the package Morse_theory_based_analyzer. For a more complete analysis of a cloud of points, please use the package Morse_theory_based_analyzer.


Visualization of clusters in 2D

In two dimensions, one can plot clusters with the script sbl-clusters-display.py.

For the first and second examples above, generated by kmeans (with random or ++ initialization):

sbl-clusters-display.py  -f points-N200-d50.txt  -c sbl-cluster-k-means-euclid__clusters.txt -C sbl-cluster-k-means-euclid__centers.txt

For the third example above, generated by Tomato:

sbl-clusters-display.py -f points-N200-d50.txt -c sbl-cluster-MTB-euclid__clusters.txt -p sbl-cluster-MTB-euclid__persistence_diagram.txt

Jupyter demo

See the following jupyter notebook:

  • Jupyter notebook file
  • Cluster_engines

    Cluster_engines

    Useful functions

    In [1]:
    from SBL import sbl_jupyter as sblj
    help(sblj)
    
    Help on module SBL.sbl_jupyter in SBL:
    
    NAME
        SBL.sbl_jupyter
    
    DESCRIPTION
        ## Load as follows
        #from SBL import sbl_jupyter as sblj
        #help(sblj)
        #
        ## Use as follows
        # sblj.SBLjupyter.find_and_show_images(".png", odir, 50)
    
    CLASSES
        builtins.object
            tools
        
        class tools(builtins.object)
         |  Static methods defined here:
         |  
         |  convert_eps_to_png(ifname, osize)
         |  
         |  convert_pdf_to_png(ifname, osize)
         |  
         |  find_and_convert(suffix, odir, osize)
         |      # find file with suffix, convert, and return image file
         |  
         |  find_and_show_images(suffix, odir, osize)
         |  
         |  find_file_in_output_directory(suffix, odir)
         |  
         |  show_eps_file(eps_file, osize)
         |  
         |  show_image(img)
         |  
         |  show_log_file(odir)
         |  
         |  show_pdf_file(pdf_file)
         |  
         |  show_row_n_images(images, size)
         |  
         |  ----------------------------------------------------------------------
         |  Data descriptors defined here:
         |  
         |  __dict__
         |      dictionary for instance variables (if defined)
         |  
         |  __weakref__
         |      list of weak references to the object (if defined)
    
    FILE
        /user/fcazals/home/projects/proj-soft/sbl/python/SBL/sbl_jupyter.py
    
    
    

    Part I : k-means

    Options.

    The main options of the runKMeans method in the next cell are:

    • points: Text file listing D-dimensional points
    • k: number of output clusters
    • selector: selection of initial seeds (random, plusplus or minimax)
    • itermax: maximal number of iterations
    In [2]:
    import re  #regular expressions
    import sys #misc system
    import os
    import pdb
    import shutil # python 3 only
    
    odir = "tmp-results-kmeans-euclid"
    if os.path.exists(odir):
        os.system("rm -rf %s" % odir)
    os.system( ("mkdir %s" % odir) )
    
    def run_k_means(points, k, selector = "plusplus", itermax = 10): 
           
        # check executable exists and is visible
        exe = shutil.which("sbl-cluster-k-means-euclid.exe")
        if exe:
            print(("Using executable %s\n" % exe))
            cmd = "sbl-cluster-k-means-euclid.exe -v -l -d %s --points-file %s --k-means-k %d\
                  --k-means-selector %s --k-means-itermax %d" %\
            (odir, points, k, selector, itermax)
            print("Running:", cmd)
            os.system(cmd)
            
            cmd = "ls %s" % odir
            ofnames = os.popen(cmd).readlines()
            print("All output files:",ofnames)
            
            #find the log file and display log file
            sblj.tools.show_log_file(odir)
            
            exed = shutil.which("sbl-clusters-display.py")
            if exed:
                cmd = "%s -f %s  \
                    -c %s/sbl-cluster-k-means-euclid__clusters.txt  -C %s/sbl-cluster-k-means-euclid__centers.txt -o %s" % (exed, points,odir,odir,odir)
                print("\n Display command",cmd)
                os.system(cmd)
            
        else:
            print("Executable not found")
            
    
    In [3]:
    ifiles = ["data/points-N100-d2.txt", "data/points-N100-d4.txt", "data/points-N100-d6.txt", "data/points-N200-d50.txt"]
    #ifiles = [ "data/points-N200-d50.txt"]
    
    for ifile in ifiles:
        print("\nMarker : Calculation Started")
        run_k_means(ifile, 4)
        print("Marker : Calculation Ended")
    
    Marker : Calculation Started
    Using executable /user/fcazals/home/projects/proj-soft/sbl-install/bin/sbl-cluster-k-means-euclid.exe
    
    Running: sbl-cluster-k-means-euclid.exe -v -l -d tmp-results-kmeans-euclid --points-file data/points-N100-d2.txt --k-means-k 4              --k-means-selector plusplus --k-means-itermax 10
    All output files: ['sbl-cluster-k-means-euclid__centers.txt\n', 'sbl-cluster-k-means-euclid__clusters.txt\n', 'sbl-cluster-k-means-euclid__log.txt\n']
    Log file is: tmp-results-kmeans-euclid/sbl-cluster-k-means-euclid__log.txt
    Running:  sbl-cluster-k-means-euclid.exe -v -l -d tmp-results-kmeans-euclid --points-file data/points-N100-d2.txt --k-means-k 4 --k-means-selector plusplus --k-means-itermax 10
    
    D-points Loader
    Statistics:
    Number of loaded data: 1
    -- Number of loaded points in ensemble: 500
    
    Cluster Engine
    Statistics:
    -- Number of clusters : 4
    -- Size of clusters : (0, 125) (1, 70) (2, 142) (3, 163)
    
    Report...
    
    End Run
    
    General Statistics:
    
    Times elapsed for computations (in seconds):
    -- Cluster Engine: 0.001969
    Total: 0.001969
    
    
     Display command /user/fcazals/home/projects/proj-soft/sbl-install/scripts/sbl-clusters-display.py -f data/points-N100-d2.txt                  -c tmp-results-kmeans-euclid/sbl-cluster-k-means-euclid__clusters.txt  -C tmp-results-kmeans-euclid/sbl-cluster-k-means-euclid__centers.txt -o tmp-results-kmeans-euclid
    Marker : Calculation Ended
    
    Marker : Calculation Started
    Using executable /user/fcazals/home/projects/proj-soft/sbl-install/bin/sbl-cluster-k-means-euclid.exe
    
    Running: sbl-cluster-k-means-euclid.exe -v -l -d tmp-results-kmeans-euclid --points-file data/points-N100-d4.txt --k-means-k 4              --k-means-selector plusplus --k-means-itermax 10
    All output files: ['points-N100-d2--sbl-cluster-k-means-euclid__clusters--sbl-cluster-k-means-euclid__centers.png\n', 'sbl-cluster-k-means-euclid__centers.txt\n', 'sbl-cluster-k-means-euclid__clusters.txt\n', 'sbl-cluster-k-means-euclid__log.txt\n']
    Log file is: tmp-results-kmeans-euclid/sbl-cluster-k-means-euclid__log.txt
    Running:  sbl-cluster-k-means-euclid.exe -v -l -d tmp-results-kmeans-euclid --points-file data/points-N100-d4.txt --k-means-k 4 --k-means-selector plusplus --k-means-itermax 10
    
    D-points Loader
    Statistics:
    Number of loaded data: 1
    -- Number of loaded points in ensemble: 500
    
    Cluster Engine
    Statistics:
    -- Number of clusters : 4
    -- Size of clusters : (0, 189) (1, 159) (2, 68) (3, 84)
    
    Report...
    
    End Run
    
    General Statistics:
    
    Times elapsed for computations (in seconds):
    -- Cluster Engine: 0.002000
    Total: 0.002000
    
    
     Display command /user/fcazals/home/projects/proj-soft/sbl-install/scripts/sbl-clusters-display.py -f data/points-N100-d4.txt                  -c tmp-results-kmeans-euclid/sbl-cluster-k-means-euclid__clusters.txt  -C tmp-results-kmeans-euclid/sbl-cluster-k-means-euclid__centers.txt -o tmp-results-kmeans-euclid
    Marker : Calculation Ended
    
    Marker : Calculation Started
    Using executable /user/fcazals/home/projects/proj-soft/sbl-install/bin/sbl-cluster-k-means-euclid.exe
    
    Running: sbl-cluster-k-means-euclid.exe -v -l -d tmp-results-kmeans-euclid --points-file data/points-N100-d6.txt --k-means-k 4              --k-means-selector plusplus --k-means-itermax 10
    All output files: ['points-N100-d2--sbl-cluster-k-means-euclid__clusters--sbl-cluster-k-means-euclid__centers.png\n', 'points-N100-d4--sbl-cluster-k-means-euclid__clusters--sbl-cluster-k-means-euclid__centers.png\n', 'sbl-cluster-k-means-euclid__centers.txt\n', 'sbl-cluster-k-means-euclid__clusters.txt\n', 'sbl-cluster-k-means-euclid__log.txt\n']
    Log file is: tmp-results-kmeans-euclid/sbl-cluster-k-means-euclid__log.txt
    Running:  sbl-cluster-k-means-euclid.exe -v -l -d tmp-results-kmeans-euclid --points-file data/points-N100-d6.txt --k-means-k 4 --k-means-selector plusplus --k-means-itermax 10
    
    D-points Loader
    Statistics:
    Number of loaded data: 1
    -- Number of loaded points in ensemble: 500
    
    Cluster Engine
    Statistics:
    -- Number of clusters : 4
    -- Size of clusters : (0, 146) (1, 149) (2, 126) (3, 79)
    
    Report...
    
    End Run
    
    General Statistics:
    
    Times elapsed for computations (in seconds):
    -- Cluster Engine: 0.002456
    Total: 0.002456
    
    
     Display command /user/fcazals/home/projects/proj-soft/sbl-install/scripts/sbl-clusters-display.py -f data/points-N100-d6.txt                  -c tmp-results-kmeans-euclid/sbl-cluster-k-means-euclid__clusters.txt  -C tmp-results-kmeans-euclid/sbl-cluster-k-means-euclid__centers.txt -o tmp-results-kmeans-euclid
    Marker : Calculation Ended
    
    Marker : Calculation Started
    Using executable /user/fcazals/home/projects/proj-soft/sbl-install/bin/sbl-cluster-k-means-euclid.exe
    
    Running: sbl-cluster-k-means-euclid.exe -v -l -d tmp-results-kmeans-euclid --points-file data/points-N200-d50.txt --k-means-k 4              --k-means-selector plusplus --k-means-itermax 10
    All output files: ['points-N100-d2--sbl-cluster-k-means-euclid__clusters--sbl-cluster-k-means-euclid__centers.png\n', 'points-N100-d4--sbl-cluster-k-means-euclid__clusters--sbl-cluster-k-means-euclid__centers.png\n', 'points-N100-d6--sbl-cluster-k-means-euclid__clusters--sbl-cluster-k-means-euclid__centers.png\n', 'sbl-cluster-k-means-euclid__centers.txt\n', 'sbl-cluster-k-means-euclid__clusters.txt\n', 'sbl-cluster-k-means-euclid__log.txt\n']
    Log file is: tmp-results-kmeans-euclid/sbl-cluster-k-means-euclid__log.txt
    Running:  sbl-cluster-k-means-euclid.exe -v -l -d tmp-results-kmeans-euclid --points-file data/points-N200-d50.txt --k-means-k 4 --k-means-selector plusplus --k-means-itermax 10
    
    D-points Loader
    Statistics:
    Number of loaded data: 1
    -- Number of loaded points in ensemble: 1000
    
    Cluster Engine
    Statistics:
    -- Number of clusters : 4
    -- Size of clusters : (0, 400) (1, 199) (2, 200) (3, 201)
    
    Report...
    
    End Run
    
    General Statistics:
    
    Times elapsed for computations (in seconds):
    -- Cluster Engine: 0.001886
    Total: 0.001886
    
    
     Display command /user/fcazals/home/projects/proj-soft/sbl-install/scripts/sbl-clusters-display.py -f data/points-N200-d50.txt                  -c tmp-results-kmeans-euclid/sbl-cluster-k-means-euclid__clusters.txt  -C tmp-results-kmeans-euclid/sbl-cluster-k-means-euclid__centers.txt -o tmp-results-kmeans-euclid
    Marker : Calculation Ended
    

    Let us display the clusters

    In [4]:
    odir = "tmp-results-kmeans-euclid"
    sblj.tools.find_and_show_images(".png", odir, 50)
    
    Figs displayed
    

    Part II : Morse theory based clustering

    Options.

    The main options of the runMTB method in the next cell are:

    • points: Text file listing D-dimensional points
    • nbNeighbors: number of neighbors for the NNG computation
    • distance: distance range for the nearest neighbor graph (NNG) computation
    • persistence: persistence threshold
    In [5]:
    import sys #misc system
    import os
    import pdb
    import shutil # python 3 only
    
    odir = "tmp-results-MTB-euclid"
    if os.path.exists(odir):
        os.system("rm -rf %s" % odir)
    os.system( ("mkdir %s" % odir) )
    
    
    def run_MTB(points, nbNeighbors = 8, distance = None, persistence = -1): 
       
        
        # check executable exists and is visible
        exe = shutil.which("sbl-cluster-MTB-euclid.exe")
        if exe:
            print(("Using executable %s\n" % exe))
            if distance:
                cmd = "sbl-cluster-MTB-euclid.exe -v -l -d %s --points-file %s\
                      --distance-range %f --persistence-threshold %f" %\
                      (odir, points, distance, persistence)
            else:
                cmd = "sbl-cluster-MTB-euclid.exe -v -l -d %s --points-file %s --num-neighbors %d\
                      --persistence-threshold %f" %\
                      (odir, points, nbNeighbors, persistence)
            print("Running:", cmd)
            os.system(cmd)
            
            cmd = "ls %s" % odir
            ofnames = os.popen(cmd).readlines()
            print("All output files:",ofnames)
            
            #find the log file and display log file
            cmd = "find %s -name \"*log.txt\"" % odir
            lines = os.popen(cmd).readlines()
            if len(lines) > 0:
                lfname = lines[0].rstrip()
                print("Log file is:", lfname)
                log = open(lfname).readlines()
                for line in log:         print(line.rstrip())
            
                
            exed = shutil.which("sbl-clusters-display.py")
            if exed:
                cmd = "%s -f %s  \
                    -c %s/sbl-cluster-MTB-euclid__clusters.txt  \
                    -C %s/sbl-cluster-MTB-euclid__centers.txt \
                    -p %s/sbl-cluster-MTB-euclid__persistence_diagram.txt -o %s" % (exed, points,odir,odir,odir,odir)
     
                print("\n Display command",cmd)
                os.system(cmd)
        else:
            print("Executable not found")
    
    In [6]:
    ifiles = ["data/points-N100-d2.txt", "data/points-N100-d4.txt", "data/points-N100-d6.txt", "data/points-N200-d50.txt"]
    #ifiles = ["data/points-N100-d2.txt"]
    
    for ifile in ifiles:
        print("Marker : Calculation Started")
        run_MTB(ifile, persistence = 0.01)      
        print("Marker : Calculation Ended")
    
    Marker : Calculation Started
    Using executable /user/fcazals/home/projects/proj-soft/sbl-install/bin/sbl-cluster-MTB-euclid.exe
    
    Running: sbl-cluster-MTB-euclid.exe -v -l -d tmp-results-MTB-euclid --points-file data/points-N100-d2.txt --num-neighbors 8                  --persistence-threshold 0.010000
    All output files: ['sbl-cluster-MTB-euclid__centers.txt\n', 'sbl-cluster-MTB-euclid__clusters.txt\n', 'sbl-cluster-MTB-euclid__log.txt\n', 'sbl-cluster-MTB-euclid__msw_edges.txt\n', 'sbl-cluster-MTB-euclid__msw_points.txt\n', 'sbl-cluster-MTB-euclid__msw_weights.txt\n', 'sbl-cluster-MTB-euclid__persistence_diagram.txt\n']
    Log file is: tmp-results-MTB-euclid/sbl-cluster-MTB-euclid__log.txt
    Running:  sbl-cluster-MTB-euclid.exe -v -l -d tmp-results-MTB-euclid --points-file data/points-N100-d2.txt --num-neighbors 8 --persistence-threshold 0.010000
    
    D-points Loader
    Statistics:
    Number of loaded data: 1
    -- Number of loaded points in ensemble: 500
    
    Cluster Engine
    Statistics:
    -- Number of clusters : 5
    -- Size of clusters : (0, 16) (1, 463) (2, 18) (3, 2) (4, 1)
    
    Report...
    
    End Run
    
    General Statistics:
    
    Times elapsed for computations (in seconds):
    -- Cluster Engine: 0.037940
    Total: 0.037940
    
    
     Display command /user/fcazals/home/projects/proj-soft/sbl-install/scripts/sbl-clusters-display.py -f data/points-N100-d2.txt                  -c tmp-results-MTB-euclid/sbl-cluster-MTB-euclid__clusters.txt                  -C tmp-results-MTB-euclid/sbl-cluster-MTB-euclid__centers.txt                 -p tmp-results-MTB-euclid/sbl-cluster-MTB-euclid__persistence_diagram.txt -o tmp-results-MTB-euclid
    Marker : Calculation Ended
    Marker : Calculation Started
    Using executable /user/fcazals/home/projects/proj-soft/sbl-install/bin/sbl-cluster-MTB-euclid.exe
    
    Running: sbl-cluster-MTB-euclid.exe -v -l -d tmp-results-MTB-euclid --points-file data/points-N100-d4.txt --num-neighbors 8                  --persistence-threshold 0.010000
    All output files: ['points-N100-d2--sbl-cluster-MTB-euclid__clusters--persistences.png\n', 'points-N100-d2--sbl-cluster-MTB-euclid__clusters--sbl-cluster-MTB-euclid__centers.png\n', 'sbl-cluster-MTB-euclid__centers.txt\n', 'sbl-cluster-MTB-euclid__clusters.txt\n', 'sbl-cluster-MTB-euclid__log.txt\n', 'sbl-cluster-MTB-euclid__msw_edges.txt\n', 'sbl-cluster-MTB-euclid__msw_points.txt\n', 'sbl-cluster-MTB-euclid__msw_weights.txt\n', 'sbl-cluster-MTB-euclid__persistence_diagram.txt\n']
    Log file is: tmp-results-MTB-euclid/sbl-cluster-MTB-euclid__log.txt
    Running:  sbl-cluster-MTB-euclid.exe -v -l -d tmp-results-MTB-euclid --points-file data/points-N100-d4.txt --num-neighbors 8 --persistence-threshold 0.010000
    
    D-points Loader
    Statistics:
    Number of loaded data: 1
    -- Number of loaded points in ensemble: 500
    
    Cluster Engine
    Statistics:
    -- Number of clusters : 5
    -- Size of clusters : (0, 487) (1, 10) (2, 1) (3, 1) (4, 1)
    
    Report...
    
    End Run
    
    General Statistics:
    
    Times elapsed for computations (in seconds):
    -- Cluster Engine: 0.032958
    Total: 0.032958
    
    
     Display command /user/fcazals/home/projects/proj-soft/sbl-install/scripts/sbl-clusters-display.py -f data/points-N100-d4.txt                  -c tmp-results-MTB-euclid/sbl-cluster-MTB-euclid__clusters.txt                  -C tmp-results-MTB-euclid/sbl-cluster-MTB-euclid__centers.txt                 -p tmp-results-MTB-euclid/sbl-cluster-MTB-euclid__persistence_diagram.txt -o tmp-results-MTB-euclid
    Marker : Calculation Ended
    Marker : Calculation Started
    Using executable /user/fcazals/home/projects/proj-soft/sbl-install/bin/sbl-cluster-MTB-euclid.exe
    
    Running: sbl-cluster-MTB-euclid.exe -v -l -d tmp-results-MTB-euclid --points-file data/points-N100-d6.txt --num-neighbors 8                  --persistence-threshold 0.010000
    All output files: ['points-N100-d2--sbl-cluster-MTB-euclid__clusters--persistences.png\n', 'points-N100-d2--sbl-cluster-MTB-euclid__clusters--sbl-cluster-MTB-euclid__centers.png\n', 'points-N100-d4--sbl-cluster-MTB-euclid__clusters--persistences.png\n', 'points-N100-d4--sbl-cluster-MTB-euclid__clusters--sbl-cluster-MTB-euclid__centers.png\n', 'sbl-cluster-MTB-euclid__centers.txt\n', 'sbl-cluster-MTB-euclid__clusters.txt\n', 'sbl-cluster-MTB-euclid__log.txt\n', 'sbl-cluster-MTB-euclid__msw_edges.txt\n', 'sbl-cluster-MTB-euclid__msw_points.txt\n', 'sbl-cluster-MTB-euclid__msw_weights.txt\n', 'sbl-cluster-MTB-euclid__persistence_diagram.txt\n']
    Log file is: tmp-results-MTB-euclid/sbl-cluster-MTB-euclid__log.txt
    Running:  sbl-cluster-MTB-euclid.exe -v -l -d tmp-results-MTB-euclid --points-file data/points-N100-d6.txt --num-neighbors 8 --persistence-threshold 0.010000
    
    D-points Loader
    Statistics:
    Number of loaded data: 1
    -- Number of loaded points in ensemble: 500
    
    Cluster Engine
    Statistics:
    -- Number of clusters : 4
    -- Size of clusters : (0, 497) (1, 1) (2, 1) (3, 1)
    
    Report...
    
    End Run
    
    General Statistics:
    
    Times elapsed for computations (in seconds):
    -- Cluster Engine: 0.032477
    Total: 0.032477
    
    
     Display command /user/fcazals/home/projects/proj-soft/sbl-install/scripts/sbl-clusters-display.py -f data/points-N100-d6.txt                  -c tmp-results-MTB-euclid/sbl-cluster-MTB-euclid__clusters.txt                  -C tmp-results-MTB-euclid/sbl-cluster-MTB-euclid__centers.txt                 -p tmp-results-MTB-euclid/sbl-cluster-MTB-euclid__persistence_diagram.txt -o tmp-results-MTB-euclid
    Marker : Calculation Ended
    Marker : Calculation Started
    Using executable /user/fcazals/home/projects/proj-soft/sbl-install/bin/sbl-cluster-MTB-euclid.exe
    
    Running: sbl-cluster-MTB-euclid.exe -v -l -d tmp-results-MTB-euclid --points-file data/points-N200-d50.txt --num-neighbors 8                  --persistence-threshold 0.010000
    All output files: ['points-N100-d2--sbl-cluster-MTB-euclid__clusters--persistences.png\n', 'points-N100-d2--sbl-cluster-MTB-euclid__clusters--sbl-cluster-MTB-euclid__centers.png\n', 'points-N100-d4--sbl-cluster-MTB-euclid__clusters--persistences.png\n', 'points-N100-d4--sbl-cluster-MTB-euclid__clusters--sbl-cluster-MTB-euclid__centers.png\n', 'points-N100-d6--sbl-cluster-MTB-euclid__clusters--persistences.png\n', 'points-N100-d6--sbl-cluster-MTB-euclid__clusters--sbl-cluster-MTB-euclid__centers.png\n', 'sbl-cluster-MTB-euclid__centers.txt\n', 'sbl-cluster-MTB-euclid__clusters.txt\n', 'sbl-cluster-MTB-euclid__log.txt\n', 'sbl-cluster-MTB-euclid__msw_edges.txt\n', 'sbl-cluster-MTB-euclid__msw_points.txt\n', 'sbl-cluster-MTB-euclid__msw_weights.txt\n', 'sbl-cluster-MTB-euclid__persistence_diagram.txt\n']
    Log file is: tmp-results-MTB-euclid/sbl-cluster-MTB-euclid__log.txt
    Running:  sbl-cluster-MTB-euclid.exe -v -l -d tmp-results-MTB-euclid --points-file data/points-N200-d50.txt --num-neighbors 8 --persistence-threshold 0.010000
    
    D-points Loader
    Statistics:
    Number of loaded data: 1
    -- Number of loaded points in ensemble: 1000
    
    Cluster Engine
    Statistics:
    -- Number of clusters : 15
    -- Size of clusters : (0, 193) (1, 199) (2, 192) (3, 400) (4, 3) (5, 2) (6, 2) (7, 2) (8, 1) (9, 1) (10, 1) (11, 1) (12, 1) (13, 1) (14, 1)
    
    Report...
    
    End Run
    
    General Statistics:
    
    Times elapsed for computations (in seconds):
    -- Cluster Engine: 0.068028
    Total: 0.068028
    
    
     Display command /user/fcazals/home/projects/proj-soft/sbl-install/scripts/sbl-clusters-display.py -f data/points-N200-d50.txt                  -c tmp-results-MTB-euclid/sbl-cluster-MTB-euclid__clusters.txt                  -C tmp-results-MTB-euclid/sbl-cluster-MTB-euclid__centers.txt                 -p tmp-results-MTB-euclid/sbl-cluster-MTB-euclid__persistence_diagram.txt -o tmp-results-MTB-euclid
    Marker : Calculation Ended
    

    Let us display the clusters together with the persistence diagrams (PD)

    For PD, note that points are below the diagoal: one cluster is defined as a mode of the dentisy estimated associated with the point cloud, so that we study the stable super-level sets of this estimated density.

    In [7]:
    odir = "tmp-results-MTB-euclid"
    sblj.tools.find_and_show_images("euclid__centers.png", odir, 50)
    sblj.tools.find_and_show_images("clusters--persistences.png", odir, 50)
    
    Figs displayed
    Figs displayed