Structural Bioinformatics Library
Template C++ / Python API for developping structural bioinformatics applications.

Authors: F. Cazals and T. Dreyfus
We provide a generic framework for clustering algorithms, together with specific implementations corresponding to various algorithms, namely
The package is designed around the module class SBL::Modules::T_Cluster_engine_module : while each engine is implemented in its own base file, the need of a common design for each engine is necessary for using a general cluster module. The two main engines are:
Kmeans, implemented by SBL::GT::T_Cluster_engine_k_means< PointType , VectorType , Distance > ; The parameter PointType is a representation of a geometric point, while the parameter VectorType is a representation of a geometric vector supporting operations like addition and division; the parameter Distance is a functor computing the distance between two points; the CGAL library naturally offers all these types, e.g using the class CGAL::Cartesian_d;
Morse theory based algorithm (see package Morse_theory_based_analyzer for details), implemented by SBL::GT::T_Cluster_engine_Morse_theory_based< PointType , Distance , TNNQuery , TGetDensity > ; The parameters PointType and Distance have the exact same sense as previously mentionned; the parameter TNNQuery is the spatial search engine used for building the nearest neighbors graph (NNG); the parameter TNNQuery is a template itself parametrized by a distance functor computing distances between vertices of NNG (this distance data structure is internal to the clustering engine); finally, the parameter TGetDensity is a functor computing the density of a vertex in the NNG; it is also a template parametrized by the type of NNG (which is defined internally to the clustering engine);
Generic input: a file containing points, complying with the Point_d concept (a line list the dimension and then the number of coordinates).
The following examples show how to use the Kmeans algorithm with the different selectors, and how to use the provided module.
This example loads an input set of points and computes 4 clusters using kmeans algorithm using the random strategy for selecting the initial centers of mass.
This example loads an input set of points and computes 4 clusters using kmeans algorithm using the random strategy for selecting the initial centers of mass.
This example loads an input set of points and instantiates the Kmeans module for computing an input number of clusters using the default (random) strategy for selecting the initial centers of mass.
This package offers programs for computing a clustering of a set of points in an Euclidean space, using kmeans (sblclusterkmeanseuclid.exe) or the Morse theory based strategy (sblclusterMTBeuclid.exe). The last program is a simplified version of programs provided by the package Morse_theory_based_analyzer. For a more complete analysis of a cloud of points, please use the package Morse_theory_based_analyzer.
In addition, this package provides instantitations of the engines. We briefly review these algorithms, and illustrate them on the following dataset. toto
A 2D point set defined by a mixture of 5 gaussians (1000 points in total) 
Example 1: using kmeans with random seeding, computing a clustering with 4 clusters of . Then, visualizing the corresponding clusters:
sblclusterkmeanseuclid.exe kmeansk 4 pointsfile pointsN200d50.txt kmeansselector=random sblclustersdisplay.py f pointsN200d50.txt c sblclusterkmeanseuclid__clusters.txt C sblclusterkmeanseuclid__centers.txt
Example 2: using kmeans with the socalled ++ initialization, computing a clustering with 4 clusters of the same dataset
sblclusterkmeanseuclid.exe kmeansk 4 pointsfile pointsN200d50.txt kmeansselector=plusplus sblclustersdisplay.py f pointsN200d50.txt c sblclusterkmeanseuclid__clusters.txt C sblclusterkmeanseuclid__centers.txt
(A) The result of kmeans with four clusters. Note that the top right cluster has been split into two clusters, due to the initialization with 2 centers in that region. (B) The correct clustering obtained with the plus_plus seeding. NB: the red points in the middle of the clusters are the centers. 
While the details are provided in the package Morse_theory_based_analyzer, the following comments are in order:
Example 3: using Tomato at a persistence threshold of p=0.1, computing a clustering of the same dataset:
sblclusterMTBeuclid.exe pointsfile pointsN200d50.txt numneighbors=10 persistencethreshold=.1 sblclustersdisplay.py f pointsN200d50.txt c sblclusterMTBeuclid__clusters.txt p sblclusterMTBeuclid__persistence_diagram.txt
(A) The clustering into four clusters, corresponding to four local maxima of the density estimate from the points. (B) The persistence diagram showing the persistences of local minima, from which one gets that there are four main local maxima, all located at the right bottom dot in the diagram. Inspecting the text file providing the persistences, it actually turns out that there are four superimposed points on the persistence diagram, corresponding to the four clusters. 