Structural Bioinformatics Library
Template C++ / Python API for developing structural bioinformatics applications.
User Manual

Cluster_engines

Authors: F. Cazals and T. Dreyfus

Introduction

We provide a generic framework for clustering algorithms, together with specific implementations corresponding to various algorithms, namely

  • k-means. Given a pre-defined number of centers k, k-means partitions the input dataset into clusters, one per center, so as to minimize the sum of squared distances of the samples to their nearest center. We provide in particular the smart seeding procedure from [9] .
  • Tomato. Tomato is a clustering algorithm defining clusters from local maxima of an estimated density. The method uses topological persistence to identify the most significant clusters, as detailed in [35] .

Implementation

Terminology

Consider a dataset $ P $ consisting of $ n $ observations (data points or points for short). A clustering is a partition of the input dataset into disjoint subsets called clusters.


A partial clustering is a collection of clusters whose union is strictly contained in the input dataset (e.g. some outliers have been removed).


Design

The package is designed around the module class SBL::Modules::T_Cluster_engine_module : each engine is implemented in its own base file, but all engines share a common design, so that a single generic cluster module can drive any of them. The two main engines are:

  • K-means, implemented by SBL::GT::T_Cluster_engine_k_means< PointType , VectorType , Distance > . The parameter PointType is a representation of a geometric point, while the parameter VectorType is a representation of a geometric vector supporting operations such as addition and division; the parameter Distance is a functor computing the distance between two points. The CGAL library naturally offers all these types, e.g. using the class CGAL::Cartesian_d;

  • Morse theory based algorithm (see the package Morse_theory_based_analyzer for details), implemented by SBL::GT::T_Cluster_engine_Morse_theory_based< PointType , Distance , TNNQuery , TGetDensity > . The parameters PointType and Distance have the same meaning as above. The parameter TNNQuery is the spatial search engine used for building the nearest neighbor graph (NNG); it is itself a template, parametrized by a distance functor computing distances between vertices of the NNG (this distance data structure is internal to the clustering engine). Finally, the parameter TGetDensity is a functor computing the density of a vertex of the NNG; it is also a template, parametrized by the type of the NNG (which is defined internally to the clustering engine).

Functionality

  1. Generic input: a file containing points, complying with the Point_d concept (each line lists the dimension of the point followed by its coordinates).

  2. Generic output: a text file giving, for each point, the index of its cluster (0..n-1 convention).

  3. Visualization: we also provide a script displaying the clusters.


Examples

The following examples show how to use the k-means and Morse theory based engines, and how to use the provided module.

Using k-means

This example loads an input set of points and computes 4 clusters with the k-means algorithm, using the k-means++ ("plusplus") strategy for selecting the initial centers of mass.

#include <fstream>
#include <iostream>
#include <vector>
#include <CGAL/Cartesian_d.h>
#include <SBL/GT/Cluster_engine_k_means.hpp>
#include <SBL/Models/Points_d_file_loader.hpp>

typedef CGAL::Cartesian_d<double> K;
typedef K::Squared_distance_d Distance;
typedef SBL::GT::T_Cluster_engine_k_means<K::Point_d, K::Vector_d, Distance> K_means;
// NB: the exact name of the loader type is assumed from the included header.
typedef SBL::Models::T_Points_d_file_loader<K::Point_d> Loader;

int main(int argc, char *argv[])
{
  Loader loader;
  loader.add_file_name(argv[1]);
  loader.load(true);

  // Run k-means with k = 4 and the k-means++ ("plusplus") seeding.
  K_means km;
  km.set_k(4);
  km.set_mode("plusplus");
  std::vector<std::size_t> clusters;
  km(loader.get_data().begin(), loader.get_data().end(), std::back_inserter(clusters));

  // Dump the cluster index of each point.
  std::ofstream out_clusters("clusters.txt");
  km.dump_clusters(clusters.begin(), clusters.end(), out_clusters);
  out_clusters.close();

  // Dump the centers of mass, and report the score
  // (sum of squared distances of the points to their nearest center).
  std::ofstream out_centers("centers.txt");
  std::vector<K_means::Center_of_mass> centers;
  km.get_centers_of_mass(loader.get_data().begin(), loader.get_data().end(), clusters.begin(), clusters.end(), std::back_inserter(centers));
  km.dump_centers(centers.begin(), centers.end(), out_centers);
  out_centers.close();
  std::cout << "Score : " << km.get_score(loader.get_data().begin(), loader.get_data().end(), centers, clusters) << std::endl;
  return 0;
}

Using Morse theory based strategy

This example loads an input set of points and computes the clusters using the Morse theory based (Tomato) algorithm, building a nearest neighbor graph with 8 neighbors per point.

#include <fstream>
#include <vector>
#include <CGAL/Cartesian_d.h>
#include "Morse_functions.hpp"
#include <SBL/GT/ANN_metric_space_proximity_tree.hpp>
#include <SBL/GT/ANN_meta.hpp>
#include <SBL/GT/Cluster_engine_Morse_theory_based.hpp>
#include <SBL/Models/Points_d_file_loader.hpp>

typedef CGAL::Cartesian_d<double> K;
typedef K::Squared_distance_d Distance;

// Spatial search engine used for building the nearest neighbor graph.
template <class DistanceFunction>
class T_NN_query : public SBL::GT::T_ANN_meta<SBL::GT::T_ANN_metric_space_proximity_tree<DistanceFunction> >
{
  typedef SBL::GT::T_ANN_meta<SBL::GT::T_ANN_metric_space_proximity_tree<DistanceFunction> > Base;
public:
  T_NN_query(DistanceFunction& dist) : Base(dist) {}
};

// NB: T_get_density stands for the density functor defined in Morse_functions.hpp;
// its exact name, and that of the loader type, are assumed here.
typedef SBL::GT::T_Cluster_engine_Morse_theory_based<K::Point_d,
                                                     Distance,
                                                     T_NN_query,
                                                     T_get_density> Cluster_engine;
typedef SBL::Models::T_Points_d_file_loader<K::Point_d> Loader;

int main(int argc, char *argv[])
{
  Loader loader;
  loader.add_file_name(argv[1]);
  loader.load(true);

  // Build the NNG with 8 neighbors per point, then cluster.
  Cluster_engine::set_number_of_neighbors(8);
  Cluster_engine engine;
  std::vector<std::size_t> clusters;
  engine(loader.get_data().begin(), loader.get_data().end(), std::back_inserter(clusters));

  std::ofstream out_clusters("clusters.txt");
  engine.dump_clusters(clusters.begin(), clusters.end(), out_clusters);
  out_clusters.close();
  return 0;
}

Using the module

This example loads an input set of points and instantiates the K-means module for computing an input number of clusters using the default (random) strategy for selecting the initial centers of mass.

#include <cstdlib>
#include <iostream>
#include <CGAL/Cartesian_d.h>
#include <SBL/GT/Cluster_engine_k_means.hpp>
#include <SBL/Modules/Cluster_engine_module.hpp>
#include <SBL/Models/Points_d_file_loader.hpp>

struct Traits
{
  typedef CGAL::Cartesian_d<double> K;
  typedef K::Point_d Point;
  typedef K::Squared_distance_d Distance;
  // k-means engine over the above types (cf the Design section).
  typedef SBL::GT::T_Cluster_engine_k_means<Point, K::Vector_d, Distance> Cluster_engine;
};

// NB: the exact names of the module and loader types are assumed from the included headers.
typedef SBL::Modules::T_Cluster_engine_module<Traits> Cluster_engine_module;
typedef SBL::Models::T_Points_d_file_loader<Traits::Point> Loader;

int main(int argc, char *argv[])
{
  Loader loader;
  loader.add_file_name(argv[1]);
  loader.load(true);

  // The number of clusters is read from the command line.
  Traits::Cluster_engine::set_k(std::atoi(argv[2]));

  Cluster_engine_module module;
  module.get_points() = &loader.get_data();
  module.run(true, std::cout);
  module.report("example_");
  return 0;
}

Applications

This package offers programs for computing a clustering of a set of points in a Euclidean space, using k-means (sbl-cluster-k-means-euclid.exe) or the Morse theory based strategy (sbl-cluster-MTB-euclid.exe). The latter program is a simplified version of the programs provided by the package Morse_theory_based_analyzer. For a more complete analysis of a cloud of points, please use the package Morse_theory_based_analyzer.

In addition, this package provides instantiations of the engines. We briefly review these algorithms, and illustrate them on the following dataset.

A 2D point set defined by a mixture of 5 Gaussians (1000 points in total)

k-means

  • sbl-cluster-k-means-euclid.exe : a program clustering a set of D-dimensional points using the Euclidean metric and the k-means algorithm. Note that the input is a text file listing the points in the Point_d format (dimension followed by coordinates).
k-means minimizes, within each cluster, the sum of squared distances. In doing so, one uses the centroid of each cluster. Note that the centroid is distinct from the point minimizing the sum of distances to sample points, known as the Fermat–Weber point or geometric median, a problem still under scrutiny [40] . The variation of k-means using the median instead of the centroid is known as k-medians.


Example 1: using k-means with random seeding, computing a clustering with 4 clusters of the input dataset. Then, visualizing the corresponding clusters:

sbl-cluster-k-means-euclid.exe --k-means-k 4 --points-file points-N200-d50.txt --k-means-selector=random
sbl-clusters-display.py -f points-N200-d50.txt -c sbl-cluster-k-means-euclid__clusters.txt -C sbl-cluster-k-means-euclid__centers.txt

Example 2: using k-means with the so-called ++ initialization, computing a clustering with 4 clusters of the same dataset

sbl-cluster-k-means-euclid.exe --k-means-k 4 --points-file points-N200-d50.txt --k-means-selector=plusplus
sbl-clusters-display.py -f points-N200-d50.txt -c sbl-cluster-k-means-euclid__clusters.txt -C sbl-cluster-k-means-euclid__centers.txt

(A) The result of k-means with four clusters. Note that the top right cluster has been split into two clusters, due to the initialization with 2 centers in that region. (B) The correct clustering obtained with the plusplus seeding. NB: the red points in the middle of the clusters are the centers.

Tomato

  • sbl-cluster-MTB-euclid.exe : a program clustering a set of D-dimensional points using the Euclidean metric and the package Morse_theory_based_analyzer. Note that the input is a text file listing the points in the Point_d format (dimension followed by coordinates).

While the details are provided in the package Morse_theory_based_analyzer, the following comments are in order:

  • Tomato performs an analysis of the density estimate by sweeping the height function provided by the density estimate, from top to bottom. A local maximum dies when it merges with another local maximum, upon encountering a pass (an index d-1 saddle) connecting them. Consequently, the death date of a local maximum is less than its birth date. This explains why points on the persistence diagram are located below the diagonal y=x.

Example 3: using Tomato at a persistence threshold of p=0.1, computing a clustering of the same dataset:

sbl-cluster-MTB-euclid.exe --points-file points-N200-d50.txt --num-neighbors=10 --persistence-threshold=.1
sbl-clusters-display.py -f points-N200-d50.txt -c sbl-cluster-MTB-euclid__clusters.txt -p sbl-cluster-MTB-euclid__persistence_diagram.txt

(A) The clustering into four clusters, corresponding to four local maxima of the density estimate from the points. (B) The persistence diagram showing the persistences of the local maxima, from which one gets that there are four main local maxima, all located at the bottom right dot in the diagram. Inspecting the text file providing the persistences, it actually turns out that there are four superimposed points on the persistence diagram, corresponding to the four clusters.