Structural Bioinformatics Library
Template C++ / Python API for developing structural bioinformatics applications.
User Manual

D_family_matching

Authors: R. Tetley and F. Cazals and D. Mazauric

Introduction

While clustering is ubiquitous in data science, with selected methods provided in the package Cluster_engines, picking a particular clustering algorithm and tuning its parameters is in general a non-trivial task.

In order to alleviate this task, this package provides methods to compare two clusterings, by computing a mapping between the clusters of these clusterings. In doing so, groups of clusters called meta-clusters are formed within each clustering, and a 1-to-1 correspondence between these meta-clusters is provided. (We note in passing that our mapping goes beyond a 1-to-1 matching between the clusters [92], [55].)

The following comments are in order:

  • It has been noticed [134] that the right number of clusters in a dataset often depends on the scale at which the dataset is inspected. In this package, we provide algorithms parameterized by a scale parameter $D$, namely the maximum diameter of the subgraphs connecting clusters to define meta-clusters.
  • Various cluster comparison measures have been proposed, and we refer the reader to [35] for a thorough review of these. The main drawbacks of these methods are that (i) they typically return a single number qualifying the coherence between the clusterings, without any mapping between the clusters, and (ii) they do not accommodate any scale parameter.

In the sequel, we formalize the clustering comparison problem in a manner answering the two needs just discussed, and document the classes provided. For comparison purposes, this package also provides an implementation of the variation of information (VI) [104], a metric based on information theoretical concepts.

Algorithms

Terminology and pre-requisites

Consider a dataset $P$ consisting of $t$ observations (data points or points for short). A clustering is a partition of the input dataset into disjoint subsets called clusters.


A partial clustering is a collection of clusters whose union is strictly contained in the input dataset (e.g. some outliers have been removed).


Consider a dataset $P$ and two partial clusterings $F = \{F_{1}, \ldots, F_{r} \}$ and $F' = \{F'_{1}, \ldots, F'_{r'} \}$ of $P$. The edge-weighted intersection graph $G = (U, U', E, w)$ is then constructed as follows. The set $U = \{u_{1}, \dots, u_{r}\}$ corresponds to the clustering $F$. To each vertex $u_{i}$, we associate the set $F_{i} \in F$. The set $U' = \{u'_{1}, \dots, u'_{r'}\}$ corresponds to the clustering $F'$. To each vertex $u'_{i}$, we associate the set $F'_{i} \in F'$. The set of edges of $G$ is $E = \{\{u_{i},u'_{j}\} \mid F_{i} \cap F'_{j} \neq \emptyset, 1 \leq i \leq r, 1 \leq j \leq r'\}$. The weight of any edge $e = \{u_{i}, u'_{j}\} \in E$ is $w_{e} = |F_{i} \cap F'_{j}|$. The number of vertices and edges of the intersection graph are respectively denoted $n$ and $m$.
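
To make the construction concrete, the following minimal sketch builds such an edge-weighted intersection graph with the Boost Graph Library. The types and the function name are illustrative placeholders (not the SBL classes documented below); each clustering is assumed to be given as a vector of clusters, a cluster being the set of indices of the points it contains.

#include <boost/graph/adjacency_list.hpp>
#include <algorithm>
#include <iterator>
#include <set>
#include <vector>

// Illustrative types, not the SBL classes.
typedef std::set<unsigned> Simple_cluster;
typedef std::vector<Simple_cluster> Simple_clustering;
typedef boost::property<boost::edge_weight_t, int> Weight;
typedef boost::adjacency_list<boost::vecS, boost::vecS, boost::undirectedS,
                              boost::no_property, Weight> Simple_intersection_graph;

// Build G = (U, U', E, w): vertices 0..r-1 stand for F, vertices r..r+r'-1 for F';
// an edge {u_i, u'_j} is added whenever F_i and F'_j intersect, weighted by |F_i inter F'_j|.
Simple_intersection_graph make_intersection_graph(const Simple_clustering& F,
                                                  const Simple_clustering& Fp)
{
  Simple_intersection_graph g(F.size() + Fp.size());
  for (std::size_t i = 0; i < F.size(); ++i)
    for (std::size_t j = 0; j < Fp.size(); ++j)
    {
      std::vector<unsigned> common;
      std::set_intersection(F[i].begin(), F[i].end(), Fp[j].begin(), Fp[j].end(),
                            std::back_inserter(common));
      if (!common.empty())
        boost::add_edge(i, F.size() + j, Weight(static_cast<int>(common.size())), g);
    }
  return g;
}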


Consider two clusterings $F = \{F_{1}, \ldots, F_{r} \}$ and $F' = \{F'_{1}, \ldots, F'_{r'} \}$ of an identical dataset $P$. To a given cluster $F_i$ we associate $\mathcal{P}(i) =\frac{t_i}{t}$ where $t_i = |F_i|$ and $t = |P|$; likewise, $\mathcal{P}'(j) = \frac{|F'_j|}{t}$. The entropy associated with clustering $F$ is defined as $\mathcal{H}(F) = -\sum\limits_{i=1}^r\mathcal{P}(i)\log{\mathcal{P}(i)}$ (and similarly for $\mathcal{H}(F')$). For $F_i \in F$ and $F'_j \in F'$, define $\mathcal{P}(i, j) = \frac{|F_i \cap F'_j|}{t}$. The mutual information between $F$ and $F'$ is defined as $\mathcal{I}(F, F') = \sum\limits_{i=1}^r\sum\limits_{j=1}^{r'}\mathcal{P}(i, j)\log{\frac{\mathcal{P}(i, j)}{\mathcal{P}(i)\mathcal{P}'(j)}}$. From this, we define the variation of information between $F$ and $F'$ as $\mathcal{VI}(F, F') = \mathcal{H}(F) + \mathcal{H}(F') - 2\mathcal{I}(F, F')$.
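
For reference, the variation of information can be computed directly from two label vectors, as in the following minimal sketch written against the definitions above. It is an illustrative stand-alone function, not the SBL's T_VI_computer class described later.

#include <cmath>
#include <map>
#include <utility>
#include <vector>

// Illustrative sketch: labels_F[i] (resp. labels_Fp[i]) is the cluster index of point i in F (resp. F').
double variation_of_information(const std::vector<int>& labels_F, const std::vector<int>& labels_Fp)
{
  const double t = static_cast<double>(labels_F.size());
  std::map<int, double> p;                        // P(i)   = |F_i| / t
  std::map<int, double> pp;                       // P'(j)  = |F'_j| / t
  std::map<std::pair<int, int>, double> pj;       // P(i,j) = |F_i inter F'_j| / t
  for (std::size_t k = 0; k < labels_F.size(); ++k)
  {
    p[labels_F[k]]   += 1.0 / t;
    pp[labels_Fp[k]] += 1.0 / t;
    pj[std::make_pair(labels_F[k], labels_Fp[k])] += 1.0 / t;
  }
  double H = 0, Hp = 0, I = 0;
  for (std::map<int, double>::const_iterator it = p.begin(); it != p.end(); ++it)
    H -= it->second * std::log(it->second);
  for (std::map<int, double>::const_iterator it = pp.begin(); it != pp.end(); ++it)
    Hp -= it->second * std::log(it->second);
  for (std::map<std::pair<int, int>, double>::const_iterator it = pj.begin(); it != pj.end(); ++it)
    I += it->second * std::log(it->second / (p[it->first.first] * pp[it->first.second]));
  return H + Hp - 2.0 * I;                        // VI(F, F') = H(F) + H(F') - 2 I(F, F')
}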


Problem statement

Let $D \in \mathbb{N}^{+}$. Let $G = (V,E,w)$ be an intersection graph. A $D$-family-matching for $G$ is a family $\mathcal{S} = \{S_{1}, \ldots, S_{k}\}$, $k \geq 1$, such that, for every $i \in \{1, \ldots, k\}$: $S_{i} \subseteq V$, $S_{i} \neq \emptyset$, and the graph $G[S_{i}]$ induced by the set of nodes $S_{i}$ has diameter at most $D$; moreover, $S_{i} \cap S_{j} = \emptyset$ for every $i \neq j$.


The score $\Phi(\mathcal{S})$ of a $D$-family-matching $\mathcal{S}$ is $\Phi(\mathcal{S}) = \sum\limits_{i=1}^{k}~ \sum\limits_{e \in E(G[S_{i}])} w_{e}$.

Let $\mathcal{S}_{D}(G)$ be the set of all $D$-family-matchings for $G$. Let $D \in \mathbb{N}^{+}$. Given an intersection graph $G$, the $D$-family-matching problem consists in computing $\Phi_{D}(G) = \max_{\mathcal{S} \in \mathcal{S}_{D}(G)} \Phi(\mathcal{S})$.


Intuitively, the D-family-matching problem involves computing disjoint subsets of nodes (clusters of clusters, or meta-clusters) such that the inconsistencies are minimized: clusters which are in the same meta-cluster have a minimum number of divergent points.
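
As a small illustration (with made-up numbers): suppose $F$ has two clusters, $F'$ has three, and the intersection graph contains the edges $\{u_1,u'_1\}$, $\{u_1,u'_2\}$, $\{u_2,u'_3\}$ and $\{u_2,u'_2\}$ of respective weights 40, 35, 20 and 5. For $D=2$, the family $\mathcal{S} = \{\{u_1,u'_1,u'_2\}, \{u_2,u'_3\}\}$ is a valid $2$-family-matching: each induced subgraph has diameter at most 2, and its score is $\Phi(\mathcal{S}) = 40+35+20 = 95$. The edge $\{u_2,u'_2\}$ joins two different meta-clusters, so its 5 points are the divergent points left unaccounted for.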

Problem hardness

The decision version of the $D$-family-matching problem is NP-complete for:

  • bipartite graphs of maximum degree $3$;
  • bipartite graphs of maximum degree $4$, even if the maximum weight is constant.

The $2$-family-matching problem is NP-complete for bipartite graphs of maximum degree $3$ and with unary weights.

There exist polynomial time algorithms for certain classes of graphs. Denoting $n$ and $m$ the number of vertices and edges of the intersection graph, the following is proved in [35]:

  • For a tree $T$, the $D$-family-matching problem can be solved in $O(D^2\Delta^2n)$ time (where $\Delta$ is the maximum degree of $T$).
  • For a path $P$, the $D$-family-matching problem can be solved in $O(Dn)$ time.
  • For a cycle $C$ (or any graph of maximum degree $2$), the $D$-family-matching problem can be solved in $O(D^2n)$ time.

Algorithms

The implemented algorithms follow the generic design presented in [35].

Generic algorithm for the $D$-family-matching problem:

Require: an intersection graph $G = (V,E,w)$, an integer $D \geq 1$, a property $\Pi$, a spanning tree generator $\mathcal{R}$, and an algorithm $\mathcal{A}$.
1. $\mathcal{M} := \emptyset$, $\lambda := 0$
2. while $\neg~\Pi(\mathcal{M})$ do
3.     $\lambda := \lambda + 1$; compute the spanning tree $T^{\lambda} := \mathcal{R}(G,\lambda)$
4.     compute $\mathcal{S}^{\lambda}$ by using Algorithm $\mathcal{A}(G,T^{\lambda},D)$; $\mathcal{M} := \mathcal{M} \cup \{\mathcal{S}^{\lambda}\}$
5. end while
6. return $\mathcal{S} \in \mathcal{M}$ of maximum score
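
The loop structure can be sketched in C++ as follows. The type and function names (TreeGenerator for $\mathcal{R}$, TreeSolver for $\mathcal{A}$, StopProperty for $\Pi$, and the score() accessor on a solution) are illustrative placeholders chosen for this sketch, not the SBL API.

#include <cstddef>
#include <vector>

// Illustrative sketch of the generic loop; all names are placeholders, not the SBL API.
// R(g, lambda) returns a spanning tree, A(g, tree, D) returns a D-family-matching on that tree,
// Pi(M) decides when to stop, and Solution::score() returns the score Phi of a solution.
template <class Graph, class Tree, class Solution,
          class TreeGenerator, class TreeSolver, class StopProperty>
Solution generic_D_family_matching(const Graph& g, unsigned D,
                                   TreeGenerator R, TreeSolver A, StopProperty Pi)
{
  std::vector<Solution> M;
  unsigned lambda = 0;
  while (!Pi(M))
  {
    ++lambda;
    Tree t = R(g, lambda);     // e.g. a maximum spanning tree, then random spanning trees
    M.push_back(A(g, t, D));   // polynomial-time solver on the tree
  }
  // return the solution of maximum score
  std::size_t best = 0;
  for (std::size_t i = 1; i < M.size(); ++i)
    if (M[i].score() > M[best].score())
      best = i;
  return M[best];
}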

We implemented a version of the previous algorithm called STS (for Spanning Tree Solver):

  • the spanning tree generator $\mathcal{R}$ returns a maximum spanning tree, or a random spanning tree;
  • the property $\Pi(\mathcal{M})$ returns true once a solution has been computed on the maximum spanning tree as well as on $n_i$ distinct random spanning trees (for a given $n_i$), or once the $p$ first distinct solutions have scores contained in an interval of size $\epsilon$;
  • $\mathcal{A}$ is the polynomial time algorithm on trees described in [35] with an additional step: edges for which both extremities belong to the same meta-cluster (and previously unaccounted for) are added to the said meta-cluster.

Implementation and functionalities

Design

The package is centered around the generic algorithm discussed above.

  • We provide two module classes (for modules, see Module_base ): SBL::Modules::T_RST_MST_D_Family_matching_module< ModuleTraits > packages the STS algorithm in a module; SBL::Modules::T_D_Family_matching_module< Graph , TSolutionComputer > lets the user decide which solution computer to use (a custom one, or any of those provided).
  • The class SBL::CADS::T_VI_computer< Clustering > provides a functor for computing the variation of information.

Functionalities

  • Generic input: two text files giving, for each point, the index of its cluster (as output by Cluster_engines).
  • D-family-matching output: for each clustering, a .txt file mapping each original cluster to its meta-cluster, a .dot file containing the intersection graph as well as the solution highlighted in red, and a .xml file containing various statistics on all the computed solutions (number of meta-clusters, score, size, etc.).
  • Visualization: we also provide scripts for displaying the meta-clusters on the original point clouds.


Examples

This example loads two input clusterings and computes their 3-family-matching using random spanning trees or a maximum spanning tree.

#include <SBL/CADS/Solution_computers.hpp>
#include <SBL/IO/Clustering_loader_with_graph.hpp>
#include <cstdlib>
#include <iostream>
#include <vector>

// An individual element of a cluster
struct Cluster_element
{
  unsigned m_element_number;

  Cluster_element() {}
  Cluster_element(unsigned number) : m_element_number(number) {}

  unsigned element_number()       { return this->m_element_number; }
  unsigned element_number() const { return this->m_element_number; }

  template <class Archive>
  void serialize(Archive& ar, const unsigned int version)
  {
    ar & boost::serialization::make_nvp("element", this->m_element_number);
  }
};

// A cluster type
typedef std::vector<Cluster_element> Cluster;
// A clustering
typedef std::vector<Cluster> Clustering;

// Property for the vertices of the graph
struct Vertex_property
{
  // cluster associated with the vertex
  Cluster* cluster;
  // index of the cluster (in the given clustering)
  unsigned cluster_index;
  // index of the meta-cluster
  int meta_index;
  // part of the graph which this vertex belongs to
  unsigned part;
};

// The weights correspond to the size of the cluster intersections
typedef boost::property<boost::edge_weight_t, int> Edge_weight_property;

// Weighted graph of intersections
typedef boost::adjacency_list<boost::vecS,
                              boost::vecS,
                              boost::undirectedS,
                              Vertex_property,
                              Edge_weight_property> Intersection_graph;

typedef SBL::IO::T_Clustering_loader<Intersection_graph, Clustering> Clustering_loader;

// Solution computers and the types they expose
typedef SBL::CADS::T_MST_Solution_computer<Intersection_graph> MST_Solution_computer;
typedef SBL::CADS::T_RST_Solution_computer<Intersection_graph> RST_Solution_computer;
typedef MST_Solution_computer::Solution_type          MST_Solution_type;
typedef RST_Solution_computer::Solution_type          RST_Solution_type;
typedef MST_Solution_computer::Meta_clusters_iterator MST_meta_cluster_iterator;
typedef MST_Solution_computer::Clusters_iterator      MST_cluster_iterator;
typedef RST_Solution_computer::Meta_clusters_iterator RST_meta_cluster_iterator;
typedef RST_Solution_computer::Clusters_iterator      RST_cluster_iterator;
int main(int argc, char *argv[])
{
  if (argc != 3)
  {
    std::cout << "Provide two clusterings for comparison!" << std::endl;
    std::exit(EXIT_FAILURE);
  }

  // Load the clusterings and build the intersection graph (done within the loader class)
  Clustering_loader loader;
  loader.add_file_name(argv[1]);
  loader.add_file_name(argv[2]);
  loader.load(true, std::cout);
  Intersection_graph* g = loader.get_graph();

  // Build the MST solution computer and compute a solution for D = 3
  MST_Solution_type mst_solution;
  MST_Solution_computer mst(g);
  unsigned mst_score = mst(3, mst_solution);

  // Build the RST solution computer and compute a solution for D = 3
  RST_Solution_type rst_solutions;
  RST_Solution_computer rst(g);
  unsigned rst_score = rst(3, rst_solutions);

  std::cout << "MST 3-family matching: \nScore: " << mst_score << std::endl;
  std::cout << "Meta clusters: " << std::endl;

  // Showcase the iterators: for each meta-cluster, list the clusters it contains
  MST_meta_cluster_iterator mm_it;
  MST_cluster_iterator mc_it;
  for (mm_it = mst.meta_clusters_begin(); mm_it != mst.meta_clusters_end(); mm_it++)
  {
    mc_it = mst.clusters_in_meta_cluster_begin(*mm_it);
    std::cout << *mm_it << " <==> {" << *mc_it;
    mc_it++;
    for (; mc_it != mst.clusters_in_meta_cluster_end(*mm_it); mc_it++)
      std::cout << ", " << *mc_it;
    std::cout << "}" << std::endl;
  }

  std::cout << "RST 3-family matching: \nScore: " << rst_score << std::endl;
  std::cout << "Meta clusters: " << std::endl;
  RST_meta_cluster_iterator rm_it;
  RST_cluster_iterator rc_it;
  for (rm_it = rst.meta_clusters_begin(); rm_it != rst.meta_clusters_end(); rm_it++)
  {
    rc_it = rst.clusters_in_meta_cluster_begin(*rm_it);
    std::cout << *rm_it << " <==> {" << *rc_it;
    rc_it++;
    for (; rc_it != rst.clusters_in_meta_cluster_end(*rm_it); rc_it++)
      std::cout << ", " << *rc_it;
    std::cout << "}" << std::endl;
  }
  return 0;
}
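
Assuming the SBL and Boost headers are available on the include path, this example can be compiled with any recent C++ compiler and run with two clustering text files (in the format output by Cluster_engines) as arguments.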

Applications

This package provides an executable to compute the meta-clusters from two input clusterings. It is a direct implementation of the STS algorithm.

Example 1: Computing the $D$-family-matching from two clusterings obtained with the k-means++ algorithm (provided in Cluster_engines), then visualizing the corresponding clusters and meta-clusters:

#cluster the input data set (contained in the .txt file) into 5 clusters using k-means++
sbl-cluster-k-means-euclid.exe --k-means-k 5 --points-file points-N1000-d20.txt --k-means-selector=plusplus -o

#display the clusters
sbl-clusters-display.py -f points-N1000-d20.txt -c sbl-cluster-k-means-euclid__f_points-N1000-d20__mode_plusplus__iter_max_10__k_5__clusters.txt -C sbl-cluster-k-means-euclid__f_points-N1000-d20__mode_plusplus__iter_max_10__k_5__centers.txt

#cluster the input data set (contained in the .txt file) into 20 clusters using k-means++
sbl-cluster-k-means-euclid.exe --k-means-k 20 --points-file points-N1000-d20.txt --k-means-selector=plusplus -o

#display the clusters
sbl-clusters-display.py -f points-N1000-d20.txt -c sbl-cluster-k-means-euclid__f_points-N1000-d20__mode_plusplus__iter_max_10__k_20__clusters.txt -C sbl-cluster-k-means-euclid__f_points-N1000-d20__mode_plusplus__iter_max_10__k_20__centers.txt

#compute the d-family-matching on the two previous clusterings with D=4 and 10000 iterations
sbl-d-family-matching.exe -f sbl-cluster-k-means-euclid__f_points-N1000-d20__mode_plusplus__iter_max_10__k_5__clusters.txt -f sbl-cluster-k-means-euclid__f_points-N1000-d20__mode_plusplus__iter_max_10__k_20__clusters.txt --diameter-constraint 4 --num-iterations 10000

#create a clustering file from the meta-cluster mapping and the original cluster file
sbl-map-meta-clusters.py -c sbl-cluster-k-means-euclid__f_points-N1000-d20__mode_plusplus__iter_max_10__k_5__clusters.txt -m sbl-d-family-matching__clust1-meta-clusters.txt -o meta-clusters.txt

#display the meta-clusters
sbl-clusters-display.py -c meta-clusters.txt -f points-N1000-d20.txt 

#compute the d-family-matching but add a threshold on the intersection graph edge weights
sbl-d-family-matching.exe -f sbl-cluster-k-means-euclid__f_points-N1000-d20__mode_plusplus__iter_max_10__k_5__clusters.txt -f sbl-cluster-k-means-euclid__f_points-N1000-d20__mode_plusplus__iter_max_10__k_20__clusters.txt --diameter-constraint 4 --num-iterations 10000 --intersection-threshold 10

#create a clustering file
sbl-map-meta-clusters.py -c sbl-cluster-k-means-euclid__f_points-N1000-d20__mode_plusplus__iter_max_10__k_5__clusters.txt -m sbl-d-family-matching__clust1-meta-clusters.txt -o meta-clusters.txt

#display the meta clusters
sbl-clusters-display.py -c meta-clusters.txt -f points-N1000-d20.txt 

(Figure: four panels (A)-(D), described in the caption below.)

(A) The result of k-means++ with five clusters. (B) The result of k-means++ with twenty clusters. (C) The result of the STS algorithm after 10000 iterations; note that the algorithm is sensitive to outliers, which here results in a single meta-cluster being returned. (D) The result of the STS algorithm after 10000 iterations with an intersection threshold of 10. The --intersection-threshold option specifies a threshold on the edge weights: edges whose weight falls below it are not added to the intersection graph. This makes it possible to identify the "correct" meta-clusters for this example.

Example 2: Computing the variation of information from the two previous clusterings and comparing it to the $D$-family-matching score for various values of $D$. We use normalized versions of $\mathcal{VI}$ and of our score ($s_{\mathcal{VI}} = \mathcal{VI} / \log{t}$, $s_{\Phi} = 1 - \Phi/t$):
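
(As an illustration of the normalization, with made-up numbers: for $t = 1000$ points, a hypothetical raw score of $\Phi = 920$ gives $s_{\Phi} = 1 - 920/1000 = 0.08$.)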

#compute the variation of information:
sbl-VI.exe -f sbl-cluster-k-means-euclid__f_points-N1000-d20__mode_plusplus__iter_max_10__k_5__clusters.txt -f sbl-cluster-k-means-euclid__f_points-N1000-d20__mode_plusplus__iter_max_10__k_20__clusters.txt 

#compute the d-family-matching on the two previous clusterings with D=1 and 10000 iterations
sbl-d-family-matching.exe -f sbl-cluster-k-means-euclid__f_points-N1000-d20__mode_plusplus__iter_max_10__k_5__clusters.txt -f sbl-cluster-k-means-euclid__f_points-N1000-d20__mode_plusplus__iter_max_10__k_20__clusters.txt --diameter-constraint 1 --num-iterations 10000

#compute the d-family-matching on the two previous clusterings with D=2 and 10000 iterations
sbl-d-family-matching.exe -f sbl-cluster-k-means-euclid__f_points-N1000-d20__mode_plusplus__iter_max_10__k_5__clusters.txt -f sbl-cluster-k-means-euclid__f_points-N1000-d20__mode_plusplus__iter_max_10__k_20__clusters.txt --diameter-constraint 2 --num-iterations 10000

#compute the d-family-matching on the two previous clusterings with D=3 and 10000 iterations
sbl-d-family-matching.exe -f sbl-cluster-k-means-euclid__f_points-N1000-d20__mode_plusplus__iter_max_10__k_5__clusters.txt -f sbl-cluster-k-means-euclid__f_points-N1000-d20__mode_plusplus__iter_max_10__k_20__clusters.txt --diameter-constraint 3 --num-iterations 10000

#compute the d-family-matching on the two previous clusterings with D=4 and 10000 iterations
sbl-d-family-matching.exe -f sbl-cluster-k-means-euclid__f_points-N1000-d20__mode_plusplus__iter_max_10__k_5__clusters.txt -f sbl-cluster-k-means-euclid__f_points-N1000-d20__mode_plusplus__iter_max_10__k_20__clusters.txt --diameter-constraint 4 --num-iterations 10000

$s_{\mathcal{VI}}$: 0.45
$s_{\Phi}, D=1$: 0.33
$s_{\Phi}, D=2$: 0.08
$s_{\Phi}, D=3$: 0.08
$s_{\Phi}, D=4$: 0

The normalized score $s_{\Phi}$ rapidly converges to 0, as most of the smaller clusters of the second clustering can be aggregated into one of the bigger clusters of the first clustering once $D \geq 2$.

The reader is referred to [35] for a detailed comparison of VI against the scores yielded by our comparison strategy.