Structural Bioinformatics Library
Template C++ / Python API for developping structural bioinformatics applications.

Authors: R. Tetley and F. Cazals and D. Mazauric
While clustering is ubiquitous in data science, with selected methods provided in the package Cluster_engines, picking a particular clustering algorithm and tuning its parameters is in general a non trivial task.
In order to alleviate this task, this package provides methods to compare two clusterings, by computing a mapping between the clusters of these clusterings. In doing so, groups of clusters called metaclusters are formed within each clustering, and a 1to1 correspondence between these metaclusters is provided. (We note in passing that our mapping goes beyond a 1to1 matching between the clusters [92], [55] .)
The following comments are in order:
In the sequel, we formalize the clustering comparison problem in a manner answering the two needs just discussed, and document the classes provided. For comparison purposes, this package also provides an implementation of the variation of information (VI) [104] , a metric based on information theoretical concepts.
The score of a familymatching is .
Intuitively, the Dfamilymatching problem involves computing disjoint subset of nodes (clusters of clusters, or metaclusters) such that the inconsitencies are minimized (clusters which are in the same metacluster have a minimum number of divergent points).
The decision version of the problem is NPcomplete for:
The problem is NPcomplete for bipartite graphs of maximum degree and with unary weights.
There exist polynomial time algorithms for certain classes of graphs. Denoting and the number of vertices and edges of the intersection graph, the following is proved in [35] :
The implemented algorithms follows the generic design presented in [35] .
We implemented a version of the previous algorithm called (for Spanning Tree Solver):
The package is centered around the generic the algorithm discussed above.
This example loads two input clusterings and computes their 3family matching using random spanning trees or a minimum spanning tree.
This package provides an executable to compute the metaclusters from two input clusterings. It is a direct implementation of the algorithm.
Example 1: Computing the from two clusterings obtained by using the algorithm (provided in Cluster_engines). Then, visualizing the corresponding clusters and metaclusters:
#cluster the input data (contained in the .txt file) set into 5 cluster using kmeans++ sblclusterkmeanseuclid.exe kmeansk 5 pointsfile pointsN1000d20.txt kmeansselector=plusplus o #display the clusters sblclustersdisplay.py f pointsN1000d20.txt c sblclusterkmeanseuclid__f_pointsN1000d20__mode_plusplus__iter_max_10__k_5__clusters.txt C sblclusterkmeanseuclid__f_pointsN1000d20__mode_plusplus__iter_max_10__k_5__centers.txt #cluster the input data (contained in the .txt file) set into 20 cluster using kmeans++ sblclusterkmeanseuclid.exe kmeansk 20 pointsfile pointsN1000d20.txt kmeansselector=plusplus  o #display the clusters sblclustersdisplay.py f pointsN1000d20.txt c sblclusterkmeanseuclid__f_pointsN1000d20__mode_plusplus__iter_max_10__k_20__clusters.txt C sblclusterkmeanseuclid__f_pointsN1000d20__mode_plusplus__iter_max_10__k_20__centers.txt #compute the dfamilymatching on the two previous clusterings with D=4 and 10000 iterations sbldfamilymatching.exe f sblclusterkmeanseuclid__f_pointsN1000d20__mode_plusplus__iter_max_10__k_5__clusters.txt f sblclusterkmeanseuclid__f_pointsN1000d20__mode_plusplus__iter_max_10__k_20__clusters.txt diameterconstraint 4 numiterations 10000 #create a clustering file from the metacluster mapping and the original cluster file sblmapmetaclusters.py c sblclusterkmeanseuclid__f_pointsN1000d20__mode_plusplus__iter_max_10__k_5__clusters.txt m sbldfamilymatching__clust1metaclusters.txt o metaclusters.txt #display the metaclusters sblclustersdisplay.py c metaclusters.txt f pointsN1000d20.txt #compute the dfamilymatching but add a threshold on the intersection graph edge weights sbldfamilymatching.exe f sblclusterkmeanseuclid__f_pointsN1000d20__mode_plusplus__iter_max_10__k_5__clusters.txt f sblclusterkmeanseuclid__f_pointsN1000d20__mode_plusplus__iter_max_10__k_20__clusters.txt diameterconstraint 4 numiterations 10000 intersectionthreshold 10 #create a clustering file sblmapmetaclusters.py c sblclusterkmeanseuclid__f_pointsN1000d20__mode_plusplus__iter_max_10__k_5__clusters.txt m sbldfamilymatching__clust1metaclusters.txt o metaclusters.txt #display the meta clusters sblclustersdisplay.py c metaclusters.txt f pointsN1000d20.txt
(A) The result of with five clusters. (B) The result of with twenty clusters. (C) The result of the algorithm after 10000 iterations. Note that the algorithm is sensitive to outliers, this results in returning a unique metacluster. (D) The result of the algorithm after 10000 iterations with an intersection threshold of 10. By using the –intersectionthreshold option, we can specify a threshold for the edge weights under which the corresponding edges will not be added to the graph. This allows to correctly identify the "correct" metaclusters for this example. 
Example 2: Computing the variation of information from two previous clusterings and comparing it to the score for various values of . We use normalized versions of and our score ( ):
#compute the variation of information: sblVI.exe f sblclusterkmeanseuclid__f_pointsN1000d20__mode_plusplus__iter_max_10__k_5__clusters.txt f sblclusterkmeanseuclid__f_pointsN1000d20__mode_plusplus__iter_max_10__k_20__clusters.txt #compute the dfamilymatching on the two previous clusterings with D=1 and 10000 iterations sbldfamilymatching.exe f sblclusterkmeanseuclid__f_pointsN1000d20__mode_plusplus__iter_max_10__k_5__clusters.txt f sblclusterkmeanseuclid__f_pointsN1000d20__mode_plusplus__iter_max_10__k_20__clusters.txt diameterconstraint 1 numiterations 10000 #compute the dfamilymatching on the two previous clusterings with D=2 and 10000 iterations sbldfamilymatching.exe f sblclusterkmeanseuclid__f_pointsN1000d20__mode_plusplus__iter_max_10__k_5__clusters.txt f sblclusterkmeanseuclid__f_pointsN1000d20__mode_plusplus__iter_max_10__k_20__clusters.txt diameterconstraint 2 numiterations 10000 #compute the dfamilymatching on the two previous clusterings with D=3 and 10000 iterations sbldfamilymatching.exe f sblclusterkmeanseuclid__f_pointsN1000d20__mode_plusplus__iter_max_10__k_5__clusters.txt f sblclusterkmeanseuclid__f_pointsN1000d20__mode_plusplus__iter_max_10__k_20__clusters.txt diameterconstraint 3 numiterations 10000 #compute the dfamilymatching on the two previous clusterings with D=4 and 10000 iterations sbldfamilymatching.exe f sblclusterkmeanseuclid__f_pointsN1000d20__mode_plusplus__iter_max_10__k_5__clusters.txt f sblclusterkmeanseuclid__f_pointsN1000d20__mode_plusplus__iter_max_10__k_20__clusters.txt diameterconstraint 4 numiterations 10000
Our comparison scheme rapidly converges to 0 as most smaller clusters of the second clustering can be aggregated into one of the bigger clusters of the first clustering when .
The reader is referred to [35] for a detailed comparison of VI against the scores yielded by our comparison strategy.