Structural Bioinformatics Library
Template C++ / Python API for developping structural bioinformatics applications.
User Manual

D_family_matching

Authors: R. Tetley and F. Cazals and D. Mazauric

Introduction

While clustering is ubiquitous in data science, with selected methods provided in the package Cluster_engines, picking a particular clustering algorithm and tuning its parameters is in general a non trivial task.

In order to alleviate this task, this package provides methods to compare two clusterings, by computing a mapping between the clusters of these clusterings. In doing so, groups of clusters called meta-clusters are formed within each clustering, and a 1-to-1 correspondence between these meta-clusters is provided. (We note in passing that our mapping goes beyond a 1-to-1 matching between the clusters [115], [72] .)

The following comments are in order:

  • It has been noticed [166] that In fact, the right number of clusters in a dataset often depends on the scale at which the dataset is inspected. In this package, we provide algorithms which are parameterized by a scale parameter $D$, namely the maximum diameter of the subgraph connecting clusters to define meta-clusters.
  • Various cluster comparison measures have been proposed, and we refer the reader to [46] for a thorough review of these. The main drawback of these methods are that (i) they typically return a number qualifying the coherence between the clusterings, without any mapping the clusters with one another, and (ii) they do not accommodate any scale parameter.

In the sequel, we formalize the clustering comparison problem in a manner answering the two needs just discussed, and document the classes provided. For comparison purposes, this package also provides an implementation of the variation of information (VI) [126] , a metric based on information theoretical concepts.

Algorithms

Terminology and pre-requisites

Consider a dataset $P$ consisting of $t$ observations (data points or points for short). A clustering is a partition of the input dataset into disjoint subsets called clusters.


A partial clustering is a collection of clusters whose union is strictly contained in the input dataset (e.g. some outliers have been removed).


Consider a dataset $P$ and two partial clusterings $F = \{F_{1}, \ldots, F_{r} \}$ and $F' = \{F'_{1}, \ldots, F'_{r'} \}$ of $P$. The edge-weighted intersection graph $G = (U, U', E, w)$ is then constructed as follows. The set $U = \{u_{1}, \dots, u_{r}\}$ corresponds to the clustering $F$. To each vertex $u_{i}$, we associate the set $F_{i} \in F$. The set $U' = \{u'_{1}, \dots, u'_{r'}\}$ corresponds to the clustering $F'$. To each vertex $u'_{i}$, we associate the set $F'_{i} \in F$. The set of edges of $G$ is $E = \{\{u_{i},u'_{j}\} \mid F_{i} \cap F'_{j} \neq \emptyset, 1 \leq i \leq r, 1 \leq j \leq r'\}$. The weight of any edge $e = \{u_{i}, u'_{j}\} \in E$ is $w_{e} = |F_{i} \cap F'_{j}|$. The number of edges and vertices of the intersection graph are respectively denoted $n$ and $m$.


Consider two clusterings $F = \{F_{1}, \ldots, F_{r} \}$ and $F' = \{F'_{1}, \ldots, F'_{r'} \}$ of an identical dataset $P$. To a given cluster $F_i$ we associate $\mathcal{P}(i) =\frac{t_i}{t}$ where $t_i = |F_i|$ and $t = |P|$. The entropy associated with clustering $F$ is defined as $\mathcal{H} = -\sum\limits_{i=1}^r\mathcal{P}(i)\log{P(i)}$. Let $F_i \in F$ and $F'_j \in F'$, define $\mathcal{P}(i, j) = \frac{|F_i \cup F'_j|}{t}$. The mutual information between $F$ and $F'$ is defined as $\mathcal{I}(F, F') = \sum\limits_{i=1}^r\sum\limits_{j=1}^{r'}\mathcal{P}(i, j)\log{\frac{\mathcal{P}(i, j)}{\mathcal{P}(i)\mathcal{P}(j)}}$. From this, we define the variation of information between $F$ and $F'$ as $\mathcal{VI}(F, F') = \mathcal{H}(F) + \mathcal{H}(F') - 2\mathcal{I}(F, F')$.


Problem statement

Let $D \in \mathbb{N}^{+}$. Let $G = (V,E,w)$ be an intersection graph. A $D-\FM$ for $G$ is a family $\mathcal{S} = \{S_{1}, \ldots, S_{k}\}, k \geq 1$, such that, for every $i,j \in \{1, \ldots, k\}$, if $i \neq j$, then: $S_{i} \subseteq V, S_{i} \neq \emptyset, S_{i} \cap S_{j} = \emptyset$ and the graph $G[S_{i}]$ induced by the set of nodes $S_{i}$ has diameter at most $D$.


The score $\scoreDFM{\mathcal{S}}$ of a $D$-family-matching $\mathcal{S}$ is $\scoreDFM{\mathcal{S}} = \sum\limits_{i=1}^{k}~ \sum\limits_{e \in E(G[S_{i}])} w_{e}$.

Let $\mathcal{S}_{D}(G)$ be the set of all $D-\FM~$ for $G$. Let $D \in \mathbb{N}^{+}$. Given an intersection graph $G$, the $D-\FM~problem$ consists in computing $\scoreOpt{D}{G} = \max_{\mathcal{S} \in \mathcal{S}_{D}(G)} \scoreDFM{\mathcal{S}}$.


Intuitively, the D-family-matching problem involves computing disjoint subset of nodes (clusters of clusters, or meta-clusters) such that the inconsitencies are minimized (clusters which are in the same meta-cluster have a minimum number of divergent points).

Problem hardness

The decision version of the $D-\FM$ problem is NP-complete for:

  • bipartite graphs of maximum degree $3$
  • bipartite graphs of maximum degree $4$ even if the maximum weight is constant.

The $2-\FM$ problem is NP-complete for bipartite graphs of maximum degree $3$ and with unary weights.

There exist polynomial time algorithms for certain classes of graphs. Denoting $n$ and $m$ the number of vertices and edges of the intersection graph, the following is proved in [46] :

  • For a tree $T$, the $D-\FM$ problem can be solved in $O(D^2\Delta^2n)$ time (where $\Delta$ is the maximum degree of $T$).
  • For a path $P$, the $D-\FM$ problem can be solved in $O(Dn)$ time.
  • For a cycle $C$ (or any graph of maximum degree 2), the $D-\FM$ problem can be solved in $O(D^2n)$ time.

Algorithms

The implemented algorithms follows the generic design presented in [46] .

\begin{algorithm} \begin{algorithmic}[1] \REQUIRE{An intersection graph $G = (V,E,w)$, an integer $D \geq 1$, a property $\Pi$, a spanning tree generator $\mathcal{R}$, and an algorithm~$\mathcal{A}$.} \STATE $\mathcal{M} := \emptyset$, $\lambda := 0$ \WHILE{$\neg~\Pi(\mathcal{M})$} \STATE $\lambda : = \lambda + 1$; Compute the spanning tree $T^{\lambda} := \mathcal{R}(G,\lambda)$ \STATE Compute $\mathcal{S}^{\lambda}$ by using Algorithm~$\mathcal{A}(G,T^{\lambda},D)$; $\mathcal{M} := \mathcal{M} \cup \mathcal{S}^{\lambda}$ \ENDWHILE \RETURN $\mathcal{S} \in \mathcal{M}$ of maximum score \end{algorithmic} \caption{Generic algorithm for the $D$-\FM~problem.} \end{algorithm}

We implemented a version of the previous algorithm called $\stsGD$ (for Spanning Tree Solver):

  • the spanning tree generator $\mathcal{R}$ returns a maximum spaning tree, or a random spanning tree ;
  • the property $\Pi(\mathcal{M})$ returns true once we have computed a solution on the maximum spanning tree, as well as a solution on $n_i$ distinct random spanning trees (for a given $n_i$) or if the $p$ first distinct solutions have a score contained in an interval of size $\epsilon$.
  • $\mathcal{A}$ is the polynomial time algorithm on trees described in [46] with an additional step: edges for which both extremities belong to the same meta-cluster (and previously unaccounted for) are added to the said meta-cluster.

Implementation and functionalities

Design

The package is centered around the generic the algorithm discussed above.

  • We provide two module classes (for modules, see Module_base ): SBL::Modules::T_RST_MST_D_Family_matching_module< ModuleTraits > is the $\stsGD$ algorithm packaged in a module, SBL::Modules::T_D_Family_matching_module< Graph , TSolutionComputer > lets the user decide which solution computer to use (can be a custom one or any of the ones provided).
  • The class SBL::CADS::T_VI_computer< Clustering > provides a functor for computing the so called variation of information.

Functionalities

  • Generic input: two text files giving for each point the index of its clusters (as outputed by Cluster_engines)
  • D-family-matching output: for each clustering, a .txt file mapping each original cluster to its meta-cluster, a .dot file containing the intersection graph as well as the solution highlighted in red, a .xml file containing various statistics on all the computed solutions (number of meta-clusters, score, size, etc...)
Visualization: we also provide scripts for displaying the meta-clusters on the original point clouds.


Examples

This example loads two input clusterings and computes their 3-family matching using random spanning trees or a minimum spanning tree.

Applications

Programs

This package provides two executables to compare clusterings:

  • sbl-cmp-clust-VI.exe : computes the so-called Variation of Information.
  • sbl-cmp-clust-dfam.exe : computes the meta-clusters from two input clusterings. It is a direct implementation of the $\stsGD$ algorithm. This executable is described thereafter.

Main specifications

Main options. The main options are:

The main options are: –filenames string: Clustering filename – must be passed twice
–diameter-constraint string: Max diameter for meta-cluster
–num-iterations string: Num of iterations for random spanning trees


Input. The two input clusterings. Format specified in the package Cluster_engines .

Main output.

Consists of three files:

  • Graph of metaclusters, dot format, suffix __graph.dot: the bipartite graph whose nodes are the clusters, and the connected components defined meta-clusters
  • Metacluster file, txt format, suffix (clust1 | clust2) _meta_clusters.txt : the text file assigning clusters of the first clustering (resp. second clustering) to metaclusters.

Jupyter demo

See the following jupyter notebook:

  • Jupyter notebook file
  • D_family_matching

    D_family_matching

    Useful functions

    In [1]:
    from SBL import sbl_jupyter as sblj
    help(sblj)
    
    Help on module SBL.sbl_jupyter in SBL:
    
    NAME
        SBL.sbl_jupyter
    
    DESCRIPTION
        ## Load as follows
        #from SBL import sbl_jupyter as sblj
        #help(sblj)
        #
        ## Use as follows
        # sblj.SBLjupyter.find_and_show_images(".png", odir, 50)
    
    CLASSES
        builtins.object
            tools
        
        class tools(builtins.object)
         |  Static methods defined here:
         |  
         |  convert_eps_to_png(ifname, osize)
         |  
         |  convert_pdf_to_png(ifname, osize)
         |  
         |  find_and_convert(suffix, odir, osize)
         |      # find file with suffix, convert, and return image file
         |  
         |  find_and_show_images(suffix, odir, osize)
         |  
         |  find_file_in_output_directory(suffix, odir)
         |  
         |  show_eps_file(eps_file, osize)
         |  
         |  show_image(img)
         |  
         |  show_log_file(odir)
         |  
         |  show_pdf_file(pdf_file)
         |  
         |  show_row_n_images(images, size)
         |  
         |  ----------------------------------------------------------------------
         |  Data descriptors defined here:
         |  
         |  __dict__
         |      dictionary for instance variables (if defined)
         |  
         |  __weakref__
         |      list of weak references to the object (if defined)
    
    FILE
        /user/fcazals/home/projects/proj-soft/sbl/python/SBL/sbl_jupyter.py
    
    
    

    Toy example: comparing two clusters

    Options.

    We provide a simple example using the executable sbl-cmp-clust-dfam.exe.

    The main options of the cmpClusters in the next cell are:

    • cluster1: Path to the first cluster file
    • cluster2: Path to the second second cluster file
    • method: method (dfam or VI)
      • dfam: using the D_family_matching
      • VI: using the Variation of Information
    In [2]:
    import re  #regular expressions
    import sys #misc system
    import os
    import pdb
    import shutil # python 3 only
    
    import matplotlib.pyplot as plt
    import matplotlib.image as mplimg
    
    def cmp_clusters(cluster1, cluster2, method = "dfam"): 
        
        odir = "results-toy-%s" % method
        if os.path.exists(odir):
            os.system("rm -rf %s" % odir)
        os.system( ("mkdir %s" % odir) )
    
       # check executable exists and is visible
        exe = shutil.which("sbl-cmp-clust-%s.exe" % method)
        if exe:
            print(("Using executable %s\n" % exe))
            cmd = "sbl-cmp-clust-%s.exe -f %s -f %s -d %s --log" % (method, cluster1, cluster2,odir)
            print(cmd)
            os.system(cmd)
            
            cmd = "ls %s/*" % odir
            ofiles = os.popen(cmd).readlines()
            print("All output files:",ofiles)
            
            # we also generate the graph connecting clusters withig meta-clusters, see graph below
            if(method=="dfam"):
                cmd = "dot -Tpdf %s/sbl-d-family-matching__graph.dot -o %s/graph.pdf" % (odir,odir)
                os.system(cmd)
                sblj.tools.convert_pdf_to_png( ("%s/graph.pdf" % odir), 150 )
                img=mplimg.imread( ("%s/graph.png" % odir) )
                plt.xticks([]); plt.yticks([])
                imgplot = plt.imshow(img)
    
            #find the log file and display log file
            #log = open( ("%s/log.txt") % odir).readlines()
            #for line in log:         print(line.rstrip())
            
        else:
            print("Executable not found")
            
    
    In [3]:
    print("Marker : Calculation Started")
    #cmp_clusters("data/cluster_1.txt","data/cluster_2.txt", "VI")   
    cmp_clusters("data/cluster_1.txt","data/cluster_2.txt")   
    print("Marker : Calculation Ended")
    
    Marker : Calculation Started
    Using executable /user/fcazals/home/projects/proj-soft/sbl-install/bin/sbl-cmp-clust-dfam.exe
    
    sbl-cmp-clust-dfam.exe -f data/cluster_1.txt -f data/cluster_2.txt -d results-toy-dfam --log
    All output files: ['results-toy-dfam/sbl-d-family-matching__clust1_meta_clusters.txt\n', 'results-toy-dfam/sbl-d-family-matching__clust2_meta_clusters.txt\n', 'results-toy-dfam/sbl-d-family-matching__graph.dot\n', 'results-toy-dfam/sbl-d-family-matching__k_family_matching.xml\n', 'results-toy-dfam/sbl-d-family-matching__log.txt\n']
    Marker : Calculation Ended
    

    Comparing clusterings produced by kmeans

    In the following, we use sbl-cmp-clust-dfam.exe to compare clusterings produced by kmeans. See the package Clustering_engines for the clustering algorithms.

    In the following, we perform the following steps:

    • Step 1: prepare the individual clusterings of the datasets
    • Step 1b: plot clusterings and collect the images
    • Step 2: compute the D-family matchig with the constraint D_max given
    • Step 3: create a clustering file from the meta-cluster mapping and the original cluster file
    • Step 4: display all results
    • Step 5 : extract statistics from xml output file with PALSE
    In [4]:
    import os
    import pdb
    import sys
    
    from SBL import PALSE
    from PALSE import *
    
    #cluster the input data (contained in the .txt file) set into 5 cluster using k-means++
    points_files = ["data/points-N200-d50.txt", "data/points-N200-d50.txt"]
    k_values = [10, 20]
    
    
    def run_with_diameter_constraint(D_max):
    
        odir = "results-%s" % D_max
        if os.path.exists(odir):
            os.system("rm -rf %s" % odir)
        os.system( ("mkdir %s" % odir) )
    
        # Step 1: prepare the individual clusterings of the datasets
        clusters_files = [] # files containing cluters
        centers_files = [] # files containing cluster centers
        for i in range(0, len(points_files)):
                  cmd = "sbl-cluster-k-means-euclid.exe --k-means-selector=plusplus\
                  --points-file %s --k-means-k %d -o -d %s" % (points_files[i], k_values[i], odir)
                  print("Executing ",cmd)
                  os.system(cmd)
    
                  cmd = "find %s -name \"*k_%s__clusters.txt\"" % (odir, k_values[i])
                  clusters_files.append( os.popen(cmd).readlines()[0].rstrip() )
    
                  cmd = "find %s  -name \"*k_%s__centers.txt\"" % (odir, k_values[i])
                  centers_files.append( os.popen(cmd).readlines()[0].rstrip() )
    
        # Step 1b: plot clusterings and collect the images
        for i in range(0, len(points_files)):
            cmd = "sbl-clusters-display.py -f %s -c %s -C %s -o %s" % (points_files[i], clusters_files[i], centers_files[i], odir)
            os.system(cmd)
    
        # collect the plots
        cmd = "find %s -name \"*centers.png\"" % odir
        lines = os.popen(cmd).readlines()
        images = [line.rstrip() for line in lines]
    
        # Step 2: compute the D-family matchig with the constraint D_max given
        # Nb: we use 100 iterations by default.
        cmd = "sbl-cmp-clust-dfam.exe -f %s -f %s --diameter-constraint %d --num-iterations 100 -d %s" % (clusters_files[0], clusters_files[1], D_max, odir)
        print("Clustering comparison command:",cmd)
        os.system(cmd)
    
        # Step 3: create a clustering file from the meta-cluster mapping and the original cluster file
        cmd = "find %s -name \"*clust1_meta_clusters.txt\"" % odir
        meta_cluster_file = os.popen(cmd).readlines()[0].rstrip()
        #print(meta_cluster_file)
    
        cmd = "sbl-map-meta-clusters.py  -c %s -m %s  -o %s/final-clustering.txt" % (clusters_files[0], meta_cluster_file,odir)
        os.system(cmd)
    
        cmd = "sbl-clusters-display.py  -f %s -c %s/final-clustering.txt -o %s" % (points_files[0],odir,odir)
        print("Cluster display command for meta clustering:",cmd)
        os.system(cmd)
    
        # Step 4: display all results
        cmd = "find %s  -name \"*final-clustering.png\"" % odir
        final_png = os.popen(cmd).readlines()[0].rstrip()
    
        images.append(final_png)
        sblj.tools.show_row_n_images(images, 50)
    
            
        # Step 5 : palse
        database = PALSE_xml_DB()
        database.load_from_directory(odir, ".*family_matching")
    
        # total score
        score = database.get_all_data_values_from_database("d_family_matching/solution/best_solution/score", int)[0][0]
    
        # stats on meta-clusters
        meta_cluster_sizes = database.get_all_data_values_from_database("d_family_matching/solution/best_solution/meta_cluster_sizes/item", int)[0]
        print("Score of the matching:",score)
        print("Num meta clusters:", len(meta_cluster_sizes))
        print("Num of clusters in each meta-clusters:",meta_cluster_sizes)
    
        return odir
     
    

    We now run the previous function for increasing values of D_max.

    In particular, we analyze the incidence of the maximum diameter constraint. Given that the original input file contains 1000 points, note that the total score (num. of points found in meta-clusters) converges rapidly to the maximum possible balue when increasing D_max.

    NB: caption for figures:

    • left plot: first clustering
    • middle plot: second clustering
    • right plot: meta-clustering
    In [5]:
    run_with_diameter_constraint(2)
    
    Executing  sbl-cluster-k-means-euclid.exe --k-means-selector=plusplus              --points-file data/points-N200-d50.txt --k-means-k 10 -o -d results-2
    Executing  sbl-cluster-k-means-euclid.exe --k-means-selector=plusplus              --points-file data/points-N200-d50.txt --k-means-k 20 -o -d results-2
    Clustering comparison command: sbl-cmp-clust-dfam.exe -f results-2/sbl-cluster-k-means-euclid__f_points-N200-d50__mode_plusplus__iter_max_10__k_10__clusters.txt -f results-2/sbl-cluster-k-means-euclid__f_points-N200-d50__mode_plusplus__iter_max_10__k_20__clusters.txt --diameter-constraint 2 --num-iterations 100 -d results-2
    Cluster display command for meta clustering: sbl-clusters-display.py  -f data/points-N200-d50.txt -c results-2/final-clustering.txt -o results-2
    Figs displayed
    XML: 1 / 1 files were loaded
    
    Score of the matching: 786
    Num meta clusters: 9
    Num of clusters in each meta-clusters: [4, 3, 3, 3, 3, 3, 2, 4, 5]
    
    Out[5]:
    'results-2'
    In [6]:
    run_with_diameter_constraint(4)
    
    Executing  sbl-cluster-k-means-euclid.exe --k-means-selector=plusplus              --points-file data/points-N200-d50.txt --k-means-k 10 -o -d results-4
    Executing  sbl-cluster-k-means-euclid.exe --k-means-selector=plusplus              --points-file data/points-N200-d50.txt --k-means-k 20 -o -d results-4
    Clustering comparison command: sbl-cmp-clust-dfam.exe -f results-4/sbl-cluster-k-means-euclid__f_points-N200-d50__mode_plusplus__iter_max_10__k_10__clusters.txt -f results-4/sbl-cluster-k-means-euclid__f_points-N200-d50__mode_plusplus__iter_max_10__k_20__clusters.txt --diameter-constraint 4 --num-iterations 100 -d results-4
    Cluster display command for meta clustering: sbl-clusters-display.py  -f data/points-N200-d50.txt -c results-4/final-clustering.txt -o results-4
    Figs displayed
    XML: 1 / 1 files were loaded
    
    Score of the matching: 956
    Num meta clusters: 6
    Num of clusters in each meta-clusters: [4, 6, 9, 3, 3, 5]
    
    Out[6]:
    'results-4'
    In [7]:
    run_with_diameter_constraint(8)
    
    Executing  sbl-cluster-k-means-euclid.exe --k-means-selector=plusplus              --points-file data/points-N200-d50.txt --k-means-k 10 -o -d results-8
    Executing  sbl-cluster-k-means-euclid.exe --k-means-selector=plusplus              --points-file data/points-N200-d50.txt --k-means-k 20 -o -d results-8
    Clustering comparison command: sbl-cmp-clust-dfam.exe -f results-8/sbl-cluster-k-means-euclid__f_points-N200-d50__mode_plusplus__iter_max_10__k_10__clusters.txt -f results-8/sbl-cluster-k-means-euclid__f_points-N200-d50__mode_plusplus__iter_max_10__k_20__clusters.txt --diameter-constraint 8 --num-iterations 100 -d results-8
    Cluster display command for meta clustering: sbl-clusters-display.py  -f data/points-N200-d50.txt -c results-8/final-clustering.txt -o results-8
    Figs displayed
    XML: 1 / 1 files were loaded
    
    Score of the matching: 1000
    Num meta clusters: 4
    Num of clusters in each meta-clusters: [5, 6, 11, 8]
    
    Out[7]:
    'results-8'

The reader is referred to [46] for a detailed comparison of VI against the scores yielded by our comparison strategy.