Structural Bioinformatics Library
Template C++ / Python API for developing structural bioinformatics applications.
User Manual

D_family_matching

Authors: R. Tetley and F. Cazals and D. Mazauric

Introduction

While clustering is ubiquitous in data science, with selected methods provided in the package Cluster_engines, picking a particular clustering algorithm and tuning its parameters is in general a non-trivial task.

In order to alleviate this task, this package provides methods to compare two clusterings, by computing a mapping between the clusters of these clusterings. In doing so, groups of clusters called meta-clusters are formed within each clustering, and a 1-to-1 correspondence between these meta-clusters is provided. (We note in passing that our mapping goes beyond a 1-to-1 matching between the clusters [92], [55].)

The following comments are in order:

  • It has been noticed [134] that the right number of clusters in a dataset often depends on the scale at which the dataset is inspected. In this package, we provide algorithms parameterized by a scale parameter $D$, namely the maximum diameter of the subgraphs connecting clusters to define meta-clusters.
  • Various cluster comparison measures have been proposed, and we refer the reader to [35] for a thorough review of these. The main drawbacks of these methods are that (i) they typically return a single number qualifying the coherence between the clusterings, without any mapping between the clusters, and (ii) they do not accommodate any scale parameter.

In the sequel, we formalize the clustering comparison problem in a manner answering the two needs just discussed, and document the classes provided. For comparison purposes, this package also provides an implementation of the variation of information (VI) [104], a metric based on information theoretical concepts.

Algorithms

Terminology and pre-requisites

Consider a dataset $P$ consisting of $t$ observations (data points or points for short). A clustering is a partition of the input dataset into disjoint subsets called clusters.


A partial clustering is a collection of clusters whose union is strictly contained in the input dataset (e.g. some outliers have been removed).


Consider a dataset $P$ and two partial clusterings $F = \{F_{1}, \ldots, F_{r} \}$ and $F' = \{F'_{1}, \ldots, F'_{r'} \}$ of $P$. The edge-weighted intersection graph $G = (U, U', E, w)$ is then constructed as follows. The set $U = \{u_{1}, \dots, u_{r}\}$ corresponds to the clustering $F$. To each vertex $u_{i}$, we associate the set $F_{i} \in F$. The set $U' = \{u'_{1}, \dots, u'_{r'}\}$ corresponds to the clustering $F'$. To each vertex $u'_{i}$, we associate the set $F'_{i} \in F'$. The set of edges of $G$ is $E = \{\{u_{i},u'_{j}\} \mid F_{i} \cap F'_{j} \neq \emptyset, 1 \leq i \leq r, 1 \leq j \leq r'\}$. The weight of any edge $e = \{u_{i}, u'_{j}\} \in E$ is $w_{e} = |F_{i} \cap F'_{j}|$. The number of vertices and edges of the intersection graph are respectively denoted $n$ and $m$.
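
To make the construction concrete, the following minimal sketch builds such an edge-weighted intersection graph with the Boost Graph Library. The types and the function name are illustrative placeholders (not the SBL classes documented below); each clustering is assumed to be given as a vector of clusters, a cluster being the set of indices of the points it contains.

#include <boost/graph/adjacency_list.hpp>
#include <algorithm>
#include <iterator>
#include <set>
#include <vector>

// Illustrative types, not the SBL classes.
typedef std::set<unsigned> Simple_cluster;
typedef std::vector<Simple_cluster> Simple_clustering;
typedef boost::property<boost::edge_weight_t, int> Weight;
typedef boost::adjacency_list<boost::vecS, boost::vecS, boost::undirectedS,
                              boost::no_property, Weight> Simple_intersection_graph;

// Build G = (U, U', E, w): vertices 0..r-1 stand for F, vertices r..r+r'-1 for F';
// an edge {u_i, u'_j} is added whenever F_i and F'_j intersect, weighted by |F_i inter F'_j|.
Simple_intersection_graph make_intersection_graph(const Simple_clustering& F,
                                                  const Simple_clustering& Fp)
{
  Simple_intersection_graph g(F.size() + Fp.size());
  for (std::size_t i = 0; i < F.size(); ++i)
    for (std::size_t j = 0; j < Fp.size(); ++j)
    {
      std::vector<unsigned> common;
      std::set_intersection(F[i].begin(), F[i].end(), Fp[j].begin(), Fp[j].end(),
                            std::back_inserter(common));
      if (!common.empty())
        boost::add_edge(i, F.size() + j, Weight(static_cast<int>(common.size())), g);
    }
  return g;
}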


Consider two clusterings $F = \{F_{1}, \ldots, F_{r} \}$ and $F' = \{F'_{1}, \ldots, F'_{r'} \}$ of an identical dataset $P$. To a given cluster $F_i$ we associate $\mathcal{P}(i) =\frac{t_i}{t}$ where $t_i = |F_i|$ and $t = |P|$; likewise, $\mathcal{P}'(j) = \frac{|F'_j|}{t}$. The entropy associated with clustering $F$ is defined as $\mathcal{H}(F) = -\sum\limits_{i=1}^r\mathcal{P}(i)\log{\mathcal{P}(i)}$ (and similarly for $\mathcal{H}(F')$). For $F_i \in F$ and $F'_j \in F'$, define $\mathcal{P}(i, j) = \frac{|F_i \cap F'_j|}{t}$. The mutual information between $F$ and $F'$ is defined as $\mathcal{I}(F, F') = \sum\limits_{i=1}^r\sum\limits_{j=1}^{r'}\mathcal{P}(i, j)\log{\frac{\mathcal{P}(i, j)}{\mathcal{P}(i)\mathcal{P}'(j)}}$. From this, we define the variation of information between $F$ and $F'$ as $\mathcal{VI}(F, F') = \mathcal{H}(F) + \mathcal{H}(F') - 2\mathcal{I}(F, F')$.
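
For reference, the variation of information can be computed directly from two label vectors, as in the following minimal sketch written against the definitions above. It is an illustrative stand-alone function, not the SBL's T_VI_computer class described later.

#include <cmath>
#include <map>
#include <utility>
#include <vector>

// Illustrative sketch: labels_F[i] (resp. labels_Fp[i]) is the cluster index of point i in F (resp. F').
double variation_of_information(const std::vector<int>& labels_F, const std::vector<int>& labels_Fp)
{
  const double t = static_cast<double>(labels_F.size());
  std::map<int, double> p;                        // P(i)   = |F_i| / t
  std::map<int, double> pp;                       // P'(j)  = |F'_j| / t
  std::map<std::pair<int, int>, double> pj;       // P(i,j) = |F_i inter F'_j| / t
  for (std::size_t k = 0; k < labels_F.size(); ++k)
  {
    p[labels_F[k]]   += 1.0 / t;
    pp[labels_Fp[k]] += 1.0 / t;
    pj[std::make_pair(labels_F[k], labels_Fp[k])] += 1.0 / t;
  }
  double H = 0, Hp = 0, I = 0;
  for (std::map<int, double>::const_iterator it = p.begin(); it != p.end(); ++it)
    H -= it->second * std::log(it->second);
  for (std::map<int, double>::const_iterator it = pp.begin(); it != pp.end(); ++it)
    Hp -= it->second * std::log(it->second);
  for (std::map<std::pair<int, int>, double>::const_iterator it = pj.begin(); it != pj.end(); ++it)
    I += it->second * std::log(it->second / (p[it->first.first] * pp[it->first.second]));
  return H + Hp - 2.0 * I;                        // VI(F, F') = H(F) + H(F') - 2 I(F, F')
}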


Problem statement

Let $D \in \mathbb{N}^{+}$. Let $G = (V,E,w)$ be an intersection graph. A $D$-family-matching for $G$ is a family $\mathcal{S} = \{S_{1}, \ldots, S_{k}\}$, $k \geq 1$, such that, for every $i \in \{1, \ldots, k\}$: $S_{i} \subseteq V$, $S_{i} \neq \emptyset$, and the graph $G[S_{i}]$ induced by the set of nodes $S_{i}$ has diameter at most $D$; moreover, $S_{i} \cap S_{j} = \emptyset$ for every $i \neq j$.


The score $\Phi(\mathcal{S})$ of a $D$-family-matching $\mathcal{S}$ is $\Phi(\mathcal{S}) = \sum\limits_{i=1}^{k}~ \sum\limits_{e \in E(G[S_{i}])} w_{e}$.

Let $\mathcal{S}_{D}(G)$ be the set of all $D$-family-matchings for $G$. Let $D \in \mathbb{N}^{+}$. Given an intersection graph $G$, the $D$-family-matching problem consists in computing $\Phi_{D}(G) = \max_{\mathcal{S} \in \mathcal{S}_{D}(G)} \Phi(\mathcal{S})$.


Intuitively, the D-family-matching problem involves computing disjoint subsets of nodes (clusters of clusters, or meta-clusters) such that the inconsistencies are minimized: clusters which are in the same meta-cluster have a minimum number of divergent points.
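
As a small illustration (with made-up numbers): suppose $F$ has two clusters, $F'$ has three, and the intersection graph contains the edges $\{u_1,u'_1\}$, $\{u_1,u'_2\}$, $\{u_2,u'_3\}$ and $\{u_2,u'_2\}$ of respective weights 40, 35, 20 and 5. For $D=2$, the family $\mathcal{S} = \{\{u_1,u'_1,u'_2\}, \{u_2,u'_3\}\}$ is a valid $2$-family-matching: each induced subgraph has diameter at most 2, and its score is $\Phi(\mathcal{S}) = 40+35+20 = 95$. The edge $\{u_2,u'_2\}$ joins two different meta-clusters, so its 5 points are the divergent points left unaccounted for.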

Problem hardness

The decision version of the $D$-family-matching problem is NP-complete for:

  • bipartite graphs of maximum degree $3$;
  • bipartite graphs of maximum degree $4$, even if the maximum weight is constant.

The $2$-family-matching problem is NP-complete for bipartite graphs of maximum degree $3$ and with unary weights.

There exist polynomial time algorithms for certain classes of graphs. Denoting $n$ and $m$ the number of vertices and edges of the intersection graph, the following is proved in [35]:

  • For a tree $T$, the $D$-family-matching problem can be solved in $O(D^2\Delta^2n)$ time (where $\Delta$ is the maximum degree of $T$).
  • For a path $P$, the $D$-family-matching problem can be solved in $O(Dn)$ time.
  • For a cycle $C$ (or any graph of maximum degree $2$), the $D$-family-matching problem can be solved in $O(D^2n)$ time.

Algorithms

The implemented algorithms follow the generic design presented in [35].

Generic algorithm for the $D$-family-matching problem:

Require: an intersection graph $G = (V,E,w)$, an integer $D \geq 1$, a property $\Pi$, a spanning tree generator $\mathcal{R}$, and an algorithm $\mathcal{A}$.
1. $\mathcal{M} := \emptyset$, $\lambda := 0$
2. while $\neg~\Pi(\mathcal{M})$ do
3.     $\lambda := \lambda + 1$; compute the spanning tree $T^{\lambda} := \mathcal{R}(G,\lambda)$
4.     compute $\mathcal{S}^{\lambda}$ by using Algorithm $\mathcal{A}(G,T^{\lambda},D)$; $\mathcal{M} := \mathcal{M} \cup \{\mathcal{S}^{\lambda}\}$
5. end while
6. return $\mathcal{S} \in \mathcal{M}$ of maximum score
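
The loop structure can be sketched in C++ as follows. The type and function names (TreeGenerator for $\mathcal{R}$, TreeSolver for $\mathcal{A}$, StopProperty for $\Pi$, and the score() accessor on a solution) are illustrative placeholders chosen for this sketch, not the SBL API.

#include <cstddef>
#include <vector>

// Illustrative sketch of the generic loop; all names are placeholders, not the SBL API.
// R(g, lambda) returns a spanning tree, A(g, tree, D) returns a D-family-matching on that tree,
// Pi(M) decides when to stop, and Solution::score() returns the score Phi of a solution.
template <class Graph, class Tree, class Solution,
          class TreeGenerator, class TreeSolver, class StopProperty>
Solution generic_D_family_matching(const Graph& g, unsigned D,
                                   TreeGenerator R, TreeSolver A, StopProperty Pi)
{
  std::vector<Solution> M;
  unsigned lambda = 0;
  while (!Pi(M))
  {
    ++lambda;
    Tree t = R(g, lambda);     // e.g. a maximum spanning tree, then random spanning trees
    M.push_back(A(g, t, D));   // polynomial-time solver on the tree
  }
  // return the solution of maximum score
  std::size_t best = 0;
  for (std::size_t i = 1; i < M.size(); ++i)
    if (M[i].score() > M[best].score())
      best = i;
  return M[best];
}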

We implemented a version of the previous algorithm called STS (for Spanning Tree Solver):

  • the spanning tree generator $\mathcal{R}$ returns a maximum spanning tree, or a random spanning tree;
  • the property $\Pi(\mathcal{M})$ returns true once a solution has been computed on the maximum spanning tree as well as on $n_i$ distinct random spanning trees (for a given $n_i$), or once the $p$ first distinct solutions have scores contained in an interval of size $\epsilon$;
  • $\mathcal{A}$ is the polynomial time algorithm on trees described in [35] with an additional step: edges for which both extremities belong to the same meta-cluster (and previously unaccounted for) are added to the said meta-cluster.

Implementation and functionalities

Design

The package is centered around the generic algorithm discussed above.

  • We provide two module classes (for modules, see Module_base ): SBL::Modules::T_RST_MST_D_Family_matching_module< ModuleTraits > packages the STS algorithm in a module; SBL::Modules::T_D_Family_matching_module< Graph , TSolutionComputer > lets the user decide which solution computer to use (a custom one, or any of those provided).
  • The class SBL::CADS::T_VI_computer< Clustering > provides a functor for computing the variation of information.

Functionalities

  • Generic input: two text files giving, for each point, the index of its cluster (as output by Cluster_engines).
  • D-family-matching output: for each clustering, a .txt file mapping each original cluster to its meta-cluster, a .dot file containing the intersection graph as well as the solution highlighted in red, and a .xml file containing various statistics on all the computed solutions (number of meta-clusters, score, size, etc.).
  • Visualization: we also provide scripts for displaying the meta-clusters on the original point clouds.


Examples

This example loads two input clusterings and computes their 3-family-matching using random spanning trees or a maximum spanning tree.

#include <SBL/CADS/Solution_computers.hpp>
#include <SBL/IO/Clustering_loader_with_graph.hpp>
#include <cstdlib>
#include <iostream>
#include <vector>

// An individual element of a cluster
struct Cluster_element
{
  unsigned m_element_number;

  Cluster_element() {}
  Cluster_element(unsigned number) : m_element_number(number) {}

  unsigned element_number()       { return this->m_element_number; }
  unsigned element_number() const { return this->m_element_number; }

  template <class Archive>
  void serialize(Archive& ar, const unsigned int version)
  {
    ar & boost::serialization::make_nvp("element", this->m_element_number);
  }
};

// A cluster type
typedef std::vector<Cluster_element> Cluster;
// A clustering
typedef std::vector<Cluster> Clustering;

// Property for the vertices of the graph
struct Vertex_property
{
  // cluster associated with the vertex
  Cluster* cluster;
  // index of the cluster (in the given clustering)
  unsigned cluster_index;
  // index of the meta-cluster
  int meta_index;
  // part of the graph which this vertex belongs to
  unsigned part;
};

// The weights correspond to the size of the cluster intersections
typedef boost::property<boost::edge_weight_t, int> Edge_weight_property;

// Weighted graph of intersections
typedef boost::adjacency_list<boost::vecS,
                              boost::vecS,
                              boost::undirectedS,
                              Vertex_property,
                              Edge_weight_property> Intersection_graph;

typedef SBL::IO::T_Clustering_loader<Intersection_graph, Clustering> Clustering_loader;

// Solution computers and the types they expose
typedef SBL::CADS::T_MST_Solution_computer<Intersection_graph> MST_Solution_computer;
typedef SBL::CADS::T_RST_Solution_computer<Intersection_graph> RST_Solution_computer;
typedef MST_Solution_computer::Solution_type          MST_Solution_type;
typedef RST_Solution_computer::Solution_type          RST_Solution_type;
typedef MST_Solution_computer::Meta_clusters_iterator MST_meta_cluster_iterator;
typedef MST_Solution_computer::Clusters_iterator      MST_cluster_iterator;
typedef RST_Solution_computer::Meta_clusters_iterator RST_meta_cluster_iterator;
typedef RST_Solution_computer::Clusters_iterator      RST_cluster_iterator;
int main(int argc, char *argv[])
{
  if (argc != 3)
  {
    std::cout << "Provide two clusterings for comparison!" << std::endl;
    std::exit(EXIT_FAILURE);
  }

  // Load the clusterings and build the intersection graph (done within the loader class)
  Clustering_loader loader;
  loader.add_file_name(argv[1]);
  loader.add_file_name(argv[2]);
  loader.load(true, std::cout);
  Intersection_graph* g = loader.get_graph();

  // Build the MST solution computer and compute a solution for D = 3
  MST_Solution_type mst_solution;
  MST_Solution_computer mst(g);
  unsigned mst_score = mst(3, mst_solution);

  // Build the RST solution computer and compute a solution for D = 3
  RST_Solution_type rst_solutions;
  RST_Solution_computer rst(g);
  unsigned rst_score = rst(3, rst_solutions);

  std::cout << "MST 3-family matching: \nScore: " << mst_score << std::endl;
  std::cout << "Meta clusters: " << std::endl;

  // Showcase the iterators: for each meta-cluster, list the clusters it contains
  MST_meta_cluster_iterator mm_it;
  MST_cluster_iterator mc_it;
  for (mm_it = mst.meta_clusters_begin(); mm_it != mst.meta_clusters_end(); mm_it++)
  {
    mc_it = mst.clusters_in_meta_cluster_begin(*mm_it);
    std::cout << *mm_it << " <==> {" << *mc_it;
    mc_it++;
    for (; mc_it != mst.clusters_in_meta_cluster_end(*mm_it); mc_it++)
      std::cout << ", " << *mc_it;
    std::cout << "}" << std::endl;
  }

  std::cout << "RST 3-family matching: \nScore: " << rst_score << std::endl;
  std::cout << "Meta clusters: " << std::endl;
  RST_meta_cluster_iterator rm_it;
  RST_cluster_iterator rc_it;
  for (rm_it = rst.meta_clusters_begin(); rm_it != rst.meta_clusters_end(); rm_it++)
  {
    rc_it = rst.clusters_in_meta_cluster_begin(*rm_it);
    std::cout << *rm_it << " <==> {" << *rc_it;
    rc_it++;
    for (; rc_it != rst.clusters_in_meta_cluster_end(*rm_it); rc_it++)
      std::cout << ", " << *rc_it;
    std::cout << "}" << std::endl;
  }
  return 0;
}
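
Assuming the SBL and Boost headers are available on the include path, this example can be compiled with any recent C++ compiler and run with two clustering text files (in the format output by Cluster_engines) as arguments.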

Applications

This package provides an executable to compute the meta-clusters from two input clusterings. It is a direct implementation of the STS algorithm.

Example 1: Computing the $D$-family-matching from two clusterings obtained with the k-means++ algorithm (provided in Cluster_engines), then visualizing the corresponding clusters and meta-clusters:

#cluster the input data set (contained in the .txt file) into 5 clusters using k-means++
sbl-cluster-k-means-euclid.exe --k-means-k 5 --points-file points-N1000-d20.txt --k-means-selector=plusplus -o

#display the clusters
sbl-clusters-display.py -f points-N1000-d20.txt -c sbl-cluster-k-means-euclid__f_points-N1000-d20__mode_plusplus__iter_max_10__k_5__clusters.txt -C sbl-cluster-k-means-euclid__f_points-N1000-d20__mode_plusplus__iter_max_10__k_5__centers.txt

#cluster the input data set (contained in the .txt file) into 20 clusters using k-means++
sbl-cluster-k-means-euclid.exe --k-means-k 20 --points-file points-N1000-d20.txt --k-means-selector=plusplus -o

#display the clusters
sbl-clusters-display.py -f points-N1000-d20.txt -c sbl-cluster-k-means-euclid__f_points-N1000-d20__mode_plusplus__iter_max_10__k_20__clusters.txt -C sbl-cluster-k-means-euclid__f_points-N1000-d20__mode_plusplus__iter_max_10__k_20__centers.txt

#compute the d-family-matching on the two previous clusterings with D=4 and 10000 iterations
sbl-d-family-matching.exe -f sbl-cluster-k-means-euclid__f_points-N1000-d20__mode_plusplus__iter_max_10__k_5__clusters.txt -f sbl-cluster-k-means-euclid__f_points-N1000-d20__mode_plusplus__iter_max_10__k_20__clusters.txt --diameter-constraint 4 --num-iterations 10000

#create a clustering file from the meta-cluster mapping and the original cluster file
sbl-map-meta-clusters.py -c sbl-cluster-k-means-euclid__f_points-N1000-d20__mode_plusplus__iter_max_10__k_5__clusters.txt -m sbl-d-family-matching__clust1-meta-clusters.txt -o meta-clusters.txt

#display the meta-clusters
sbl-clusters-display.py -c meta-clusters.txt -f points-N1000-d20.txt 

#compute the d-family-matching but add a threshold on the intersection graph edge weights
sbl-d-family-matching.exe -f sbl-cluster-k-means-euclid__f_points-N1000-d20__mode_plusplus__iter_max_10__k_5__clusters.txt -f sbl-cluster-k-means-euclid__f_points-N1000-d20__mode_plusplus__iter_max_10__k_20__clusters.txt --diameter-constraint 4 --num-iterations 10000 --intersection-threshold 10

#create a clustering file
sbl-map-meta-clusters.py -c sbl-cluster-k-means-euclid__f_points-N1000-d20__mode_plusplus__iter_max_10__k_5__clusters.txt -m sbl-d-family-matching__clust1-meta-clusters.txt -o meta-clusters.txt

#display the meta clusters
sbl-clusters-display.py -c meta-clusters.txt -f points-N1000-d20.txt 

(Figure: four panels (A)-(D), described in the caption below.)

(A) The result of k-means++ with five clusters. (B) The result of k-means++ with twenty clusters. (C) The result of the STS algorithm after 10000 iterations; note that the algorithm is sensitive to outliers, which here results in a single meta-cluster being returned. (D) The result of the STS algorithm after 10000 iterations with an intersection threshold of 10. The --intersection-threshold option specifies a threshold on the edge weights: edges whose weight falls below it are not added to the intersection graph. This makes it possible to identify the "correct" meta-clusters for this example.

Example 2: Computing the variation of information from the two previous clusterings and comparing it to the $D$-family-matching score for various values of $D$. We use normalized versions of $\mathcal{VI}$ and of our score ($s_{\mathcal{VI}} = \mathcal{VI} / \log{t}$, $s_{\Phi} = 1 - \Phi/t$):
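
(As an illustration of the normalization, with made-up numbers: for $t = 1000$ points, a hypothetical raw score of $\Phi = 920$ gives $s_{\Phi} = 1 - 920/1000 = 0.08$.)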

#compute the variation of information:
sbl-VI.exe -f sbl-cluster-k-means-euclid__f_points-N1000-d20__mode_plusplus__iter_max_10__k_5__clusters.txt -f sbl-cluster-k-means-euclid__f_points-N1000-d20__mode_plusplus__iter_max_10__k_20__clusters.txt 

#compute the d-family-matching on the two previous clusterings with D=1 and 10000 iterations
sbl-d-family-matching.exe -f sbl-cluster-k-means-euclid__f_points-N1000-d20__mode_plusplus__iter_max_10__k_5__clusters.txt -f sbl-cluster-k-means-euclid__f_points-N1000-d20__mode_plusplus__iter_max_10__k_20__clusters.txt --diameter-constraint 1 --num-iterations 10000

#compute the d-family-matching on the two previous clusterings with D=2 and 10000 iterations
sbl-d-family-matching.exe -f sbl-cluster-k-means-euclid__f_points-N1000-d20__mode_plusplus__iter_max_10__k_5__clusters.txt -f sbl-cluster-k-means-euclid__f_points-N1000-d20__mode_plusplus__iter_max_10__k_20__clusters.txt --diameter-constraint 2 --num-iterations 10000

#compute the d-family-matching on the two previous clusterings with D=3 and 10000 iterations
sbl-d-family-matching.exe -f sbl-cluster-k-means-euclid__f_points-N1000-d20__mode_plusplus__iter_max_10__k_5__clusters.txt -f sbl-cluster-k-means-euclid__f_points-N1000-d20__mode_plusplus__iter_max_10__k_20__clusters.txt --diameter-constraint 3 --num-iterations 10000

#compute the d-family-matching on the two previous clusterings with D=4 and 10000 iterations
sbl-d-family-matching.exe -f sbl-cluster-k-means-euclid__f_points-N1000-d20__mode_plusplus__iter_max_10__k_5__clusters.txt -f sbl-cluster-k-means-euclid__f_points-N1000-d20__mode_plusplus__iter_max_10__k_20__clusters.txt --diameter-constraint 4 --num-iterations 10000

$s_{\mathcal{VI}}$: 0.45
$s_{\Phi}, D=1$: 0.33
$s_{\Phi}, D=2$: 0.08
$s_{\Phi}, D=3$: 0.08
$s_{\Phi}, D=4$: 0

The normalized score $s_{\Phi}$ rapidly converges to 0, as most of the smaller clusters of the second clustering can be aggregated into one of the bigger clusters of the first clustering once $D \geq 2$.

The reader is referred to [35] for a detailed comparison of VI against the scores yielded by our comparison strategy.