Structural Bioinformatics Library
Template C++ / Python API for developping structural bioinformatics applications.
|
Authors: F. Cazals and T. Dreyfus
This package offers a set of topological analysis inspired by Morse theory, namely the study of real valued function defined on a topological space.
While Morse theory is classically defined in the differential setting ([127] , [28] or [16]), we are instead concerned with discrete structures (coded as graphs, see details below).
In all generality, the following constructions are of interest :
The critical points and their connections in co-dimension one (i.e., connexion from local minima to index one saddles, from index one saddles to index two saddles, etc).
The stable manifolds partition, that is a partition of the input points induced by the flow of non critical points toward critical points.
The disconnectivity forest, namely the forest representing the merge events observed when increasing the function value.
The persistence diagram of local minima when studying sub-level sets of the function.
Again, these notions are classically defined in the differential setting, so that the discrete setting requires the appropriate generalizations, in particular that of a (pseudo-)gradient [85] .
Notice again that these notions are classically defined in the differential setting, so that the discrete setting requires the appropriate generalizations, in particular that of a (pseudo-)gradient [85] .
For applications, of particular interest is the case of index 0 and index 1 critical points. When studying sub-level sets of smooth function, as done in Morse theory, connected components appear with local minima, and merge at selected index one saddles.
In the sequel, we focus on the aforementioned constructions which involve local minima and index one saddles only – more precisely substitutes defined in the discrete setting. In focusing on index 0 and 1 local minima, it turns out that the aforementioned pieces of information can be easily computed using a Union-find data structure [79]. In this package, we take a different route in this package, working with the Morse Smale Witten (MSW) chain complex, namely the bipartite graph connecting local minima and index one saddles [16]. see the package Morse_Smale_Witten_chain_complex .
Working with the MSW complex instead of the initial graphs indeed yields several benefits:
Practically, three types of inputs are supported:
Landscape simplification using topological persistence Simplification: strategy using Union-Find versus refined strategy using the Morse-Smale-Witten (MSW) complex. (A,B) Two landscapes, with critical points of indices 0 (disks) and 1 (squares). (C) The disconnectivity graph (DG) of both landscapes, namely the tree depicting the evolution of connected components of sublevel sets. Despite the differences between their MSW complexes, both landscapes share the same DG: upon passing the critical point e, the stable manifold of c merges with that born at a. (D,E) The MSW complexes of (A,B), respectively. In cancelling the pair of critical points (c, e), one does not know from the DG with which (a or b) the basin of c should be merged. But the required information is found in the MSW complex: on the landscape (A), c is merged with a; on the landscape (B), c is merged with b. |
The Morse based simplification can be applied in two rather different settings:
In performing the simplification and the re-assignment of points to stable manifold, the question arises to decide whether one should also reassign the stable manifold of saddle points or not.
For example, in simplifying a transition graph representing a molecular energy landscape, one may wish to assign to persistent basins local minima only – but not the saddle points they are connected to, as illustrated on Fig. Fig. fig-pers-tg for an illustration.
Persistence based simplification on a toy landscape with four local minima and four saddle points. At a persistence threshold p=0.5, only two minima survive, namely the global minimum and . Upon performing the simplification, the minima and flow into , and one is left with a landscape with two basins only: that of whose stable manifold is itself. There are two possibilities for the stable manifolds of these points:
|
There are two main packages :
This section describes the main class to be used in Core algorithms, and the modules to be used in applications.
The class SBL::GT::T_Morse_theory_based_analyzer has only one template parameter, that is MorseSmaleWittenChainComplex , a representation of the MSW chain complex provided by the package Morse_theory_based_analyzer. Its constructor takes a MSW chain complex and constructs the corresponding disconnectivity forest. It offers the following functionalities.
Critical points and their connections.
The method SBL::GT::T_Morse_theory_based_analyzer::get_paths returns the set of all triples (0-saddle, 1-saddle, 0-saddle) that exists in the underlying MSW chain complex : the saddle points are represented by their vertices in the input MSW chain complex.
Stable manifolds partition.
The method SBL::GT::T_Morse_theory_based_analyzer::get_stable_manifold_partition returns a partition of all the points associated to the different 0-saddles. It proceeds as follows :
This is done so that all points used for building the MSW chain complex are segregated into the stable manifolds of local minima. Note that points associated to saddles of strictly positive index may be arbitrarily assigned to different stable manifolds. To avoid duplication of such points, an arbitrary choice is made, that is, each such point is ascribed to a chosen stable manifold.
Disconnectivity forest.
The disconnectivity forest is the main data structure for understanding how index one saddles are linked together. While processing the saddles of the MSW chain complex by increasing function value, the connected components of points associated to local minima merge. The internal vertices of the disconnectivity forest are exactly the index one saddles from which two different connected components merge.
More precisely, consider the sub-level set of all points that are below a threshold: those points make different components associated to different local minima. When the threshold is under the elevation of the lowest local minimum, the sub-level set is empty. When rising the threshold value, two events may occur :
When a point which is not a saddle has an associated value passing under the threshold, it is simply added to the connected component associated to the saddle of this point : there is no change in the disconnectivity forest.
The class SBL::GT::T_Morse_theory_based_analyzer provides the following operations on the disconnectivity forest :
Persistence diagram.
The persistence of a saddle is the measure of the lifetime of the homological feature associated with this saddle. For example, the persistence of a local minimum is the lifetime of the connected component which gets created at the function value of the minimum.
When two components merge, the smallest component dies while the largest component grows. Hence, the persistence of a saddle s is the difference between the value associated to the saddle where the component of s dies, and the value of s itself. The persistence interval of a saddle is just defined by these two values. When a saddle cannot be merged anymore, it has an infinite persistence.
The persistence diagram is defined as the list of pairs of vertices in the disconnectivity forest such that a component is created at the value of the first vertex, and dies at the value of the second vertex. The list is sorted by ascending values of persistences. Three operations are available on persistence diagrams:
Sub-level sets assignment.
A sub-level set is a component of points as defined previously where all the points are below a given threshold. Hence, all points whose values are below the threshold are assigned to a sub-level set. Assignment refers to the fact that points above the threshold are not included in the sub-level sets, so that sub-level sets are not a partition of all the input points of the MSW chain complex. Accessing the sub-level sets assignment for a given threshold value is done using the method SBL::GT::T_Morse_theory_based_analyzer::get_sublevelsets_connected_components
Simplification.
While all the previous operations aim to access specific analysis on the MSW chain complex, the simplification operation aims to remove the least persistent saddles from the analysis. In particular, it helps to eliminate noise and spurious saddles, so to have more smooth analysis of the MSW chain complex. See Morse_Smale_Witten_chain_complex for a full description of the simplification algorithm. To simplify the MSW chain complex so as to eliminate all saddles having a persistence under a given threshold, one has to use the method SBL::GT::T_Morse_theory_based_analyzer::simplify .
Note that this method simplifies the MSW chain complex and recompute the disconnectivity forest and the persistence diagram.
Base module.
The module SBL::Modules::T_Morse_theory_based_analyzer_module< ModuleTraits , MorseSmaleWittenChainComplex > is a generic class computing the different described analysis over an input MSW chain. The template parameter MorseSmaleWittenChainComplex is the representation of the MSW chain complex and should be an instantiation of the class SBL::GT::T_Morse_Smale_Witten_chain_complex . When the template parameter ModuleTraits defines the type Morse_Smale_Witten_chain_complex, there is no need of specifying this second template parameter.
In addition to the previously described analysis, the module provides command-line options for customizing the analysis :
NB: when this option is used, selected samples are filtered out from stable manifolds whose union is computed to define the connected components of sublevel sets. Results are stored in files prefixed with subleveltset_cluster, not to be confused with files prefixed with stable_manifold_cluster (see below).
NB: when this option is used, stable manifolds, aka clusters, are stored in output files prefixed with stable_manifold_cluster.
Specialized modules.
Remind that the MSW chain complex can be built from a Nearest Neighbors Graph (NNG), a Weighted Graph (WG), or a Vertex Weighted Graph (VWG). The previous module is specialized for each possible input graph :
For example, the weight of a vertex can be the minimum weight over its incident edges, minus a user-defined epsilon value. In all cases, the the value of a vertex is required to be less than the values of its incident edges.
Note that in using a VWG, weights on edges are not required. If undefined, the maximum weight over the incident vertices is used.
The following example shows how to manually build a MSW chain complex and perform analysis over it. It creates an abstract landscape as a graph where vertices are the critical points of the landscape and edges link neighbors in the landscape. Critical points are actually abstract, meaning that they have no geometrical representation (i.e they are just vertices in a graph). Then, it builds the MSW chain complex by adding the saddles and linking them. Finally, it builds the analyzer engine and displays the different analysis.
This example shows how to use the module SBL::Modules::T_Morse_theory_based_analyzer_for_NNG_module for analyzing an input set of 2D points elevated by input weights. It first computes a NNG of the input set of points, and then computes the MSW chain complex from which the analysis are made.
This example shows how to use the module SBL::Modules::T_Morse_theory_based_analyzer_for_weighted_graph_module for analyzing an input WG, that is an edge weighted graph. Note that the input WG does not define weights over the vertices, but the algorithm requires such weights. Hence, it is the functor Get_weight that defines the weight of a vertex as the minimal weight among its incident edges.
This example shows how to use the module SBL::Modules::T_Morse_theory_based_analyzer_for_vertex_weighted_graph_module for analyzing an input VWG. Note that the input VWG does not define weights over the edges, and the algorithm does not require such weights since they can be estimated from the weights over the vertices.
This package offers also a number of programs applying the analysis to the following cases :
Below, we list of the specific options of the various executables.
Main options: sbl-Morse-theory-based-analyzer-nng-euclid-density-gauss.exe The main options are:
Main options: sbl-Morse-theory-based-analyzer-nng-euclid-density-gauss.exe
The options are identical to those just described. The only difference is that the elevation is give – as opposed to computed from a density estimation:
Main options: sbl-Morse-theory-based-analyzer-wg.exe
The main options are:
Input. See above.
Main output. The main file generated, common to all programs, are the following ones:
See the following jupyter notebook:
from SBL import sbl_jupyter as sblj
help(sblj)
As a first illustration, we analyze the density estimated from a point cloud (in 2D here). Using persisstence, we select the most persistent modes of the estimated density, and retrun one cluster for each mode.
import os
import shutil # python 3 only
import matplotlib.pyplot as plt
import matplotlib.image as mplimg
from SBL import sbl_jupyter as sblj
def analyze_density(ifile, num_neighbors = 8, persistence = -1):
odir = "results-dens"
if os.path.exists(odir):
os.system("rm -rf %s" % odir)
os.system( ("mkdir %s" % odir) )
exe = shutil.which("sbl-Morse-theory-based-analyzer-nng-euclid-density-gauss.exe")
if exe:
cmd = "%s --points-file %s --num-neighbors %d --persistence-threshold %.2f \
-d %s" % (exe, ifile, num_neighbors, persistence, odir)
print(cmd)
os.system(cmd)
# find the partition of input points into clusters
cmd = "find %s -name \"*stable_manifold_partition.xml\"" % odir
print(cmd)
xml_file = os.popen(cmd).readlines()[0].rstrip()
# generate the image
cmd = "sbl-clusters-display.py -f %s -c %s -o %s" % (ifile, xml_file,odir)
print(cmd)
os.system(cmd)
# display images
images = []
images.append( sblj.tools.find_and_convert("stable_manifold_partition.png", odir, 100) )
images.append( sblj.tools.find_and_convert("disconnectivity_forest.eps", odir, 100) )
images.append( sblj.tools.find_and_convert("persistence_diagram.pdf", odir, 100) )
sblj.tools.show_row_n_images(images, 100)
else:
print("Executable not found")
analyze_density("data/points-N200-d50.txt", persistence = -1)
analyze_density("data/points-N200-d50.txt", persistence = 0.75)
We process a graph with weights on edges. Note that edges of the graph only are required. An edge is a pair of vertices represented by their indices, so that vertices are implicitely created from edge declarations.
import sys #misc system
import os
import pdb
import shutil # python 3 only
def analyze_weighted_graph(edges, weights, vertexWeights = None, persistence = -1):
odir = "results-wg"
if os.path.exists(odir):
os.system("rm -rf %s" % odir)
os.system( ("mkdir %s" % odir) )
# check executable exists and is visible
exe = shutil.which("sbl-Morse-theory-based-analyzer-wg.exe")
if exe:
print(("Using executable %s\n" % exe))
cmd = "sbl-Morse-theory-based-analyzer-wg.exe -v -l -d %s\
--edges-file %s --edge-weights-file %s --persistence-threshold %f %\
(odir, edges, weights, persistence)
if vertexWeights:
cmd += " --vertex-weights-file %s" % (vertexWeights)
os.system(cmd)
cmd = "ls %s" % odir
ofnames = os.popen(cmd).readlines()
print("All output files:",ofnames)
# display_log_file(odir)
images = []
images.append( sblj.tools.find_and_convert("disconnectivity_forest.eps", odir, 150) )
images.append( sblj.tools.find_and_convert("persistence_diagram.pdf", odir, 150) )
sblj.tools.show_row_n_images(images, 150)
else:
print("Executable not found")
analyze_weighted_graph("data/bln69_transitions.txt", "data/bln69_transitions_heights.txt")
#analyze_weighted_graph("data/bln69_transitions.txt", "data/bln69_transitions_heights.txt", "data/bln69_sampling_heights.txt")