Structural Bioinformatics Library
Template C++ / Python API for developping structural bioinformatics applications.

Authors: F. Cazals and T. Dreyfus and D. Mazauric
This package package provides method to perform various comparisons of two samples energy landscapes, taking into account the location of local minima, their occupancy probabilities, and possible their connexsions as encoded in a transition graph.
Comparing two (sampled) energy landscapes is of interest in various settings, e.g., to assess the coherence of two force fields for a given system (atomic, coarse grained), to compare two related systems (e.g. a wild type and mutant protein), or simply to compare simulations launched with different initial conditions (and check whether the same regions in conformational space have been visited).
In comparing two landscapes, two categories of criteria are of interest, namely
features of the basins, in particular the local minima and their associated occupancy probabilities, called masses for the sake of genericity in the sequel.
This package provides methods to compare (sampled) energy landscapes, in two guises:
Earth Mover Distance: a comparison method exploiting solely the location of local minima, and the masses of the basins.
These functionalities are available in the programs and .
Landscapes and vertex weighted TG. In the sequel, we assume that the energy landscape (EL) is coded in a compressed transition graphs, as defined in section Energy landscapes . We also assume that each vertex is endowed with a mass, typically the volume or the occupancy probability of its catchment basin. The reader is also referred to the package dedicated to the construction of transition graph, see Transition_graph_of_energy_landscape_builders .
In the following we provide two comparison methods: the first one deals with features of the basins only, while the second one additionally exploits the information on transitions.
Basins and their masses.Consider the basin of a local minimum. Abusing notations, the footprint of the basin in the conformational space is also denoted . Denoting is the Boltzmann constant and the partition function.
For small systems, the mass of the system may be obtained by integrating the Boltzmann factor over the basin region, namely:
If minima are found by optimization (e.g., basin hopping), the can be estimated using the eigenvalues of the Hessian (curvature) matrix evaluated at those points [141] . Such a calculation amounts to focusing on the vibrational entropy of the system.
If, on the other hand, the samples are obtained from a thermodynamic ensemble, as is typically the case for molecular dynamics and Monte Carlo procedures, Boltzmann weighting is automatically satisfied; the basin weight can correspondingly be estimated from the number of points falling in the basin region:
Consider two landscapes, for which masses of the basins have been computed.
For the sake of exposure, we call these two landscapes the source landscape and the demand landscape . We also denote and , respectively, their numbers of basins.
The local minima and the associated basins of the source landscape are denoted and , respectively. For the demand landscape, local minima and basins are denoted and .
We also assume that the weights of the basins have been computed. Denoting these and for a source and demand basin respectively, we define the sum of weights and . Finally, so that the distance between the two aforementioned local minima is .
To compare the two landscapes, we use the earth mover distance [125] , which is a particular case of mass transportation [138] . Intuitively, the technique fills basins in the target (aka demand) landscape using mass from basins of the source landscape. A basin from can be split into several parts, and equivalently, a basin from can be filled from several basins from (Fig. figcomparisonofbasinsenergylandscapes). Denote the mass from moved into . The cost of moving units of mass depends linearly on the distances between the minima associated with and associated with . A transport plan from to is defined by triples , with . Note that there are at most such triples.
Finding the optimal i.e. least cost transport plan amounts to solving the following linear program (LP):
The first equation is the linear functional to be minimized, while the remaining ones define linear constraints. In particular, the second one expresses the fact that every basin from need to be filled, while the third one indicates that a basin from cannot provide more than it contains. (To simplify matters, we have assumed that . Handling the case poses no difficulty, and the reader is referred to [125] .)
Based on this linear program, we introduce the total number of edges, the total flow, the total cost, and their ratio, known as the earth mover distance [125] :
Comparing two energy landscapes and The landscape is partitioned into the basins associated to its local minima (three of them on this example), and likewise for (four local minima). Comparing the landscapes is phrased as a mass transportation problem on the bipartite graph defined by the two sets of minima. Note that sets of connected basins from the top landscape are mapped to connected basins of the bottom landscape. 
The previous comparison ignores transitions between local minima. To take these connections into account, we modify the method by imposing connectivity constraints to transport plans. To see whether a transport plan is valid, pick any connected subgraph from – that is connects selected local minima in . Let be the set of vertices from such that for each vertex in , there is at least one edge emanating from a vertex of the subgraph with . The transport plan is called valid iff the subgraph is also connected. We summarize this discussion with the following definition–see also (Fig. figemdLPnotedgeconnectivityviolated}:
An important remark is the following: a transport plan respecting connectivity constraints may not fully satisfy the demand. (It can actually be shown that there exist instance such that no transport plan fully satisfies the demand.)
Since EMDCC does not necessarily admit a solution fully satisfying the demand, we define the problem Earth Mover Distance with Cost and Connectivity Constraints :
Note that the previous definition calls for two algorithms:
algorithm : computes a transport plan given a upper bound on the transport cost.
The latter algorithm actually has 2 recursion modes. To see which, assume that an interval of possible costs is given. The two modes are:
Recursion mode: refined. Given two costs and , is called three times for the different costs , , and . Then:
a) If the two total volumes of flow with cost and are different, then is called with the interval .
Recursion mode: coarse. Similar to refined mode, except that only option b) is considered.
The solution of the linear program may not satisfy connectivity constraints A transport plan between a source and a demand graph each consisting of a linear chain of four vertices. The vertices of the edge of the source graph export towards the vertices and of the demand graph. The subgraph of the demand graph induced by these vertices is not connected – there is no edge linking to . 
The input are transition graphs of energy landscapes of the same molecule. They are produced by programs of the application Transition_graph_of_energy_landscape_builders, and stored in XML archives. A calculation is launched as follows:
> sblenergylandscapecomparisoneuclid.exe transitiongraph data/himmelbleau_transition_graph_noisy.xml transitiongraph data/himmelbleau_transition_graph_simplified.xml withconnectivityconstraints symmetricmode directory results verbose outputprefix log
File Name  Description 
XML transition graph  Transition graph of noisy energy landscape of Himmelbleau function 
XML transition graph  Transition graph of energy landscape of Himmelbleau function 
Preview  File Name  Description 
General: log file  
Log file  Log file containing high level information on the run of sblenergylandscapecomparisoneuclid.exe  
Module EMD comparisons: Earth Mover distance with and without connectivity constraints  
EMD XML archive  Low level information on the run of EMD from source to target  
EMD XML archive  Low level information on the run of EMD from target to source  
EMD graph graphviz file  Graph representation of EMD from source to target (see Graphviz)  
EMD graph graphviz file  Graph representation of EMD from target to source (see Graphviz) 
The input are transition graphs of energy landscapes of the same BLN 69 molecule. They are produced by programs of the application Transition_graph_of_energy_landscape_builders, and stored in XML archives. A calculation is launched as follows:
> sblenergylandscapecomparisonlrmsd.exe transitiongraph data/bln69_transition_graph.xml transitiongraph data/bln69_transition_graph.xml withconnectivityconstraints directory results verbose outputprefix log
File Name  Description 
XML transition graph  Transition graph of BLN69 energy landscape 
Preview  File Name  Description 
General: log file  
Log file  Log file containing high level information on the run of sblenergylandscapecomparisonlrmsd.exe  
Module EMD comparisons: Earth Mover distance with and without connectivity constraints  
EMD XML archive  Low level information on the run of EMD from source to target  
EMD graph graphviz file  Graph representation of EMD from source to target (see Graphviz) 
Once the weights of basins have been computed, solving the linear program of Eq. (eqemdlp) has polynomial complexity [84] . Practically, various solvers can be used, e.g. the one from the Computational Geometry Algorithms Library [37] , lp_solve, the CPLEX solver from IBM, etc. In the following, the algorithm solving the linear program of Eq. (eqemdlp) is called .
Finding transport plans respecting connectivity constraints turns out to be a hard combinatorial problem [28] . The problem is not in APX, which means that if holds, then, no polynomial algorithm with constant approximation factor exist.
Following definition deftpcc, we provide two algorithms:
: computes a transport plan respecting connectivity constraints, given a bound C.
Using the previous algorithm(s), in a manner analogous to Eq. (eqemdlpdist), we define the total number of edges, the total flow, the total cost, and their ratio:
The programs of Energy_landscape_comparison described above are based upon generic C++ classes, so that additional versions can easily be developed.
In order to derive other versions, there are two important ingredients, that are the workflow class, and its traits class.
T_Energy_landscape_comparison_traits:
This class defines the main types used in the modules of the workflow. It is templated by the classes of the concepts required by these modules. This design makes it possible to use the same workflow within different(biophysical) contexts to make new programs. To use the workflow T_Energy_landscape_comparison_workflow , one needs to define:
GeometricKernel  Traits class defining various geometric objects, in particular the number type used for representing the coordinates of a conformation, and the base representation of a point in dimension D – see class CGAL::Cartesian_d from the CGAL library. 
ELSample  Representation of a conformation enriched by a height that is typically its energy. 
Distance  Functor returning the distance between two conformations. It has to define the types Point representing the conformation, and the type FT representing the returned value type. 
T_Energy_landscape_comparison_interface_workflow: