Structural Bioinformatics Library
Template C++ / Python API for developping structural bioinformatics applications.

Authors: F. Cazals and T. Dreyfus and C. Roth
This package provides various methods to analyze sampled energy landscapes, with in particular various statistics on its transition and connectivity graphs.
The importance of recognizing archetypal energy landscapes has long been acknowledged [112], in particular in the context of assessing the dynamics of the systems studied [44] . In this perspective, this package proposes various methods to characterize sampled energy landscapes.
More specifically, consider a sampled energy landscape, that is, a set of conformations, each equipped with an energy–see general concepts in section Energy landscapes . The following functionalities are provided on transition graphs (see def. deftransitiongraph) and disconnectivity graphs (see def. defDG):
Transition graph: computation of topological properties (Betti Numbers) and metric properties (path lengths),
Transition graph: Morse based analysis, including persistence analysis for local minima and distribution of barriers' heights.
Disconnectivity graphs: computation of morphological properties (internal path length),
These functionalities are available in the programs and .
A variety of algorithms can be used to explore energy landscapes, and their results can be turned into a compressed transition graph – definition deftransitiongraph, using e.g. methods presented in the package Transition_graph_of_energy_landscape_builders.
In the sequel, we are interested in topological and geometric information characterizing such graphs.
Topological information. For a given graph, the number of connected components and the number of cycles in this graph are known as the first and second Betti numbers [54] , and are respectively denoted and . We report them for the transition graph, and we also compute statistics on a per connected component basis.
The presence of cycles in a transition graph is especially interesting, since such cycles correspond to different paths connecting the same local minima. Cycles involving two edges are also of interest, as they may arise in two situations. The first corresponds to the situation in which a loop around a saddle is anchored in a single local minimum (Fig. figbumpmiddleslopebassin). The second is due to postsampling processing techniques that may be used to group minima. For instance, in the BLN model, enantiomeric left and right handed helices may be considered as identical in certain contexts; identifying them as such results in the creation of a cycle.
Geometric information. Assume a distance between conformation has been chosen. The simplest geometric information embodied in a transition graph is the length of its edges, and can be assessed with the statistical summary of Eq. (eqstattripleedgelengths).
In particular, the presence of long edges hints at challenging landscapes: even when the between all pairs of local minima is small, one may have to follow a long valley to reach a saddle connecting two minima.
Describing the topology of a transition graph with the first and second Betti numbers This fictitious examples features a TG with two connected components: the first one involves four local minima and five saddles; the second one involves two local minima and two saddles. The Betti numbers are: (two connected components) and (three cycles). 
A loop around a saddle may be anchored in a single local minimum — a.k.a bump transition Note that the dotted path is located behind the bump of this fictitious 2D landscape. 
Consider a transition graph, and assume that selected minima called landmarks have been singled out. For example, one may select the lowest local minima or the most persistent ones (package Morse_theory_based_analyzer).
For a given landmark, the distribution of distances from that landmark to its connected saddles provides an estimate on the size of the basin of that landmark [50] . Also, a large value for the smallest such distance provides information on the distance to be traveled to escape from the basin of that minimum. Similarly, the relative position of all landmarks is assessed by the statistical summary of all pairwise distances.
Morse theory and topological persistence provide a number of concepts and tools which are highly relevant to study energy landscapes. We describe a few of them in the sequel, and refer the reader to the terminology section Terminology and Concepts for a quick recap on central notions such as sublevel set, key saddle, MorseSmaleWitten complex.
Of particular interest is the pairing between local minima and their key saddles (see def. defkeysaddle). This pairing, and more specifically the elevation drop between the key saddle and its minimum indeed defines the barrier height. The elevation of the minimum of that of its key saddle may also be seen as the birth date and the death date of the catchment basin of the local minimum: when scanning the landscape with a horizontal plane, the basin appears at the elevation of the local minimum, and disappears i.e. merges with another basin at the elevation of its key saddle. Putting together all points (birth date, death date) defines the socalled persistence diagram (PD), as detailed in the package Morse_theory_based_analyzer.
Persistence diagrams for sublevel sets. The analysis presented above for local maxima and super level sets applies directly for local minima and sublevel sets, yielding a persistence diagram whose points code local minima of the height field. The executive summary concerning the PD of a landscape goes as follows:
By construction, the lowest local minimum is not paired with any saddle. That is, if the height field has local minima, its PD contains points.
A point of the PD corresponds to a pairing involving one local minimum and one saddle. Mathematically, these points are critical points of the height field: the minimum is an index c.p. (its Hessian has zero negative eigenvalues), while the saddle is an index one c.p. (its Hessian has one negative eigenvalue).
All points of the PD are above the diagonal , and the distance from a point to the diagonal is a measure of the stability of the corresponding local minimum.
The disconnectivity graph (DG) of a landscape describes the merge events of catchment basins, which occur at key saddles (see also def. defDG).
The generic situation expected in a DG is that of a binary merge, namely a saddle is linked to two local minima. (Degenerate situations may occur, though: for example, a monkey saddle is linked to three local minima.) That is a generic DG is a binary tree: each internal node representing a saddle is associated with at most two children, and each leaf represents a local minimum.
The morphology of a complete binary tree can vary between that of a path (in which one child of each internal node is a leaf and the other is an internal node) and that of a perfectly balanced tree (in which each internal node has two children i.e. subtrees of the same size). To assess the structure of such trees, we use the socalled external path length (EPL) [82] . Defining the depth of a node as the number of edges required to reach it from the root, the EPL is the sum of depths of the leaves of the tree. Likewise, the internal path length (IPL) is the sum of the depth of internal nodes. If is the number of internal nodes (index one saddles in our case), the EPL when the tree is a path is , which we will write using the shorthand , while that of a perfectly balanced tree is . Since the latter, symmetrical tree is not a realistic reference, we use instead the EPL of a random binary search tree [82] , which is asymptotically equivalent to .
Internal and external path lengths of binary trees (IPL and EPL) The IPL counts the number of edges traversed, from the root, to reach internal nodes (disks); in the context of disconnectivity graphs for energy landscapes, recall that these refer to key saddles. The EPL counts the number of edges traversed to reach the leaves of the tree; in the context of DG, these refer to local minima. 
The main input is a transition graph of a sampled energy landscape, produced by the programs of the application Transition_graph_of_energy_landscape_builders as XML archives. If shortest paths with landmarks are desired, it is required to load a file listing the vertices of the input transition graph that are the landmarks. A strategy for producing a landmarks file consists on comparing the input transition graph with a reference transition graph containing only reference conformations. Programs of the application Energy_landscape_comparison produce a Graphviz file showing the correspondence between the vertices of two input transition graphs. This allows to visually select the vertices of an input graph that will be landmarks.
A calculation is launched as follows:
> sblenergylandscapeanalysiseuclid.exe transitiongraph data/himmelbleau_transition_graph_noisy.xml topology Morse landmarks landmarksfilename data/himmelbleau_landmarks.txt directory results verbose outputprefix log
File Name  Description 
XML transition graph  Transition graph of energy landscape of Himmelbleau function 
Landmarks text file  Landmark vertices obtained after running programs of Energy_landscape_comparison 
Preview  File Name  Description 
General: log file  
Log file  Log file containing high level information on the run of  
Module topological analysis: topology and Betti numbers  
Topological analysis xml file  Topological analysis serialized using Boost into an XML archive  
Module landmark paths analysis: paths through landmarks on transition graphs  
Landmarks xml file  List of shortest paths passing through the landmarks  
Module persistence based topographical analysis: Morse based theory analysis, disconnectivy graph morphology  
Morse Smale Witten chain complex xml file  Morse Smale Witten chain complex serialized using Boost into an XML archive  
Stable manifold partition xml file  XML archive listing the samples repartition by persistent basin  
Disconnectivity forest image file  Disconnectivity forest drawn in eps file format  
Sorted basins plain text file  List of all basins sorted by persistence  
Persistence diagram plot script  Gnuplot script for the persistence diagram  
Persistence diagram image  Persistence diagram drawn by gnuplot in pdf file format  
Persistences plain text file  List of all finite persistences  
Persistence histogram plot script  R script for the persistence histogram  
Persistence histogram image  Persistence histogram drawn by R in pdf file format 
The main input is a transition graph of a sampled energy landscape of BLN 69, produced by the programs of the application Transition_graph_of_energy_landscape_builders as XML archives. A calculation is launched as follows:
> sblenergylandscapeanalysislrmsd.exe transitiongraph data/bln69_transition_graph.xml topology Morse directory results verbose outputprefix log
File Name  Description 
XML transition graph  Transition graph of energy landscape of BLN 69 
Preview  File Name  Description 
General: log file  
Log file  Log file containing high level information on the run of  
Module topological analysis: topology and Betti numbers  
Topological analysis xml file  Topological analysis serialized using Boost into an XML archive  
Module persistence based topographical analysis: Morse based theory analysis, disconnectivy graph morphology  
Morse Smale Witten chain complex xml file  Morse Smale Witten chain complex serialized using Boost into an XML archive  
Stable manifold partition xml file  XML archive listing the samples repartition by persistent basin  
Stable manifold partition xml file  XML archive listing the samples repartition by persistent basin  
Disconnectivity forest image file  Disconnectivity forest drawn in eps file format  
Sorted basins plain text file  List of all basins sorted by persistence  
Persistence diagram plot script  Gnuplot script for the persistence diagram  
Persistence diagram image  Persistence diagram drawn by gnuplot in pdf file format  
Persistences plain text file  List of all finite persistences  
Persistence histogram plot script  R script for the persistence histogram  
Persistence histogram image  Persistence histogram drawn by R in pdf file format 
Computing the first and second Betti number of a graph respectively require running a traversal of the graph and a unionfind algorithm [54], [41] .
Computing the distances between landmarks can be done using a variant of Dijkstra's shortest path algorithm, where one walks on the graph provided, but records and relaxes distances between landmarks only [41] .
The programs of Energy_landscape_analysis described above are based upon generic C++ classes, so that additional versions can easily be developed.
In order to derive other versions, there are two important ingredients, that are the workflow class, and its traits class.
T_Energy_landscape_analysis_traits:
This class defines the main types used in the modules of the workflow. It is templated by the classes of the concepts required by these modules. This design makes it possible to use the same workflow within different(biophysical) contexts to make new programs. To use the workflow T_Energy_landscape_analysis_workflow , one needs to define:
GeometricKernel  Traits class defining various geometric objects, in particular the number type used for representing the coordinates of a conformation, and the base representation of a point in dimension D – see class CGAL::Cartesian_d from the CGAL library 
ELSample  Representation of a conformation enriched by a height that is typically its energy. 
Distance  Functor returning the distance between two conformations. It has to define the types Point representing the conformation, and the type FT representing the returned value type. 
T_Energy_landscape_analysis_interface_workflow: