Structural Bioinformatics Library
Template C++ / Python API for developping structural bioinformatics applications.
User Manual

Authors: F. Cazals and T. Dreyfus and C. Roth

Energy_landscape_analysis

This package provides various methods to analyze sampled energy landscapes, with in particular various statistics on its transition and connectivity graphs.

Goals

The importance of recognizing archetypal energy landscapes has long been acknowledged [112], in particular in the context of assessing the dynamics of the systems studied [44] . In this perspective, this package proposes various methods to characterize sampled energy landscapes.

More specifically, consider a sampled energy landscape, that is, a set of conformations, each equipped with an energy–see general concepts in section Energy landscapes . The following functionalities are provided on transition graphs (see def. def-transition-graph) and disconnectivity graphs (see def. def-DG):

  • Transition graph: computation of topological properties (Betti Numbers) and metric properties (path lengths),

  • Transition graph: Morse based analysis, including persistence analysis for local minima and distribution of barriers' heights.

  • Disconnectivity graphs: computation of morphological properties (internal path length),

These functionalities are available in the programs $ \sblELAL $ and $ \sblELAE $.

Using sbl-energy-landscape-analysis-euclid.exe

Pre-requisites

Transition graphs: topology and Betti Numbers

A variety of algorithms can be used to explore energy landscapes, and their results can be turned into a compressed transition graph – definition def-transition-graph, using e.g. methods presented in the package Transition_graph_of_energy_landscape_builders.

In the sequel, we are interested in topological and geometric information characterizing such graphs.

Topological information. For a given graph, the number of connected components and the number of cycles in this graph are known as the first and second Betti numbers [54] , and are respectively denoted $ \beta_0 $ and $ \beta_1 $. We report them for the transition graph, and we also compute statistics on a per connected component basis.

The presence of cycles in a transition graph is especially interesting, since such cycles correspond to different paths connecting the same local minima. Cycles involving two edges are also of interest, as they may arise in two situations. The first corresponds to the situation in which a loop around a saddle is anchored in a single local minimum (Fig. fig-bump-middle-slope-bassin). The second is due to post-sampling processing techniques that may be used to group minima. For instance, in the BLN model, enantiomeric left and right handed helices may be considered as identical in certain contexts; identifying them as such results in the creation of a cycle.

Geometric information. Assume a distance $ \dCalC $ between conformation has been chosen. The simplest geometric information embodied in a transition graph is the length of its edges, and can be assessed with the statistical summary of Eq. (eq-stat-triple-edge-lengths).

In particular, the presence of long edges hints at challenging landscapes: even when the $ \dCalC $ between all pairs of local minima is small, one may have to follow a long valley to reach a saddle connecting two minima.

Describing the topology of a transition graph with the first and second Betti numbers
This fictitious examples features a TG with two connected components: the first one involves four local minima and five saddles; the second one involves two local minima and two saddles. The Betti numbers are: $ \beta_0=2 $ (two connected components) and $ \beta_1=3 $ (three cycles).

A loop around a saddle $ s $ may be anchored in a single local minimum $ m $ — a.k.a bump transition
Note that the dotted path is located behind the bump of this fictitious 2D landscape.

Transition graphs: paths from minima to saddles

Consider a transition graph, and assume that selected minima called landmarks have been singled out. For example, one may select the lowest local minima or the most persistent ones (package Morse_theory_based_analyzer).

For a given landmark, the distribution of distances from that landmark to its connected saddles provides an estimate on the size of the basin of that landmark [50] . Also, a large value for the smallest such distance provides information on the distance to be traveled to escape from the basin of that minimum. Similarly, the relative position of all landmarks is assessed by the statistical summary of all pairwise distances.

The shortest route between two landmarks is not always the easiest one, as in particular it may pass through steep sections of the landscape. We therefore offer the possibility to filter out a TG in several ways, before computing the statistics just defined
  • by removing vertices above a certain height
  • by removing edges whose gradient magnitude is beyond a certain threshold.


Morse based analysis

Morse theory and topological persistence provide a number of concepts and tools which are highly relevant to study energy landscapes. We describe a few of them in the sequel, and refer the reader to the terminology section Terminology and Concepts for a quick recap on central notions such as sublevel set, key saddle, Morse-Smale-Witten complex.

Of particular interest is the pairing between local minima and their key saddles (see def. def-key-saddle). This pairing, and more specifically the elevation drop between the key saddle and its minimum indeed defines the barrier height. The elevation of the minimum of that of its key saddle may also be seen as the birth date and the death date of the catchment basin of the local minimum: when scanning the landscape with a horizontal plane, the basin appears at the elevation of the local minimum, and disappears i.e. merges with another basin at the elevation of its key saddle. Putting together all points (birth date, death date) defines the so-called persistence diagram (PD), as detailed in the package Morse_theory_based_analyzer.

Persistence diagrams for sublevel sets. The analysis presented above for local maxima and super level sets applies directly for local minima and sublevel sets, yielding a persistence diagram whose points code local minima of the height field. The executive summary concerning the PD of a landscape goes as follows:

  • By construction, the lowest local minimum is not paired with any saddle. That is, if the height field has $ m $ local minima, its PD contains $ m-1 $ points.

  • A point of the PD corresponds to a pairing involving one local minimum and one saddle. Mathematically, these points are critical points of the height field: the minimum is an index $ 0 $ c.p. (its Hessian has zero negative eigenvalues), while the saddle is an index one c.p. (its Hessian has one negative eigenvalue).

  • All points of the PD are above the diagonal $ y=x $, and the distance from a point to the diagonal is a measure of the stability of the corresponding local minimum.

  • A PD can also be simplified by re-assigning the basins of non significant minima. That is, if the points of the PD are clearly separated, those near the diagonal can be removed by merging each such basin into the deeper basin it is connected to. Note that this simplification removes the least persistent basins first, as opposed to merging the saddles in a given energy slice. Note also that this simplification cancels one (saddle, minimum) pair at a time.

Disconnectivity graphs: morphology

The disconnectivity graph (DG) of a landscape describes the merge events of catchment basins, which occur at key saddles (see also def. def-DG).

The generic situation expected in a DG is that of a binary merge, namely a saddle is linked to two local minima. (Degenerate situations may occur, though: for example, a monkey saddle is linked to three local minima.) That is a generic DG is a binary tree: each internal node representing a saddle is associated with at most two children, and each leaf represents a local minimum.

The morphology of a complete binary tree can vary between that of a path (in which one child of each internal node is a leaf and the other is an internal node) and that of a perfectly balanced tree (in which each internal node has two children i.e. subtrees of the same size). To assess the structure of such trees, we use the so-called external path length (EPL) [82] . Defining the depth of a node as the number of edges required to reach it from the root, the EPL is the sum of depths of the leaves of the tree. Likewise, the internal path length (IPL) is the sum of the depth of internal nodes. If $ n $ is the number of internal nodes (index one saddles in our case), the EPL when the tree is a path is $ \binom{n}{2}+2n $, which we will write using the shorthand $ \eplPath $, while that of a perfectly balanced tree is $ n\log_2 n $. Since the latter, symmetrical tree is not a realistic reference, we use instead the EPL of a random binary search tree [82] , which is asymptotically equivalent to $ \eplRand \sim 2n\ln n $.

Internal and external path lengths of binary trees (IPL and EPL)
The IPL counts the number of edges traversed, from the root, to reach internal nodes (disks); in the context of disconnectivity graphs for energy landscapes, recall that these refer to key saddles. The EPL counts the number of edges traversed to reach the leaves of the tree; in the context of DG, these refer to local minima.

Input: Specifications and File Types

The main input is a transition graph of a sampled energy landscape, produced by the programs of the application Transition_graph_of_energy_landscape_builders as XML archives. If shortest paths with landmarks are desired, it is required to load a file listing the vertices of the input transition graph that are the landmarks. A strategy for producing a landmarks file consists on comparing the input transition graph with a reference transition graph containing only reference conformations. Programs of the application Energy_landscape_comparison produce a Graphviz file showing the correspondence between the vertices of two input transition graphs. This allows to visually select the vertices of an input graph that will be landmarks.

A calculation is launched as follows:

> sbl-energy-landscape-analysis-euclid.exe --transition-graph data/himmelbleau_transition_graph_noisy.xml --topology --Morse --landmarks --landmarks-filename data/himmelbleau_landmarks.txt --directory results --verbose --output-prefix --log
The main options of the program $ \sblELAE $ are:
–transition-graph string: xml file representing the compressed transition graph of a sampled energy landscape
–topology: run topology analysis over the uncompressed transition graph
–Morse: run Morse theory based analysis over the uncompressed transition graph with the potential energy as Morse function
–landmarks: run shortest paths analysis through input landmark vertices on the compressed transition graph
–landmarks-filename string: file listing the landmark indices in the compressed transition graph


File Name

Description

XML transition graph Transition graph of energy landscape of Himmelbleau function
Landmarks text file

Landmark vertices obtained after running programs of Energy_landscape_comparison

Input files for the run described in section Input: Specifications and File Types .

Output: Specifications and File Types

PreviewFile Name

Description

General: log file

Log file

Log file containing high level information on the run of $ \sblELAE $

Module topological analysis: topology and Betti numbers

Topological analysis xml file

Topological analysis serialized using Boost into an XML archive

Module landmark paths analysis: paths through landmarks on transition graphs

Landmarks xml file

List of shortest paths passing through the landmarks

Module persistence based topographical analysis: Morse based theory analysis, disconnectivy graph morphology

Morse Smale Witten chain complex xml file

Morse Smale Witten chain complex serialized using Boost into an XML archive

Stable manifold partition xml file

XML archive listing the samples repartition by persistent basin

Disconnectivity forest image file

Disconnectivity forest drawn in eps file format

Sorted basins plain text file

List of all basins sorted by persistence

Persistence diagram plot script

Gnuplot script for the persistence diagram

Persistence diagram image

Persistence diagram drawn by gnuplot in pdf file format

Persistences plain text file

List of all finite persistences

Persistence histogram plot script

R script for the persistence histogram

Persistence histogram image

Persistence histogram drawn by R in pdf file format

Output files for the runs described in section Input: Specifications and File Types, classified by modules – see Fig. fig-ela-workflow .

Using sbl-energy-landscape-analysis-lrmsd.exe

Input: Specifications and File Types

The main input is a transition graph of a sampled energy landscape of BLN 69, produced by the programs of the application Transition_graph_of_energy_landscape_builders as XML archives. A calculation is launched as follows:

> sbl-energy-landscape-analysis-lrmsd.exe --transition-graph data/bln69_transition_graph.xml --topology --Morse --directory results --verbose --output-prefix --log
The main options of the program $ \sblELAL $ are:
–transition-graph string: xml file representing the compressed transition graph of a sampled energy landscape
–topology: run topology analysis over the uncompressed transition graph
–Morse: run Morse theory based analysis over the uncompressed transition graph with the potential energy as Morse function


File Name

Description

XML transition graph

Transition graph of energy landscape of BLN 69

Input files for the run described in section Input: Specifications and File Types .

Output: Specifications and File Types

PreviewFile Name

Description

General: log file

Log file

Log file containing high level information on the run of $ \sblELAL $

Module topological analysis: topology and Betti numbers

Topological analysis xml file

Topological analysis serialized using Boost into an XML archive

Module persistence based topographical analysis: Morse based theory analysis, disconnectivy graph morphology

Morse Smale Witten chain complex xml file

Morse Smale Witten chain complex serialized using Boost into an XML archive

Stable manifold partition xml file

XML archive listing the samples repartition by persistent basin

Stable manifold partition xml file

XML archive listing the samples repartition by persistent basin

Disconnectivity forest image file

Disconnectivity forest drawn in eps file format

Sorted basins plain text file

List of all basins sorted by persistence

Persistence diagram plot script

Gnuplot script for the persistence diagram

Persistence diagram image

Persistence diagram drawn by gnuplot in pdf file format

Persistences plain text file

List of all finite persistences

Persistence histogram plot script

R script for the persistence histogram

Persistence histogram image

Persistence histogram drawn by R in pdf file format

Output files for the runs described in section Input: Specifications and File Types, classified by modules – see Fig. fig-ela-workflow .

Algorithms and Methods

Transition graphs: topology and Betti Numbers

Computing the first and second Betti number of a graph respectively require running a traversal of the graph and a union-find algorithm [54], [41] .

The classical construction of a transition graphs resorts to transition path sampling algorithms to connect local minima. We offer an alternative calculation merely requiring a set of samples and their quenched counterparts, based on a variant of the algorithm explained in section Morse based analysis .


Transition graphs: paths

Computing the distances between landmarks can be done using a variant of Dijkstra's shortest path algorithm, where one walks on the graph provided, but records and relaxes distances between landmarks only [41] .

To compute statistics on a filtered graph one applies Dijkstra's (or its variant with landmark), on a filtered graph which is the graph induced by the vertices which have not been filtered out on the height criterion, and where selected edges have possibly been removed.


Programmer's Workflow

The programs of Energy_landscape_analysis described above are based upon generic C++ classes, so that additional versions can easily be developed.

In order to derive other versions, there are two important ingredients, that are the workflow class, and its traits class.

The Traits Class

T_Energy_landscape_analysis_traits:

This class defines the main types used in the modules of the workflow. It is templated by the classes of the concepts required by these modules. This design makes it possible to use the same workflow within different(biophysical) contexts to make new programs. To use the workflow T_Energy_landscape_analysis_workflow , one needs to define:

  • what is the representation of a conformation (e.g number type of coordinates)
Template Parameters
GeometricKernelTraits class defining various geometric objects, in particular the number type used for representing the coordinates of a conformation, and the base representation of a point in dimension D – see class CGAL::Cartesian_d from the CGAL library
ELSampleRepresentation of a conformation enriched by a height that is typically its energy.
DistanceFunctor returning the distance between two conformations. It has to define the types Point representing the conformation, and the type FT representing the returned value type.

The Workflow Class

T_Energy_landscape_analysis_interface_workflow: