Structural Bioinformatics Library
Template C++ / Python API for developing structural bioinformatics applications.

Authors: F. Cazals and T. Dreyfus and C. Robert and A. Roth
In this package, we consider the problem of sampling a landscape, that is, a height function defined as a real-valued multivariate function. In biophysics, this function typically codes the potential energy of a molecular system, so that the landscape is the potential energy landscape (PEL).
The algorithms presented fall in the realm of Monte Carlo based methods, in a broad sense. The precise meaning of sampling shall depend on the specific algorithms presented, as one may want in particular to:
Locate the low-lying local minima of the function, with potentially a focus on the global minimum.
We provide a generic framework to develop such algorithms, with instantiations of classical algorithms as well as algorithms developed recently.
Of particular interest in the sequel are the following three algorithms:
Basin hopping (BH): the basin hopping algorithm, which performs a random walk in the space of local minima [96], [139].
T-RRT: the rapidly exploring random trees algorithm, which favors the exploration of yet unexplored space based upon the Voronoi bias [80].
Hybrid: a hybrid algorithm interleaving BH and T-RRT steps, detailed below.
Practically, our main focus is on landscapes encoding the potential energy of a molecular system. Yet, our algorithms also handle the case of a multivariate real-valued function void of biophysical meaning. The ability to handle such diverse cases owes to a generic framework, which we illustrate by providing explorers in three settings:
For a protein model known as BLN69, defined over a configuration space of dimension 3x69=207, see section BLN69,
For small peptides such as pentalanine using the Amber force field, defined over a configuration space of variable dimension (3x53=159 for pentalanine),
For mathematical test functions void of biophysical meaning, such as the Rastrigin function and a trigonometric terrain.
Needless to say, these are simple examples; instantiations of our exploration algorithms for other settings are easily derived by instantiating our generic classes.
Comparing the explorations on the trigonometric terrain. Top: trigonometric terrain to be explored. Bottom, from left to right: Basin Hopping, T-RRT and Hybrid. 50 minima (samples for T-RRT) were generated. For T-RRT, the initial displace delta was 0.5, while for Basin Hopping and Hybrid, it was 0.0001. For the Hybrid algorithm, Basin Hopping is used 10 times, then T-RRT is used once: the consecutive minima generated by Basin Hopping are linked with rays of the same color.
We first introduce a generic algorithm, thanks to which we shall instantiate all our algorithms.
Generic algorithm. A generic template (Algorithm alglandexpgeneric) describes the three aforementioned algorithms, and a number of variants. In addition to a stop condition (stating e.g. whether the number of conformations desired has been reached), this template consists of six main steps, namely:
Selection: a function selecting the conformation to be extended.
Extension: a function generating a new conformation from the conformation just selected. As we shall see below, an extension is an operation consisting of one or several calls to a move set operation (see definition blnmovesets).
Acceptance: a test of the Metropolis type, stating whether the new conformation is accepted or not.
Recording: a function incorporating the sample just accepted into a data structure representing the growing set of conformations.
Move-set parameter adaptation: adaptation of the move set parameters used by the extension step. (Intuitively, the extension should not call too many move steps.)
Acceptance parameter adaptation: update of the parameters governing the acceptance test, in particular temperature adaptation. Intuitively, the acceptance ratio should not fall below a prescribed value.
Parameters of the algorithm. By parameters of an algorithm, we refer to the variables and constants which shape its behavior. We distinguish four types of parameters (Fig. figlandexpexploparams):
Exploration variables: the core variables governing the exploration. The typical such variables are the temperature and the step size (the step size is used by the move set). These variables are generally global, but may also be made local so as to account for specific features of the landscape. For example, the values of the temperature and the step size may be locally adapted to the topography of the landscape. When necessary, in presenting the algorithms, we distinguish:
Tuning constants: multiplicative values used to evolve the exploration variables.
Control constants: the constants controlling how exploration variables are evolved. The typical constants are:
the target probability for the success of an extension.
Exploration parameters. Parameters, i.e. variables and constants, used by the algorithms.
Details for the selection step. The selection function, which selects the conformation to be extended, may require a distance threshold and an energy threshold. When the selection is performed amidst a large database of conformations, the execution speed may be improved using data structures providing fast retrieval of nearest neighbors. This is in particular the case if the least RMSD is used as a distance. In any case, to initialize the algorithm, a database (possibly reducing to a single conformation) of samples to select conformations from must be passed.
Details for the extension step. A move set is a unitary operation thanks to which a new conformation is generated from a given conformation, typically at a predefined distance called the step size. Three classical move sets are the so-called global move set, interpolation move set, and atomic move set, as detailed in section blnmovesets.
To perform an extension (see the extension function above), it might be necessary to iterate the generation step using the move set until some conditions are fulfilled, see the algorithms below.
Because the extension is in general an expensive operation, an adaptation mechanism for the parameters governing the move set is called for, whence the move-set parameter adaptation function.
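As an illustration, here are minimal Python sketches of the three classical move sets, under simplifying assumptions: conformations are flat coordinate lists, and the function names (global_move, atomic_move, interpolation_move) are ours, not the SBL API.

```python
import math
import random

def global_move(conf, delta, rng=random):
    """Global move set: perturb every coordinate by up to +/- delta."""
    return [x + rng.uniform(-delta, delta) for x in conf]

def atomic_move(conf, delta, rng=random):
    """Atomic move set: perturb a single randomly chosen coordinate (atom)."""
    i = rng.randrange(len(conf))
    out = list(conf)
    out[i] += rng.uniform(-delta, delta)
    return out

def interpolation_move(conf, target, delta):
    """Interpolation move set: step from conf toward target at distance delta
    (or reach target if it is closer than delta)."""
    d = math.dist(conf, target)
    if d <= delta:
        return list(target)
    t = delta / d
    return [x + t * (y - x) for x, y in zip(conf, target)]
```

For instance, interpolation_move([0, 0], [3, 4], 1.0) steps a unit distance along the segment toward (3, 4), yielding (0.6, 0.8).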
Details for the acceptance test. The prototypical acceptance test is the Metropolis criterion, which merely requires a temperature [91].
Details for the acceptance parameter adaptation. After the acceptance test is performed, the temperature can be updated through a multiplicative factor.
Yet, several default tuning modes are offered:
the default method consists in decreasing the temperature each time a sample is accepted, and increasing the temperature after a number of consecutive failures.
the probability method attempts to reach a given target acceptance ratio for the acceptance test. In this mode, at a fixed period of acceptance tests, the acceptance ratio observed since the beginning of the run is computed. A temperature update is triggered if this ratio is less than the target.
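The two tuning modes can be sketched as follows; the class names, default constants and exact update rules are hypothetical illustrations of the two strategies, not the SBL implementation.

```python
class DefaultTemperatureTuner:
    """Default mode: cool on every acceptance, reheat after a run of
    `max_failures` consecutive rejections. `factor` > 1 is a tuning constant."""
    def __init__(self, temperature, factor=1.1, max_failures=5):
        self.T = temperature
        self.factor = factor
        self.max_failures = max_failures
        self.failures = 0

    def update(self, accepted):
        if accepted:
            self.T /= self.factor
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.T *= self.factor
                self.failures = 0
        return self.T

class ProbabilityTemperatureTuner:
    """Probability mode: every `period` tests, raise T if the acceptance ratio
    observed since the beginning of the run is below `target`."""
    def __init__(self, temperature, target=0.5, period=10, factor=1.1):
        self.T = temperature
        self.target = target
        self.period = period
        self.factor = factor
        self.accepted = 0
        self.total = 0

    def update(self, accepted):
        self.total += 1
        self.accepted += accepted
        if self.total % self.period == 0 and self.accepted / self.total < self.target:
            self.T *= self.factor
        return self.T
```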
Details for the record step.
In recording conformations, we face two cases which are detailed below:
Without filtering data structure: the case of BH. If no filtering is carried out during the execution, duplicates are easily sorted out upon termination with a script, e.g.:
#!/bin/bash
# Deduplicate samples ($1.txt) and their energies ($1_energies.txt).
rm -f "$1_uniq.txt"; touch "$1_uniq.txt"
rm -f "$1_energies_uniq.txt"; touch "$1_energies_uniq.txt"
zero=0
i=1
while read p; do
  count=$(grep -c "$p" "$1_uniq.txt")
  if [ "$count" -eq "$zero" ]; then
    echo "$p" >> "$1_uniq.txt"
    sed "${i}q;d" "$1_energies.txt" >> "$1_energies_uniq.txt"
  fi
  ((i++))
done < "$1.txt"
Note, however, that without filtering, the number of distinct minima / samples visited is not guaranteed during the run.
With filtering data structure: the case of T-RRT and Hybrid.
As mentioned above, filtering consists in identifying duplicates, at a sublinear cost with respect to the conformations already identified. We distinguish filtering on the energy and on distances.
Filtering on energy commonly consists of checking that the energy of a newly found conformation differs by at least a prescribed amount from those collected so far. To be performed in logarithmic time, this check requires a dictionary storing conformations sorted by their energies.
Filtering on energies is in general not sufficient though, since the set, in conformational space, of conformations having a prescribed energy is generically a manifold of dimension n-1, with n the dimension of the conformational space.
To filter on distances (e.g., using the least root-mean-square deviation, lRMSD), several alternatives are possible. One approach consists of binning energies, and assigning to each bin the local minima whose energy falls in it. Upon discovering a novel minimum, it is then sufficient to compute distances from the local minima stored in that bin, retaining the new candidate provided that its nearest neighbor is at least at some predefined distance. A second approach consists of using data structures supporting (approximate) nearest neighbor queries [126]. For queries under the lRMSD in particular, one may use metric trees [142] or generalizations such as proximity forests [114]. This strategy also allows guaranteeing a minimum distance between any two conformations.
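The first approach, combining an energy window with distance checks, can be sketched as follows. Conformations are coordinate lists, the class name MinimaFilter and its parameters are ours, and a bisect-maintained sorted list stands in for the dictionary keyed by energies (lookups are logarithmic; distance checks only run against the usually few minima in the matching window).

```python
import bisect
import math

class MinimaFilter:
    """Sketch: keep minima sorted by energy; a candidate is a duplicate if some
    stored minimum has both |energy difference| <= eps and distance < d_min."""
    def __init__(self, eps, d_min, distance=math.dist):
        self.eps = eps
        self.d_min = d_min
        self.distance = distance
        self.energies = []   # sorted energies
        self.minima = []     # conformations, aligned with self.energies

    def insert(self, conf, energy):
        """Insert conf unless it duplicates a stored minimum; return True if new."""
        lo = bisect.bisect_left(self.energies, energy - self.eps)
        hi = bisect.bisect_right(self.energies, energy + self.eps)
        for other in self.minima[lo:hi]:
            if self.distance(conf, other) < self.d_min:
                return False  # duplicate: close in both energy and distance
        i = bisect.bisect(self.energies, energy)
        self.energies.insert(i, energy)
        self.minima.insert(i, conf)
        return True
```

In practice the distance functor would be an lRMSD-style comparison rather than the plain Euclidean distance used here.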
Details for the stop condition. Four different stop criteria are currently offered, including:
the number of samples criterion: the algorithm stops when a prescribed number of registered conformations is reached.
the landmarks criterion: the algorithm stops once the specified landmarks have been visited, up to a distance criterion.
the quenched to landmarks criterion: each sample is quenched, and the algorithm halts when the specified local minima have been visited, up to a distance criterion.
This section provides a minimalistic example based on the program sbllandexphybridBHTRRTatomic.exe, showing how to run the explorer and exploit its results. Details to go beyond this simple example are provided in the subsequent sections.
Goals. Given a molecule and a force field defining its PEL (e.g. CHARMM or Amber), the explorer identifies non-redundant local minima and their energies.
Example. The simplest way to explore this PEL is to use the executable sbllandexphybridBHTRRTatomic.exe, which implements the so-called hybrid algorithm detailed below. To launch the exploration of the PEL defined by the Amber force field, one runs:
./sbllandexphybridBHTRRTatomic.exe v1 ffparametersfilename amber.xml cspdbfiles <path/to/pdb/file>
This command iteratively provides information on the local minima discovered. For each local minimum, the five pieces of information provided read as follows:
... #2  E: 83.168395  T: 1.900000  D: 0.200000  98 remaining samples ...
Input.
The input options are, in order:
Output.
During the run, the program produces two important files, updated each time a new minimum is recorded:
sbllandexphybridBHTRRTatomic__minima.txt: conformations of these minima in the Point_d format, i.e. with the number of coordinates followed by the coordinates themselves.
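Assuming whitespace-separated fields, one line of such a minima file can be parsed with a short helper; read_point_d is a hypothetical name, not part of the SBL.

```python
def read_point_d(line):
    """Parse one Point_d record: the dimension followed by that many coordinates."""
    tokens = line.split()
    dim = int(tokens[0])
    coords = [float(t) for t in tokens[1:1 + dim]]
    if len(coords) != dim:
        raise ValueError(f"expected {dim} coordinates, got {len(coords)}")
    return coords
```

For example, read_point_d("3 1.0 2.0 3.0") returns the conformation [1.0, 2.0, 3.0].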
Main options and parameters.
The main parameters are the temperature and the delta (step size) parameters, also called the exploration parameters. One can tune them directly as follows:
It is also possible to tune directly each step of the algorithm, as seen in section A generic / template algorithm .
Other parameters.
It is also possible to install a VMD plugin to run explorations from VMD and visualize the produced minima live; see Installing VMD plugins for more details.
Numerous options are available to tune the algorithms, as detailed below.
Instantiation for BH. We now present the functions used by BH, explaining in particular how the parameters of the algorithm (temperature, step size) are adapted:
Selection: for BH, this function simply returns the last local minimum generated.
Extension: a function applying a move set to the conformation to be extended, then quenching it. This process is iterated until a novel local minimum is discovered; equivalently, the conformation generated must leave the catchment basin of the local minimum being extended.
Acceptance: the Metropolis transition test.
Recording: adds the local minimum just accepted to the path of local minima being generated.
Move-set parameter adaptation: in the variant used here, the step size is adapted so as to reach a target probability, representing the fraction of extended samples escaping the current basin. The update is performed at a predetermined frequency with respect to the number of extension attempts.
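Putting the BH ingredients above together on a one-dimensional toy function gives the following sketch. It is didactic only: a crude fixed-rate gradient descent stands in for the quench, the novelty test on the minima and the step-size adaptation are omitted, and all names are ours.

```python
import math
import random

def quench(x, grad, lr=0.01, steps=500):
    """Crude gradient descent standing in for the quench (local minimization)."""
    for _ in range(steps):
        x -= lr * grad(x)
    return x

def basin_hopping(x0, f, grad, n_minima, temperature=1.0, delta=0.5,
                  rng=random.Random(0)):
    """Sketch of BH: perturb the last minimum, quench, accept by Metropolis."""
    minima = [quench(x0, grad)]
    while len(minima) < n_minima:
        current = minima[-1]
        candidate = quench(current + rng.uniform(-delta, delta), grad)
        dE = f(candidate) - f(current)
        if dE <= 0 or rng.random() < math.exp(-dE / temperature):
            minima.append(candidate)
    return minima
```

On a single-basin function such as f(x) = x^2, every quench collapses onto the global minimum, which is why the real algorithm additionally iterates the extension until a novel basin is reached.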
The basin hopping algorithm. The algorithm performs a walk in the space of local minima. Upon generating a new conformation by applying a move set to a local minimum, this conformation is minimized by following the negative gradient of the energy, yielding a new local minimum.
> sbllandexpBHRastrigin.exe v1 r o l initsample data/origin2D.txt movesetmode global adaptivedisplacedelta 5 temperaturetuningmode probability nbsamples 100 directory results
File Name  Description 
2d points text file  List of 2D points for initializing the engine 
Preview  File Name  Description 
General: log file  
Log file  Log file containing high level information on the run of sbllandexpBHRastrigin.exe  
Samples  
Samples file  Text file listing all the visited conformations, one conformation per line  
Minima file  Text file listing all the visited local minima, one conformation per line 
Instantiation for T-RRT. The method generates a tree spanning the conformational space. Following [80], the steps are:
Selection: this function generates uniformly at random a configuration in the conformational space, and returns the local minimum nearest to it. This strategy promotes samples with large Voronoi regions, intuitively corresponding to regions with low sampling density, as a sample is selected with a probability proportional to the volume of its Voronoi region [89]. We call this property the Voronoi bias (selection) rule hereafter.
Extension: the new conformation is obtained by linearly interpolating between the selected minimum and the random configuration, at a predefined distance from the former. Note that by the convexity of Voronoi regions, the new sample remains within the Voronoi cell of the selected minimum. It is also possible to iteratively generate a new conformation until it lies in the current Voronoi cell, i.e. in the Voronoi cell of the sample chosen for extension, as one may indeed generate a new conformation outside this Voronoi cell.
Acceptance: Metropolis test as for BH. (See also remark rmkenergyrange.)
Recording: the accepted sample is used to add an edge to the random tree grown.
Move-set parameter adaptation: in T-RRT, the step size is constant.
NB: the previous observation is sufficient when using an interpolation scheme to extend the point selected. (If the Euclidean distance is used, by convexity of Voronoi regions, any point on the line segment lies in the Voronoi cell of the selected minimum.) The situation differs when using a move set, since the point generated might be located outside the Voronoi region targeted. In that case, the extension requires generating samples until a point in the Voronoi region targeted has been found. This may require a step size adaptation, a situation analogous to that faced in BH.
Acceptance parameter adaptation: upon accepting a sample with increasing energy, the temperature is tuned using the range of energies of samples already inserted into the tree (supplemental algorithm algadaptTdalgoBH and [52]).
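A toy sketch of the T-RRT loop on a boxed low-dimensional function follows, combining the Voronoi bias selection (nearest stored sample to a random target), the interpolation extension, and a Metropolis test with fixed temperature. The bounding box, constants and all names are assumptions; the energy-range temperature adaptation is omitted for brevity.

```python
import math
import random

def voronoi_bias_select(samples, rng, low=-2.0, high=2.0):
    """Draw a random target in the boxed space and return the nearest stored
    sample: a sample is thus picked with probability proportional to the
    volume of its Voronoi region (the Voronoi bias)."""
    target = [rng.uniform(low, high) for _ in samples[0]]
    near = min(samples, key=lambda s: math.dist(s, target))
    return near, target

def interpolate(conf, target, delta):
    """Step from conf toward target at distance delta (clamped at target)."""
    d = math.dist(conf, target)
    if d <= delta:
        return list(target)
    t = delta / d
    return [x + t * (y - x) for x, y in zip(conf, target)]

def trrt(f, x0, n_samples, delta=0.5, temperature=2.0, rng=random.Random(0)):
    """Sketch of T-RRT: grow a tree of (parent, child) edges."""
    tree = [list(x0)]
    edges = []
    while len(tree) < n_samples:
        near, target = voronoi_bias_select(tree, rng)
        cand = interpolate(near, target, delta)
        dE = f(cand) - f(near)
        if dE <= 0 or rng.random() < math.exp(-dE / temperature):
            edges.append((near, cand))
            tree.append(cand)
    return tree, edges
```

Because the extension is a linear interpolation, every edge has length at most delta and each new sample stays in the Voronoi cell of its parent, as noted above.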
Temperature adaptation taking into account the maximal energy variation of the conformational ensemble generated.
> sbllandexpTRRTRastrigin.exe v1 r o l initsample data/origin2D.txt movesetmode global nbsamples 100 directory results
File Name  Description 
2d points text file  List of 2D points for initializing the engine 
Preview  File Name  Description 
General: log file  
Log file  Log file containing high level information on the run of sbllandexpTRRTRastrigin.exe  
Samples  
Samples file  Text file listing all the visited conformations, one conformation per line  
Minima file  Text file listing all the visited local minima, one conformation per line 
Intuitively, our hybrid algorithm may be seen as a variant of BH in which, instead of systematically extending the local minimum just found, the Voronoi bias rule from T-RRT is used once every so many steps to select the local minimum to be extended. In the following, the corresponding period is called the switch parameter. Similarly to T-RRT, the algorithm generates a tree connecting local minima. In the hybrid algorithm, however, this tree can be partitioned into threads, each thread consisting of local minima generated by a contiguous sequence of BH extensions and quenches (Fig. figlandexpalgoHY).
More precisely, the hybrid algorithm uses the following five functions to instantiate the generic algorithm alglandexpgeneric:
Selection: once every switch-parameter steps, the Voronoi bias is used to select the local minimum to be extended. For the remaining steps, the sample to be extended is the last local minimum generated.
Extension: the extension strategy is identical to that used in BH. (NB: this departs from T-RRT in that here a move set is used, as opposed to an interpolation scheme.)
Acceptance: the acceptance test is identical to that used in T-RRT.
Recording: the record step is identical to that used in T-RRT.
Parameter adaptation: upon starting a new thread, i.e. upon selecting a sample to extend using the Voronoi bias rule, the step size is reinitialized.
The following important remarks are in order:
The switch parameter can be optimized to best exploit features of a given PEL [124] .
For the extension strategy, using a move set (rather than the interpolation used in T-RRT) provides much more flexibility. This turns out to be critical to focus the exploration on low energy regions.
In the parameter update step, resetting the parameters upon starting a new thread is also critical, as this allows locally adapting the parameters to the landscape. Because the hybrid algorithm explores very diverse regions of the PEL thanks to the Voronoi bias rule, failure to reinitialize may result in using inappropriate parameter values. We note in passing that an alternative to the reset was tested, in which each local minimum was stored with the parameters it was discovered with; see comments on the memory mechanism in the Results section.
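The thread mechanics can be sketched with two small helpers: a selection rule switching between the Voronoi bias and the last minimum, and a step-size reset on new threads. The names, signatures and sampling box are hypothetical illustrations, not the SBL API.

```python
import math
import random

def hybrid_select(minima, step_count, switch, rng, low=-2.0, high=2.0):
    """Sketch of the hybrid selection rule: once every `switch` steps, pick a
    minimum by the Voronoi bias (nearest to a random target), starting a new
    thread; otherwise continue from the last minimum generated."""
    new_thread = step_count % switch == 0
    if new_thread:
        target = [rng.uniform(low, high) for _ in minima[0]]
        chosen = min(minima, key=lambda s: math.dist(s, target))
    else:
        chosen = minima[-1]
    return chosen, new_thread

def maybe_reset_delta(new_thread, delta, delta_init):
    """Sketch of the parameter update: reset the step size on a new thread."""
    return delta_init if new_thread else delta
```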
The hybrid algorithm consists of interleaving calls to BH and T-RRT. In this cartoon, starting from the initial configuration (black dot), one step of T-RRT is performed after every three steps of BH (i.e., a switch parameter of three). The local minima generated by a contiguous sequence of extensions and quenches are represented with the same color. Each such set of local minima is also called a thread. Thus, in this example, the algorithm has generated four threads of size three.
We illustrate the hybrid algorithm using the four landscapes mentioned in the Introduction:
> sbllandexphybridBHTRRTRastrigin.exe v1 r o l initsample data/origin2D.txt nbmainselector 10 movesetmode global nbsamples 100 directory results
> sbllandexphybridBHTRRTtrigoterrain.exe v1 r o l initsample data/origin2D.txt nbmainselector 100 movesetmode global nbsamples 1000 directory results
> sbllandexphybridBHTRRTBLN.exe v1 r o l initsample data/bln69_global_minimum.txt nbmainselector 10 movesetmode atomic nbsamples 100 directory results ffparametersfilename data/bln69.xml
> sbllandexphybridBHTRRTamber.exe v1 r o l initsample data/ala5.txt nbmainselector 10 movesetmode atomic nbsamples 100 directory results cspdbfiles data/ala5.pdb ffparametersfilename data/amber.xml fflengthinnanometer ffenergyinkilojoule ffchargeincoulomb
File Name  Description 
2d points text file  List of 2D points for initializing the engine 
Ddimensional points text file  Global minimum of the potential energy function of the BLN69 energy landscape 
Ddimensional points text file  A conformation of pentalanine 
PDB file  Pentalanine PDB file for covalent structure building 
Force Field XML file  Amber 99SB force field parameters stored in the ffxml format. 
Preview  File Name  Description 
General: log file  
Log file  Log file containing high level information on the run of sbllandexphybridBHTRRTRastrigin.exe  
Log file  Log file containing high level information on the run of sbllandexphybridBHTRRTtrigoterrain.exe  
Log file  Log file containing high level information on the run of sbllandexphybridBHTRRTBLN.exe  
Samples  
Samples file  Text file listing all the visited conformations of the Rastrigin function, one conformation per line  
Minima file  Text file listing all the visited local minima of the Rastrigin function, one conformation per line  
Samples file  Text file listing all the visited conformations of the trigonometric function, one conformation per line  
Minima file  Text file listing all the visited local minima of the trigonometric function, one conformation per line  
Samples file  Text file listing all the visited conformations of BLN69, one conformation per line  
Minima file  Text file listing all the visited local minima of BLN69, one conformation per line 
All algorithms presented in this package follow the generic template of fig. alglandexpgeneric.
They use different ingredients, including (i) the calculation of the potential energy (see packages Molecular_covalent_structure, Molecular_potential_energy), (ii) the evaluation of the gradient of a real-valued function (see package Real_value_function_minimizer), and (iii) spatial data structures for nearest neighbor searches (see package Spatial_search).
Note that the computation of the gradient is only semi-automatized, as mentioned in Remark 1 of section A generic / template algorithm. In particular, the user should provide their own implementation of the gradient of the energy function, whatever its source. An option is to use an automatic differentiation tool such as Tapenade to generate this gradient.
In addition to the usual external libraries used in the SBL (CGAL, Boost, Eigen), this package uses two other external libraries:
The programs of Landscape_explorer described above are based upon generic C++ classes, so that additional versions can easily be developed.
In order to derive other versions, two ingredients are important: the workflow class and its traits class.
This class defines the main types used in the modules of the workflow. It is templated by the classes of the concepts required by these modules. This design makes it possible to use the same workflow within different (biophysical) contexts to make new programs. To use the workflow T_Landscape_explorer_workflow, one needs to define:
ExplorationAlgorithm  Algorithm for exploring a landscape, such as basin hopping (see T_Exploration_algorithm_basins_hopping). It provides base functionalities for selecting, extending and inserting a sample in a database of conformations. 
ExplorationParameters  Data structure grouping the non-contextual parameters used during the algorithm (i.e. parameters that do not depend on the nature of the class ExplorationAlgorithm). It manages the way parameters are loaded, stored, updated and accessed. For example, the class T_Exploration_parameters_default stores each parameter uniquely, while the class T_Exploration_parameters_with_memory stores one set of parameters for each registered sample. 
DistanceFunction  Functor taking two conformations as input and returning the distance between the two conformations. It has to define the types Point (alias for the conformation) and FT (alias for the number type). An example instantiation is the class SBL::CSB::T_Least_RMSD_cartesian from the package Molecular_distances . 
QuencherSample  Functor taking a conformation as input and returning, if possible, a minimized version of the input conformation. It uses the package Real_value_function_minimizer for the minimization. 
T_Landscape_explorer_workflow: