Structural Bioinformatics Library
Template C++ / Python API for developing structural bioinformatics applications.
Authors: F. Cazals and T. Dreyfus and C. Robert and A. Roth
In this package, we consider the problem of sampling a landscape, that is, a height function defined as a real-valued multivariate function. In biophysics, this function typically encodes the potential energy of a molecular system, so that the landscape is the potential energy landscape (PEL).
The algorithms presented fall in the realm of Monte Carlo based methods, in a broad sense. The precise meaning of sampling shall depend on the specific algorithms presented, as one may want in particular to:
Locate the low-lying local minima of the function, with potentially a focus on the global minimum.
We provide a generic framework to develop such algorithms, with instantiations both for classical algorithms and for recently developed ones.
Of particular interest in the sequel are the following three algorithms:
BH: the basin hopping algorithm, which performs a random walk in the space of local minima [118], [172].
T-RRT: the transition-based rapidly exploring random trees algorithm, which favors the exploration of yet unexplored space based upon the Voronoi bias [99].
Hybrid: an algorithm interleaving BH and T-RRT steps, detailed below.
Practically, our main focus is on landscapes encoding the potential energy of a molecular system. Yet, our algorithms also handle the case of a multivariate real-valued function void of biophysical meaning. The ability to handle such diverse cases owes to a generic framework, which we illustrate by providing explorers in three settings:
For a protein model known as BLN69, defined over a configuration space of dimension 3x69=207, see section BLN69,
For small peptides such as pentalanine using the Amber force field, defined over a configuration space of variable dimension (3x53=159 for pentalanine),
For a mathematical function void of biophysical meaning, such as the trigonometric terrain used in the Results section.
Needless to say, these are simple examples; instantiations of our exploration algorithms for other systems are easily derived by instantiating our generic classes.
Figure: Comparing the explorations on the trigonometric terrain. Top: the trigonometric terrain to be explored. Bottom, from left to right: Basin Hopping, T-RRT, and Hybrid. 50 minima (samples for T-RRT) were generated. For T-RRT, the initial displacement delta was 0.5, while for Basin Hopping and Hybrid, it was 0.0001. For the Hybrid algorithm, Basin Hopping is used 10 times, then T-RRT is used once; the consecutive minima generated by Basin Hopping are linked with rays of the same color.
We first introduce a generic algorithm, thanks to which we shall instantiate all our algorithms.
Generic algorithm. A generic template (Algorithm alg-landexp-generic) describes the three aforementioned algorithms, as well as a number of variants. In addition to a stop condition (stating, e.g., whether the desired number of conformations has been reached), this template consists of six main steps (a minimal sketch in Python follows the list), namely:
Selection: a function selecting the conformation to be extended.
Extension: a function generating a new conformation from the conformation just selected. As we shall see below, an extension consists of one or several calls to a move set operation (see definition bln-movesets).
Acceptance test: a test of the Metropolis type, stating whether the new conformation is accepted or not.
Record: a function incorporating the sample just accepted into a data structure representing the growing set of conformations.
Move-set parameter update: adaptation of the move set parameters used by the extension step. (Intuitively, the extension should not require too many move set calls.)
Acceptance parameter update: update of the parameters governing the acceptance test, in particular the temperature. Intuitively, the acceptance ratio should not fall below a prescribed value.
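To fix ideas, the template may be sketched as follows in Python. All names (explore, the quadratic energy, the tuning constants) are illustrative stand-ins rather than the SBL API, and the move set reduces to a uniform perturbation:

import math, random

def explore(E, x0, n_samples=100, T=1.0, step=0.1):
    # Generic exploration loop: select, extend, test, record, adapt
    samples = [x0]
    while len(samples) < n_samples:              # stop condition
        x = samples[-1]                          # (1) selection: last sample
        y = x + random.uniform(-step, step)      # (2) extension: one move set call
        dE = E(y) - E(x)                         # (3) acceptance test (Metropolis)
        if dE <= 0 or random.random() < math.exp(-dE / T):
            samples.append(y)                    # (4) record
            T *= 0.95                            # (6) cool upon acceptance
        else:
            T *= 1.05                            # (6) reheat upon failure
        # (5) move set adaptation omitted: the step size is constant here
    return samples

samples = explore(lambda x: (x * x - 1.0) ** 2, x0=2.0)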
Parameters of the algorithm. By the parameters of an algorithm, we refer to the variables and constants shaping its behavior. We distinguish four types of parameters (Fig. fig-landexp-explo-params):
Exploration variables: the core variables governing the exploration. Typical such variables are the temperature and the step size used by the move set. These variables are generally global, but may also be made local so as to account for specific features of the landscape: for example, the values of the temperature and the step size may be locally adapted to the topography of the landscape. When necessary, in presenting the algorithms, we distinguish the global and local variants of these variables.
Tuning constants: multiplicative values used to evolve the exploration variables.
Control constants: the constants controlling how exploration variables are evolved. A typical such constant is the target probability for the success of an extension.
Figure: Exploration parameters, i.e. the variables and constants used by the algorithms.
Details for the selection step. The function selecting the conformation to be extended may require a distance threshold and an energy threshold. When the selection is performed amidst a large database of conformations, the execution speed may be improved using data structures providing fast retrieval of nearest neighbors; this is in particular the case if the lRMSD is used as a distance. In any case, to initialize the algorithm, a database of samples to select conformations from (possibly reducing to a single conformation) must be passed.
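When the representation allows it, nearest-neighbor retrieval can rely on standard spatial data structures. A minimal sketch using scipy's k-d tree, under the assumption that conformations are fixed-length coordinate vectors compared with the Euclidean distance (a k-d tree does not handle the lRMSD directly; see the metric trees discussed below):

import numpy as np
from scipy.spatial import cKDTree

# Hypothetical database: 1000 conformations of dimension 207 (e.g. BLN69)
database = np.random.rand(1000, 207)
tree = cKDTree(database)

# Select the stored conformation nearest to a query conformation
query = np.random.rand(207)
dist, idx = tree.query(query)
selected = database[idx]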
Details for the extension step. A move set is a unitary operation generating a new conformation from a given conformation, typically at a predefined distance called the step size. Three classical move sets are the so-called global move set, interpolation move set, and atomic move set, as detailed in section bln-movesets.
To perform an extension, it may be necessary to iterate the generation step using the move set until certain conditions are fulfilled, as detailed for the individual algorithms below.
Because the extension is in general an expensive operation, an adaptation mechanism for the parameters governing the move set is called for, whence the move-set parameter update step.
Details for the acceptance test. The prototypical acceptance test is the Metropolis criterion, which merely requires a temperature [112].
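For reference, a minimal sketch of the Metropolis criterion, with the temperature passed explicitly:

import math, random

def metropolis_accept(e_old, e_new, T):
    # Downhill moves are always accepted; uphill moves are accepted
    # with probability exp(-(e_new - e_old) / T)
    dE = e_new - e_old
    return dE <= 0 or random.random() < math.exp(-dE / T)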
Details for the acceptance parameter update. After the acceptance test is performed, the temperature can be updated through a multiplicative factor.
Several default tuning modes are offered (both are sketched after the list):
the default method consists of decreasing the temperature each time a sample is accepted, and increasing it after a number of consecutive failures.
the probability method attempts to reach a given target acceptance ratio for the acceptance test: at a fixed frequency of acceptance tests, the acceptance ratio observed since the beginning of the run is computed, and a temperature update is triggered if this ratio is less than the target.
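Both modes may be sketched as follows; the cooling factor, failure threshold, target ratio, and update frequency are hypothetical placeholders, not the package defaults:

def update_temperature_default(T, accepted, n_failures, gamma=0.95, max_failures=10):
    # Default mode: cool upon acceptance, reheat after consecutive failures
    if accepted:
        return T * gamma, 0
    n_failures += 1
    if n_failures >= max_failures:
        return T / gamma, 0
    return T, n_failures

def update_temperature_probability(T, n_accepted, n_tests, target=0.3, gamma=0.95, freq=50):
    # Probability mode: every freq tests, reheat if the acceptance ratio
    # observed since the start of the run is below the target
    if n_tests > 0 and n_tests % freq == 0 and n_accepted / n_tests < target:
        return T / gamma
    return T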
Details for the record step.
In recording conformations, we face two cases, detailed below:
Without filtering data structure: the case of BH. If no filtering is carried out during the execution, duplicates are easily sorted out upon termination with a script, e.g.:
rm -f $1_uniq.txt; touch $1_uniq.txt
rm -f $1_energies_uniq.txt; touch $1_energies_uniq.txt
zero=0
i=1
while read p; do
    count=$(grep -c "$p" "$1_uniq.txt")
    if [ "$count" -eq "$zero" ]; then
        echo "$p" >> "$1_uniq.txt"
        sed "${i}q;d" "$1_energies.txt" >> "$1_energies_uniq.txt"
    fi
    ((i++))
done < "$1.txt"

Note, however, that the number of samples requested then no longer guarantees the number of distinct minima / samples visited.
With filtering data structure: the case of T-RRT and Hybrid.
As mentioned above, filtering consists in identifying duplicates, at a sub-linear cost with respect to the number of conformations already identified. We distinguish filtering on energies and filtering on distances.
Filtering on energy commonly consists of checking that the energy of a newly found conformation differs by at least a prescribed amount from the energies collected so far. To be performed in logarithmic time, this check requires a dictionary storing conformations and their energies.
Filtering on energies is in general not sufficient, though, since the set of conformations having a prescribed energy is generically a manifold of dimension $d-1$, with $d$ the dimension of the conformational space.
To filter on distances (e.g., using the least root-mean-square deviation, lRMSD), several alternatives are possible. One approach consists of assigning to each energy bin the local minima whose energy falls in this bin. Upon discovering a novel minimum, it is then sufficient to compute distances to the local minima stored in the corresponding bin, retaining the new candidate provided that its nearest neighbor is at least at some predefined distance. A second approach consists of using data structures supporting (approximate) nearest neighbor queries [155]. For queries under the lRMSD distance in particular, one may use metric trees [178] or generalizations such as proximity forests [139]. This strategy also allows guaranteeing a minimum distance between any two conformations.
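A minimal sketch of the first approach, combining energy bins with a distance check; the bin width, distance threshold, and the use of the plain Euclidean distance instead of the lRMSD are simplifying assumptions:

import numpy as np

class MinimaFilter:
    # Retain a new minimum only if no stored minimum with a similar energy
    # lies within dist_min (Euclidean distance as a stand-in for lRMSD)
    def __init__(self, e_bin=0.01, dist_min=0.1):
        self.e_bin, self.dist_min = e_bin, dist_min
        self.bins = {}                        # energy bin index -> conformations

    def insert(self, conf, energy):
        key = int(energy // self.e_bin)
        for k in (key - 1, key, key + 1):     # also check the neighboring bins
            for other in self.bins.get(k, []):
                if np.linalg.norm(conf - other) < self.dist_min:
                    return False              # duplicate: reject
        self.bins.setdefault(key, []).append(conf)
        return True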
Details for the stop condition. Several stop criteria are currently offered:
the number of samples criterion: the algorithm stops when a prescribed number of registered conformations is reached.
the landmarks criterion: the algorithm stops once specified landmarks have been visited, up to a distance criterion.
the quenched to landmarks criterion: each sample is quenched, and the algorithm halts when the specified local minima have been visited, up to a distance criterion.
Instantiation for BH. We now present the functions used by BH, explaining in particular how the parameters of the algorithm (temperature, step size) are adapted (a minimal sketch follows the figure below):
Selection: for BH, this function simply returns the last local minimum generated.
Extension: a function applying a move set to the conformation to be extended, then quenching it. This process is iterated until a novel local minimum is discovered; equivalently, the conformation generated must leave the catchment basin of the local minimum being extended.
Acceptance test: the Metropolis transition test.
Record: adds the local minimum just accepted to the path of local minima being generated.
Parameter update: the step size is adapted so as to reach a target probability, namely the fraction of extended samples escaping the current basin. The update is performed at a predetermined frequency with respect to the number of extension attempts.
Figure: The basin hopping algorithm. The algorithm performs a walk in the space of local minima. Upon generating a new conformation by applying a move set to a local minimum, this conformation is minimized by following the negative gradient of the energy, yielding a new local minimum.
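The BH instantiation may be sketched as follows; the move set (a uniform perturbation), the quench (scipy's local minimizer), and the step-size adaptation constant are illustrative stand-ins for the SBL ingredients:

import numpy as np
from scipy.optimize import minimize

def basin_hopping(E, x0, n_minima=20, T=1.0, step=0.5):
    # Quench: follow the negative gradient of E down to a local minimum
    quench = lambda x: minimize(E, x).x
    minima = [quench(np.asarray(x0, dtype=float))]
    while len(minima) < n_minima:
        x = minima[-1]                           # selection: last minimum
        # extension: perturb with the move set, then quench; retry
        # (with a larger step) until the current basin is left
        y = quench(x + np.random.uniform(-step, step, x.shape))
        if np.allclose(x, y, atol=1e-4):
            step *= 1.1                          # parameter update: enlarge step
            continue
        dE = E(y) - E(x)                         # acceptance: Metropolis test
        if dE <= 0 or np.random.rand() < np.exp(-dE / T):
            minima.append(y)                     # record: extend the path
    return minima

minima = basin_hopping(lambda x: float((x @ x - 1.0) ** 2), [2.0])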
BH may be used to explore any model (mathematical function, toy protein model BLN, regular atomic model, etc.). Each such model yields a dedicated executable, referred to as sbl-landexp-BH-NAME-OF-MODEL.exe in the sequel.
The Jupyter notebook provided in this user manual uses several such executables.
Main options.
Input.
Main output.
Optional output.
Comments.
Instantiation for T-RRT. The method generates a tree spanning the conformational space. Following [99], the steps are (the selection and extension steps are sketched below):
Selection: this function generates uniformly at random a configuration in the conformational space, and returns the local minimum nearest to it. This strategy promotes samples with large Voronoi regions, intuitively corresponding to regions with low sampling density, as a sample is selected with a probability proportional to the volume of its Voronoi region [110]. We call this property the Voronoi bias (selection) rule hereafter.
Extension: the new conformation is obtained by linearly interpolating between the selected sample and the random configuration, at a predefined distance from the former. Note that by the convexity of Voronoi regions, the new sample remains within the Voronoi cell of the selected sample. It is also possible to iteratively generate new conformations until one lies in the current Voronoi cell, i.e. the Voronoi cell of the sample chosen for extension, as a conformation may indeed be generated outside this cell.
Acceptance test: a Metropolis test, as for BH. (See also remark rmk-energy-range.)
Record: the accepted sample is used to add an edge, from the selected sample to the new one, to the random tree being grown.
Move-set parameter update: in T-RRT, the step size is constant.
NB: the previous is sufficient when using an interpolation scheme to extend the point selected. (If the Euclidean distance is used, by convexity of Voronoi regions, any point on the line segment toward the random configuration, up to the cell boundary, lies in the Voronoi cell of the selected sample.) The situation differs when using a move set, since the point generated might be located outside the targeted Voronoi region. In that case, the extension requires generating samples until a point in the targeted Voronoi region has been found. This may require a step size adaptation, a situation analogous to that faced in BH.
Acceptance parameter update: upon accepting a sample with increasing energy, the temperature is tuned using the range of energies of samples already inserted into the tree (supplemental algorithm alg-adaptTd-algoBH and [69]).
Algorithm: temperature adaptation taking into account the maximal energy variation of the conformational ensemble generated.
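The Voronoi bias rule does not require constructing Voronoi cells explicitly: drawing a random configuration and returning its nearest stored sample selects each sample with probability proportional to the volume of its Voronoi cell. A minimal sketch of the selection and extension steps, assuming a unit-box conformational space and the Euclidean distance:

import numpy as np

def trrt_select_and_extend(nodes, delta=0.1):
    # nodes: (n, d) array of samples already inserted in the tree
    d = nodes.shape[1]
    q_rand = np.random.rand(d)                 # random target in the unit box
    # Voronoi bias: the nearest node owns the cell containing q_rand,
    # so each node is picked proportionally to its cell volume
    i_near = int(np.argmin(np.linalg.norm(nodes - q_rand, axis=1)))
    q_near = nodes[i_near]
    # Extension: interpolate from q_near toward q_rand at step size delta
    direction = q_rand - q_near
    q_new = q_near + delta * direction / np.linalg.norm(direction)
    return i_near, q_new

nodes = np.random.rand(10, 2)
parent, candidate = trrt_select_and_extend(nodes)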
T-RRT may be used to explore any model (mathematical function, toy protein model BLN, regular atomic model, etc.). Each such model yields a dedicated executable, referred to as sbl-landexp-TRRT-NAME-OF-MODEL.exe in the sequel.
The Jupyter notebook provided in this user manual uses several such executables.
Main options.
Input.
Main output.
Identical to that of Basin-hopping, see above.
Optional output.
Comments.
Intuitively, our hybrid algorithm may be seen as a variant of BH in which, instead of systematically extending the local minimum just found, the Voronoi bias rule from T-RRT is used once every so many steps, in order to select the local minimum to be extended. In the following, this number of steps is called the switch parameter. Similarly to T-RRT, the algorithm generates a tree connecting local minima. In Hybrid, however, this tree can be partitioned into threads, each thread consisting of local minima generated by a contiguous sequence of BH extensions and quenches (Fig. fig-landexp-algoHY).
More precisely, Hybrid uses the following five functions to instantiate the generic algorithm alg-landexp-generic (a self-contained sketch follows the figure below):
Selection: at every switch-parameter-th step, the Voronoi bias is used to select the local minimum to be extended; for the remaining steps, the sample to be extended is the last local minimum generated.
Extension: the extension strategy is identical to that used in BH. (NB: this departs from T-RRT in that a move set is used here, as opposed to an interpolation scheme.)
Acceptance test: the Metropolis test, as in BH and T-RRT.
Record: identical to that used in T-RRT, i.e. the accepted local minimum is connected to the tree.
Parameter update: upon starting a new thread, i.e. upon selecting a sample to extend using the Voronoi bias rule, the step size is re-initialized.
The following important remarks are in order:
The switch parameter can be optimized to best exploit features of a given PEL [153] .
For the extension strategy, using a move set (rather than the interpolation used in T-RRT) provides much more flexibility. This turns out to be critical to focus the exploration on low energy regions.
In the parameter update step, resetting the parameters upon starting a new thread is also critical, as this allows locally adapting the parameters to the landscape. Because Hybrid explores very diverse regions of the PEL thanks to the Voronoi bias rule, failure to reinitialize may result in using inappropriate parameter values. We note in passing that an alternative to the reset was tested, in which each local minimum was stored with the parameters it was discovered with; see the comments on the memory mechanism in the Results section.
Figure: The Hybrid algorithm consists of interleaving calls to BH and T-RRT. In this cartoon, starting from the initial configuration (black dot), one step of T-RRT is performed after every three steps of BH (i.e., a switch parameter of three). The local minima generated by a contiguous sequence of extensions and quenches are represented with the same color; each such set of local minima is also called a thread. Thus, in this example, Hybrid has generated four threads of size three.
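A self-contained sketch of the interleaving; the sampling box, tuning constants, and the reuse of the BH extension from the sketch above are illustrative assumptions:

import numpy as np
from scipy.optimize import minimize

def hybrid(E, x0, n_minima=30, k=10, step0=0.5, T=1.0, box=3.0):
    quench = lambda x: minimize(E, x).x
    nodes, edges, step = [quench(np.asarray(x0, dtype=float))], [], step0
    while len(nodes) < n_minima:
        if len(nodes) % k == 0:
            # New thread: Voronoi bias selection via a random target point
            target = np.random.uniform(-box, box, nodes[0].shape)
            parent = int(np.argmin([np.linalg.norm(m - target) for m in nodes]))
            step = step0                         # re-initialize the step size
        else:
            parent = len(nodes) - 1              # BH regime: last minimum
        x = nodes[parent]
        y = quench(x + np.random.uniform(-step, step, x.shape))
        if np.allclose(x, y, atol=1e-4):
            step *= 1.1                          # failed to leave the basin
            continue
        dE = E(y) - E(x)
        if dE <= 0 or np.random.rand() < np.exp(-dE / T):
            nodes.append(y)                      # record: connect to the tree
            edges.append((parent, len(nodes) - 1))
    return nodes, edges

nodes, edges = hybrid(lambda x: float((x @ x - 1.0) ** 2), [2.0])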
Hybrid may be used to explore any model (mathematical function, toy protein model BLN, regular atomic model, etc.). Each such model yields a dedicated executable, referred to as sbl-landexp-hybrid-NAME-OF-MODEL.exe in the sequel.
The Jupyter notebook provided in this user manual uses several such executables.
Main options.
Input.
In dealing with atomic molecular models in general and proteins in particular, the following are also needed:
Main output. Whatever the system, the output files are as described previously:
Identical to that of Basin-hopping and TRRT, see above.
Optional output.
Comments.
All algorithms presented in this package follow the generic template of Algorithm alg-landexp-generic.
They use different ingredients, including (i) calculation of the potential energy (see packages Molecular_covalent_structure and Molecular_potential_energy), (ii) evaluation of the gradient of a real-valued function (see package Real_value_function_minimizer), and (iii) spatial data structures for nearest neighbor searches (see package Spatial_search).
Note that the computation of the gradient is only semi-automated, as mentioned in Remark 1 of the section A generic / template algorithm. In particular, users should provide their own implementation of the gradient of the energy function, whatever its source. One option is to use an automatic differentiation tool such as Tapenade to generate this gradient.
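When neither an analytical nor an auto-generated gradient is at hand, a central finite-difference approximation may serve as a stopgap for testing, at the cost of 2d energy evaluations per gradient in dimension d:

import numpy as np

def numerical_gradient(E, x, h=1e-6):
    # Central differences: dE/dx_i ~ (E(x + h e_i) - E(x - h e_i)) / (2h)
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (E(x + e) - E(x - e)) / (2 * h)
    return grad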
In addition to the usual external libraries used in the SBL (CGAL, Boost, Eigen), this package uses two other external libraries:
The programs of Landscape_explorer described above are based upon generic C++ classes, so that additional versions can easily be developed.
Deriving other versions involves two important ingredients: the workflow class and its traits class.
T_Landscape_explorer_traits:
T_Landscape_explorer_workflow:
See the following Jupyter notebook:
The exploration of an energy landscape may focus on different features:
* the extent explored
* the identification of (deep) local minima
* the identification of saddle points / transition points
In the following, we illustrate three different methods, which in a nutshell may be characterized by:
* Basin-hopping: focus on local minima -- see below
* Rapidly exploring random trees: focus on the extent visited and transition states
* Hybrid: mix between BH and T-RRT
Results are presented for the following systems:
* a complex trigonometric function (defined below)
* BLN69, a toy protein model, whose conformation requires 3x69=207 Cartesian coordinates
* Penta-alanine, aka Ala5
import os
import re
import math
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform
# A function used later to retrieve specific output files
def find_file_in_output_directory(suffix, odir):
    # Return the first file under odir whose name ends with suffix;
    # the glob is quoted so that the shell does not expand it prematurely
    cmd = "find %s -name '*%s'" % (odir, suffix)
    return os.popen(cmd).readlines()[0].rstrip()
def plot_dendogram_of_conformations(distances):
    # Read the n(n-1)/2 pairwise distances and recover n
    lines = open(distances, "r").readlines()
    s = len(lines)
    n = int((1 + math.sqrt(1 + 8 * s)) / 2)
    # Load the distances into a symmetric n x n matrix
    mat = np.zeros((n, n))
    for line in lines:
        aux = re.split(r"\s+", line.rstrip())
        mat[int(aux[0]), int(aux[1])] = float(aux[2])
        mat[int(aux[1]), int(aux[0])] = float(aux[2])
    # Plot the dendrogram from the condensed distance matrix
    dists = squareform(mat)
    linkage_matrix = linkage(dists, "single")
    dendrogram(linkage_matrix, labels=list(range(n)))
    plt.title("Conformations: dendrogram based on lrmsd")
    plt.show()
$ f(x,y)=(x\sin(20y)+y\sin(20x))^2\cosh(\sin(10x)x)+(x\cos(10y)-y\sin(10x))^2\cosh(\cos(20y)y)$
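For reference, a direct Python transcription of this terrain:

import math

def trigo_terrain(x, y):
    # f(x, y) as defined above
    t1 = (x * math.sin(20 * y) + y * math.sin(20 * x)) ** 2 * math.cosh(math.sin(10 * x) * x)
    t2 = (x * math.cos(10 * y) - y * math.sin(10 * x)) ** 2 * math.cosh(math.cos(20 * y) * y)
    return t1 + t2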
import matplotlib.pyplot as plt
from PIL import Image
# Display the three exploration results side by side
fig, ax = plt.subplots(1, 3, figsize=(15, 5))
images = ["fig/bh-trigo-50.png", "fig/trrt-trigo-50.png", "fig/hybrid-trigo-50.png"]
for i in range(3):
    ax[i].set_xticks([]); ax[i].set_yticks([])
    ax[i].imshow(Image.open(images[i]))
plt.show()
print("Displayed")