Structural Bioinformatics Library
Template C++ / Python API for developing structural bioinformatics applications.

In the following, we provide the main concepts and the terminology used in the SBL.
We now introduce the general terminology used throughout the library.
A molecular structure is a representation of one or more molecules made of particles, namely atoms, or pseudoatoms for coarse-grain models. The description of a molecular structure has two components: the molecular geometric model and the molecular systems, as described below.
In short:
By molecular geometric model, we refer to the three-dimensional arrangement of the particles constituting a molecule or a complex.
Practically, we consider two classes of representations for molecular geometries:
representations based on Work Package: Space Filling Models, i.e. collections of balls.
For the input / output operations on these molecular geometric models, see section Conformation Loaders.
In the SBL, models are typically used in applications gathered in eponymous parts. For example, space filling models, defined in section Work Package: Space Filling Models, are used in the part of the library using such models i.e. Space Filling Model.
A molecular system is a grouping of the particles of a molecular structure. Such groupings are used when the groups formed have specific properties, or when one wishes to investigate interactions between such groups.
Given a molecular structure, a set of labels $L$, together with a mapping from the particles to the set $L$, the molecular system associated to a label $l \in L$ is the set of all the particles having $l$ as label.
The labels may have a hierarchical structure, in which case they are represented as (a forest of) trees. In that case, we distinguish between primitive labels and hierarchical labels. The set of all primitive labels is denoted $P(L)$, and given a label $l$, the set of primitive labels in the subtree rooted at $l$ is denoted $P(l)$. Note that if $l$ is a primitive label, then $P(l) = \{l\}$.
Given a list of particles $A$, $P(L)$ induces a unique partition of $A$, so that the particle classifier associated with $L$ is the mapping $C_L$ from $A$ to the list of all primitive labels $P(L)$: $C_L : A \to P(L)$.
In this respect, the molecular system associated with a label $l$, system($l$) for short, is the preimage of $P(l)$ under the particle classifier $C_L$, i.e. $\text{system}(l) = C_L^{-1}(P(l))$.
Practically, a number of classifiers defining molecular systems are provided. These classifiers follow a generic pattern (a C++ concept to perform a hierarchical decomposition), as defined in the package MolecularSystemLabelsTraits.
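The relation between hierarchical labels, primitive labels and molecular systems can be illustrated with a minimal Python sketch. It is not the C++ concept of MolecularSystemLabelsTraits: the label forest, the particle identifiers and the function names below are hypothetical and only mirror the definitions above.

```python
# Hypothetical label forest: a hierarchical label maps to its child labels;
# any label absent from this dictionary is treated as primitive.
CHILDREN = {
    "complex": ["antibody", "antigen"],
    "antibody": ["heavy", "light"],
}

def primitive_labels(label):
    """Return the primitive labels found in the subtree rooted at `label`."""
    children = CHILDREN.get(label)
    if not children:
        return {label}  # a primitive label is its own subtree
    found = set()
    for child in children:
        found |= primitive_labels(child)
    return found

def system(label, classifier):
    """Return the particles whose primitive label lies below `label`,
    i.e. the preimage of the subtree's primitive labels under the classifier."""
    targets = primitive_labels(label)
    return {particle for particle, prim in classifier.items() if prim in targets}

# Hypothetical particle classifier: particle id -> primitive label.
classifier = {"a1": "heavy", "a2": "heavy", "a3": "light", "a4": "antigen"}
```

With this toy data, `system("antibody", classifier)` gathers the particles labeled `heavy` or `light`, while `system("antigen", classifier)` contains the single particle `a4`.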
In the SBL, each particle is annotated with properties. Depending on the context, the annotations may be compulsory or optional.
A compulsory annotation is one for which the memory needed to store it within each particle is allocated at compile time. As the name suggests, such annotations are mandatory in a given context. See section Compulsory Annotations for more details.
As an example, one may consider the case of particles represented by 3D balls: annotating such particles with a radius is mandatory. See section Atomic Radii and Group Radii for more details.
By contrast, optional annotations are loaded on the fly – no storage is reserved at compile time. Such annotations are typically used to further analyze the results of an SBL program. Optional annotations are dynamic, i.e. they are loaded while running the SBL program, using the option –annotationsfile. Note that any number of annotations can be loaded by repeating the option for each annotation file.
As an example of optional annotations, one may consider solvation parameters, on a per particle (atom) basis. See section Optional Annotations for more details.
The annotation system is described precisely in the package ParticleAnnotator.
The SBL relies on the Easy Structural Biology Template Library (ESBTL) to load the pieces of information contained in a PDB file into SBL data structures. The ESBTL library is highly adaptable when it comes to the input source format.
It defines a hierarchical representation of a protein, from the whole molecule to each atom of the protein. Each level of the hierarchy is represented by a data structure, which may be replaced by a custom one. When loading the data from a PDB file, it is possible to define which fields of a PDB file will be used, and how the different data structures will be filled with these fields.
The hierarchy of data structures in the ESBTL library, from bottom to top, is the following one:
the molecular atom: the bottom of the hierarchy, meant to represent an atom in a molecule; by default, all the fields found in a PDB file are stored.
the molecular residue: a container of molecular atoms representing a residue. In particular, it is possible to visit all the atoms in the residue.
the molecular chain: a container of molecular residues representing a polypeptide chain. In particular, it is possible to visit all the residues in the chain.
the molecular model: a container of molecular chains representing a geometric model of a molecule or complex, a model being determined by a set of coordinates. For example, if the PDB file contains several NMR models, there will be as many ESBTL models. Given a model, an iterator to visit its chains is provided.
the molecular system: the root representation. In the simplest setting, the root corresponds to the whole molecule or complex. But as seen earlier with labels, the molecule or complex may have been split into pieces, in which case there are as many systems as pieces.
Given a system, an iterator to visit its models is provided.
The ESBTL also offers the possibility to visit the secondary structure elements (SSE) when they are available from the input PDB file. The SSE are containers of residues accessible from the molecular chains. In particular, it is possible to query the type of an SSE according to the PDB nomenclature, or to visit all residues or atoms in the SSE.
A space filling model is a molecular geometric model where each particle is represented by a 3D ball. Such models are of special interest to represent molecular surfaces and volumes, as well as interfaces. In the SBL, applications using space filling models are grouped in the Applications page Space Filling Model.
In the sequel, we review some basic facts about these models. For a discussion of such models from the biophysical standpoint, the reader may consult [61] as well as [60] and the references therein. For a treatment of selected aspects of such models, shapes in particular, the reader may consult [55].
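A finite family of balls is denoted $\mathcal{B} = \{B_i\}_{i=1,\dots,n}$, the $i$th ball is denoted $B_i$, and its bounding sphere $S_i$. The union of the balls in $\mathcal{B}$ is denoted $\mathcal{F}$:
$$\mathcal{F} = \bigcup_{i=1}^{n} B_i.$$
The boundary operator is denoted $\partial$. For example, $\partial B_i$ is exactly the sphere $S_i$. Also, the boundary of the space filling model of $\mathcal{B}$ is denoted $\partial \mathcal{F}$.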
In a space filling model, the radius of a particle typically depends on the atomic type and on its covalent environment.
When dealing with crystal structures, the radii may be adapted depending on whether hydrogen atoms are present. When H atoms have not all been reported, which is in general the case, a common strategy consists of using so-called group radii: the group radius of a heavy atom accounts for its own size together with that of the H atoms it is covalently bonded to, see [37].
In the SBL, the radii of particles are annotations attached to the particles, loaded from a file specifying the radius for each particle type. In particular, the class SBL::Models::T_Radius_annotator_for_particles_with_annotated_name allows:
to load the radii from a specification file, or to load default radii if no file is given (default radii are the ones from [37] , see SBL::Models::T_Radius_annotator_for_particles_with_annotated_name),
to add a constant value to all loaded radii (e.g., to account for a water probe by adding 1.4 Å to all radii, see section Solvent Accessible Model).
For atoms, two sets of group radii are available: those of Chothia et al. from 1975 [38] (the default ones, available here), and those of Tsai et al. from 1999 [109] (available here). When using pseudoatoms representing the residues, there is one radius per residue type. The radius of an amino acid is computed as the radius of the 3D ball whose volume is the average volume of the residue [98] (available here).
The Solvent Accessible Model (SAM) is a molecular geometric model where the particle radii are expanded by the mean radius of a water molecule (circa 1.4 Å). Doing so results in a space filling model in which particles that are nearby in 3D space, yet not covalently linked, intersect.
The Solvent Accessible Surface (SAS) of a SAM is the boundary of the union of the balls defining the SAM. The SAS consists of spherical polygons, circle arcs (found at the intersection of two spheres), and vertices (found at the intersection of three spheres), as defined in the package Union_of_balls_boundary_3. The area of the SAS is called the Solvent Accessible Surface Area (SASA).
Consider two partners $A$ and $B$ forming a complex $AB$. One classical way to identify the interface particles of this complex, using the SAM of the partners and of the complex, is the following: an interface particle is any particle contributing to the SAS of its subunit, part of this exposed surface being covered in the complex $AB$. Phrased differently, the particle contributes to the SAS of its subunit, but part (possibly all) of this surface is covered by the partner. The buried surface area (BSA) of the complex is defined by:
$$\text{BSA} = \text{SASA}(A) + \text{SASA}(B) - \text{SASA}(AB).$$
See also the Buried_surface_area package for more details.
Using this model, the core and the rim of an interface are easily defined [80]: the rim consists of the particles retaining solvent accessibility in the complex, while the core consists of the particles which are buried in the complex.
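The definitions above can be sketched in a few lines of Python, assuming the usual convention that the buried surface area is the SASA lost upon complex formation, and assuming that per-particle SASA values are already available (e.g. computed by another program). The function names, the dictionaries and the tolerance `eps` are hypothetical.

```python
def buried_surface_area(sasa_a, sasa_b, sasa_ab):
    """BSA of a complex AB: SASA lost when the partners A and B assemble."""
    return sasa_a + sasa_b - sasa_ab

def classify_interface(sasa_in_subunit, sasa_in_complex, eps=1e-9):
    """Map each interface particle to 'rim' or 'core'.

    A particle is at the interface if it is exposed in its own subunit and
    loses part (possibly all) of this exposed area in the complex. It belongs
    to the rim if it stays partly accessible in the complex, to the core if
    it becomes fully buried.
    """
    labels = {}
    for particle, exposed in sasa_in_subunit.items():
        remaining = sasa_in_complex.get(particle, 0.0)
        if exposed > eps and exposed - remaining > eps:
            labels[particle] = "rim" if remaining > eps else "core"
    return labels
```

Particles that are buried in their own subunit never qualify with this criterion, which is one of the limitations addressed by Voronoi based interface models.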
This model is related to Voronoi models of interfaces [27], [18], and [31], which may be seen as improvements in several respects, in particular:
selected particles which are buried in their own subunit can be found at the interface.
In the SBL, topics related to conformational analysis are studied in the more general setting of energy landscapes (EL) [113]. The SBL provides numerous tools to sample energy landscapes and study the resulting sets of conformations, as detailed in [30] and [100].
The corresponding applications are gathered in the part Conformational Analysis. The corresponding C++ code hinges on the following concepts:
Conformations and their representations
Energy landscapes and their representations
Algorithms to explore energy landscapes
Conformations and their representation in Cartesian or internal coordinates. We consider a macromolecular system involving $n$ atoms, the $i$th atom being denoted $a_i$. The conformational space of the system is denoted $\mathcal{C}$, and its dimension $d$. A conformation or sample $c \in \mathcal{C}$ refers to a conformation of the system.
In Cartesian coordinates, each atom is attributed 3 coordinates (x, y, z). In internal coordinates – also known as Z-matrix, see [94] – the atoms of a molecule are instead described using bond distances between two atoms, bond angles between three atoms and torsion angles between four atoms – see Fig. figinternalcoordinates. It is possible to switch from one coordinate system to the other by applying a transformation. However, each transformation requires information that is not encoded in the original coordinate system.
Internal coordinates representation: illustration for a system of four atoms. The four atoms are $a_1, a_2, a_3, a_4$; they are assumed to be non coplanar, and the covalent bonds are represented by bold line segments. There are three bond lengths represented by bold solid line segments, two bond angles represented by solid circular arcs, and one torsion angle represented by dashed circular arcs. Note that the torsion angle is the dihedral angle between the plane defined by $a_1, a_2, a_3$ and the plane defined by $a_2, a_3, a_4$.
Moving from internal coordinates to Cartesian coordinates. This transformation requires the Cartesian coordinates of the first three atoms:
The first atom corresponds to the origin of the coordinate system. Practically, its three Cartesian coordinates are set to zero.
The second atom is at a fixed distance from the first one (their bond distance), so that there are two degrees of freedom left. Practically, one may set the x value to distance, and the remaining two coordinates to zero.
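The placement rules above can be sketched in Python. The first two atoms follow the conventions just stated; placing the third atom in the xy plane and building each subsequent atom from a bond length, a bond angle and a torsion angle (a construction often called NeRF) are assumptions made for this sketch, not a description of the SBL classes. All function names are hypothetical.

```python
import math

def sub(u, v):
    return [u[i] - v[i] for i in range(3)]

def cross(u, v):
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]

def unit(u):
    norm = math.sqrt(sum(x * x for x in u))
    return [x / norm for x in u]

def place_first_three(d12, d23, angle_123):
    """Atom 1 at the origin, atom 2 on the x axis, atom 3 in the xy plane
    (assumed convention). angle_123 is the bond angle at atom 2, in radians."""
    a1 = [0.0, 0.0, 0.0]
    a2 = [d12, 0.0, 0.0]
    a3 = [d12 - d23 * math.cos(angle_123), d23 * math.sin(angle_123), 0.0]
    return a1, a2, a3

def place_next(a, b, c, bond, angle, torsion):
    """Place atom d bonded to c, given |cd|, the angle (b, c, d) and the
    torsion angle (a, b, c, d), both in radians."""
    bc = unit(sub(c, b))
    n = unit(cross(sub(b, a), bc))  # normal to the plane (a, b, c)
    m = cross(n, bc)                # completes the local orthonormal frame
    local = (-bond * math.cos(angle),
             bond * math.sin(angle) * math.cos(torsion),
             bond * math.sin(angle) * math.sin(torsion))
    return [c[i] + local[0] * bc[i] + local[1] * m[i] + local[2] * n[i]
            for i in range(3)]
```

With this convention, a torsion angle of zero places the fourth atom in the plane of the first three, on the same side as the first atom (cis configuration).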
Moving from Cartesian coordinates to internal coordinates. Assume that the topology of the molecule (the bonds) is known. From this topology, internal coordinates can be computed for each bond (its bond length), for each pair of consecutive bonds (their bond angle), and for each triple of consecutive bonds (their torsion angle).
The class SBL::CSB::T_Molecular_internal_coordinates allows one to compute each bond length, bond angle and torsion angle individually.
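For illustration, the three kinds of internal coordinates can be computed from Cartesian coordinates as follows. This is a stand-alone Python sketch with hypothetical function names, not the interface of SBL::CSB::T_Molecular_internal_coordinates; the torsion angle uses the common atan2 formulation and is returned in radians, in the interval from -pi to pi.

```python
import math

def sub(u, v):
    return [u[i] - v[i] for i in range(3)]

def dot(u, v):
    return sum(u[i] * v[i] for i in range(3))

def cross(u, v):
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]

def bond_length(a, b):
    return math.sqrt(dot(sub(b, a), sub(b, a)))

def bond_angle(a, b, c):
    """Angle at atom b between the bonds (b, a) and (b, c)."""
    u, v = sub(a, b), sub(c, b)
    cosine = dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))
    return math.acos(max(-1.0, min(1.0, cosine)))  # clamp rounding errors

def torsion_angle(a, b, c, d):
    """Dihedral angle between the planes (a, b, c) and (b, c, d)."""
    b1, b2, b3 = sub(b, a), sub(c, b), sub(d, c)
    n1, n2 = cross(b1, b2), cross(b2, b3)
    y = math.sqrt(dot(b2, b2)) * dot(b1, n2)
    x = dot(n1, n2)
    return math.atan2(y, x)
```

Applying these three functions along the bonds of a chain yields its Z-matrix; the sign convention of the torsion angle has to match the one used when converting back to Cartesian coordinates.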
Comparing conformations. In general, the distance between two conformations $c_i$ and $c_j$ is denoted $d(c_i, c_j)$. A default choice is the least root mean square deviation (lRMSD), namely the square root of the average squared distance deviation in atom positions, minimized over rigid motions of the system. The lRMSD with chirality may also be used.
Sampling. In the context of the SBL, a sampling refers to a set of conformations. It may also be called a conformational ensemble , even though it does not carry any statistical property.
A conformational ensemble, also called a sampling, is a set of conformations, that is $C = \{c_1, \dots, c_N\}$. Whenever a local energy minimizer is available, that is, when the conformations can be quenched, the associated set of quenched conformations is denoted $C^q$, that is $C^q = \{q(c_1), \dots, q(c_N)\}$, with $q(c_i)$ the local minimum obtained by quenching $c_i$.
If Cartesian coordinates are used, the conformations in $C$ can also be aligned on a reference conformation, say $c_0$. Aligning the $i$th conformation $c_i$ onto $c_0$ results in the representation of that conformation denoted $\bar{c}_i$, and the set of such conformations is denoted:
$$\bar{C} = \{\bar{c}_1, \dots, \bar{c}_N\}.$$
Once a one-to-one correspondence between the atoms of $c_i$ and $c_0$ has been set, aligning $c_i$ onto $c_0$ requires computing the rigid motion yielding the least root mean square deviation between $c_i$ and $c_0$.
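As an illustration, the least RMSD between two conformations with a known atom correspondence can be computed without building the rigid motion explicitly. The sketch below is pure Python and uses the quaternion formulation of Horn (1987), with a shifted power iteration to approximate the largest eigenvalue of a 4x4 matrix. It is an assumption of this example and not necessarily the method implemented in the SBL; the function names and the iteration count are hypothetical, and convergence may be slow for degenerate (e.g. nearly collinear) point sets.

```python
import math

def centered(points):
    """Translate a list of 3D points so that its centroid is the origin."""
    n = len(points)
    c = [sum(p[k] for p in points) / n for k in range(3)]
    return [[p[k] - c[k] for k in range(3)] for p in points]

def lrmsd(conf_a, conf_b, iterations=500):
    """Least RMSD over rigid motions between two lists of 3D points."""
    a, b = centered(conf_a), centered(conf_b)
    n = len(a)
    ga = sum(x * x for p in a for x in p)
    gb = sum(x * x for p in b for x in p)
    shift = 0.5 * (ga + gb)  # makes the shifted matrix positive semi-definite
    if shift == 0.0:
        return 0.0
    # Correlation matrix: s[i][j] = sum over atoms of a_i * b_j.
    s = [[sum(p[i] * q[j] for p, q in zip(a, b)) for j in range(3)]
         for i in range(3)]
    (sxx, sxy, sxz), (syx, syy, syz), (szx, szy, szz) = s
    nmat = [
        [sxx + syy + szz, syz - szy, szx - sxz, sxy - syx],
        [syz - szy, sxx - syy - szz, sxy + syx, szx + sxz],
        [szx - sxz, sxy + syx, -sxx + syy - szz, syz + szy],
        [sxy - syx, szx + sxz, syz + szy, -sxx - syy + szz],
    ]
    v = [1.0, 0.5, 0.25, 0.125]  # arbitrary start vector
    for _ in range(iterations):
        w = [sum(nmat[i][j] * v[j] for j in range(4)) + shift * v[i]
             for i in range(4)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient: approximates the largest eigenvalue of nmat.
    lam = sum(v[i] * nmat[i][j] * v[j] for i in range(4) for j in range(4))
    return math.sqrt(max(0.0, (ga + gb - 2.0 * lam) / n))
```

Only proper rotations are considered, so a conformation and its mirror image are in general at a non-zero distance.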
Nearest neighbor graph. We define:
That is, an NNG connects conformations in the configuration space of the system studied. We use NNGs in two guises. First, we build an NNG by connecting a sample to its $k$ nearest neighbors. Second, we build an NNG by connecting a sample to all samples of the ensemble within a given distance $r$.
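Both constructions can be sketched with a brute-force Python example. The Euclidean distance between coordinate vectors is used here as a stand-in for the distance between conformations (the lRMSD could be substituted), and the parameter names `k` and `r` as well as the function names are hypothetical.

```python
import math

def knn_graph(samples, k):
    """Undirected NNG: connect each sample to its k nearest neighbors."""
    edges = set()
    for i, s in enumerate(samples):
        others = sorted((j for j in range(len(samples)) if j != i),
                        key=lambda j: math.dist(s, samples[j]))
        for j in others[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges

def range_graph(samples, r):
    """Undirected NNG: connect samples at distance at most r."""
    n = len(samples)
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if math.dist(samples[i], samples[j]) <= r}
```

Both functions run in quadratic time; for large ensembles a spatial data structure or an approximate nearest neighbor search would be preferable.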
Landscape and potential energy landscape (PEL). We define an energy landscape [113], or landscape for short, as a triple:
a conformational space $\mathcal{C}$,
a height function $h$,
a distance function $d$.
The height function is a mapping from the conformational space to the real numbers. Given a fixed elevation $h_0$, the portion of the landscape located below (resp. above) the elevation $h_0$ is called a sublevel set (resp. superlevel set).
Critical points and their connections. If the gradient of the height function vanishes at $c$, the conformation $c$ is called a critical point or stationary point. Practically, we shall deal with two types of critical points: local minima, and index one saddles (saddles for short). The function value at a critical point is called the critical value.
For a given minimum, of particular interest is the lowest transition state that directly connects it to a minimum of lower energy:
If $c$ is not a critical point, quenching consists of (numerically) following the negative gradient until a local minimum $m$ is found. In this case, there exists an integral curve of the gradient vector field joining $c$ to $m$, and one says that $c$ flows to $m$.
Discrete representations involving samples. Various constructions can be carried out by combining samplings and energies.
A lifted sample is a sample equipped with a real number, called its height. When this number represents the potential energy, a collection of such samples is called a sampled energy landscape.
Also, using the connectivity of a NNG to connect lifted samples results in a lifted NNG. For example, if the height is the potential energy, the lifted NNG defines a network on the PEL.
Discrete representations involving critical (i.e., stationary) points only. Of special interest on a landscape are the local minima and the transition paths connecting them. In the smooth setting, a transition between two local minima corresponds to the existence of an integral curve joining the saddle to each minimum. More specifically, in Morse theory [87], these curves define the so-called unstable manifold of the saddle. Generically, two such curves are found for each saddle – note however that they may end up in the same local minimum, a situation we refer to as a bump transition (Fig. figbumpmiddleslopebassin).
Note that in a compressed transition graph (TG), all vertices correspond to local minima, while every edge corresponds to two minima which are connected through a saddle in the TG. As we shall see, such compressed graphs are useful to derive a number of properties of energy landscapes, and also to compare them.
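The compression step can be illustrated with a short Python sketch, under the assumption that the transition graph is given as a map from each saddle to its energy and to the two local minima it connects. The data layout, the function name, and the choice to skip bump transitions (saddles linked twice to the same minimum) are assumptions of this example.

```python
def compress_transition_graph(saddles):
    """Compress a transition graph.

    `saddles` maps a saddle id to (energy, minimum_a, minimum_b). The result
    maps each pair of distinct minima to the list of (energy, saddle id)
    connecting them, sorted by increasing energy, so that the information
    attached to saddles is stored on the edges joining local minima.
    """
    edges = {}
    for saddle, (energy, min_a, min_b) in saddles.items():
        if min_a == min_b:
            continue  # bump transition: no edge between distinct minima
        key = (min(min_a, min_b), max(min_a, min_b))
        edges.setdefault(key, []).append((energy, saddle))
    return {key: sorted(values) for key, values in edges.items()}
```

The first entry of each list is the lowest saddle between the two minima, which is typically the one of interest when studying transitions.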
For a number of tasks related to the analysis of energy landscapes, it is useful to think of the TG as a bipartite graph:
This definition calls for two comments:
In computational topology [11], the Morse-Smale-Witten (MSW) complex is the mathematical object allowing one to efficiently compute the homology of sublevel sets of a manifold, from a function defined on that manifold – in our case the conformational space and the energy. The MSW complex involves critical points of all indices, but for energy landscapes, we shall mainly use local minima and index one saddles.
Selected saddles associated with a transition graph can also be used to define the following:
The following comments are in order:
The forest associated with a disconnectivity graph (DG) has a single tree when the transition graph (TG) is connected.
Generically, in the context of smooth Morse theory, a key saddle is linked to two local minima, which may coincide. Practically, degeneracies where a key saddle is linked to more than two local minima may be encountered.
Several important operations can be carried on landscapes, including:
A loop around a saddle may be anchored in a single local minimum – a.k.a. bump transition. Note that the dotted path is located behind the bump of this fictitious 2D landscape.
The Himmelblau function: landscape. The function is defined by: $f(x, y) = (x^2 + y - 11)^2 + (x + y^2 - 7)^2$. It has four local minima, four index one saddles, and one local maximum.
Himmelblau: (compressed) transition graph. (A) The landscape of Himmelblau, decomposed into the catchment basins of the four local minima. (B) The transition graph, with one node for each critical (i.e., stationary) point. (C) The compressed transition graph, where the information associated with saddles is stored in the edges joining local minima.
Himmelblau: disconnectivity graph 
This section briefly presents selected landscapes used to test our sampling algorithms.
As a simple illustration, we use the Himmelblau function, see Fig. fighimmelblau and also Himmelblau's function on Wikipedia.
The Himmelblau function
Function value. The function value is a degree four bivariate polynomial:
$$f(x, y) = (x^2 + y - 11)^2 + (x + y^2 - 7)^2.$$
Gradient. Easily computed by hand.
Move set to generate a new sample. To generate a new conformation, one picks a random conformation at a predefined distance from the current sample.
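The three ingredients (function value, gradient, move set) can be sketched in Python, together with a quench implemented as a plain fixed-step gradient descent. The step size, tolerance and function names are hypothetical choices made for this illustration; they are not the parameters of the SBL samplers.

```python
import math
import random

def himmelblau(x, y):
    return (x * x + y - 11.0) ** 2 + (x + y * y - 7.0) ** 2

def gradient(x, y):
    """Gradient of the Himmelblau function, computed by hand."""
    u = x * x + y - 11.0
    v = x + y * y - 7.0
    return (4.0 * x * u + 2.0 * v, 2.0 * u + 4.0 * y * v)

def quench(x, y, step=0.01, tol=1e-10, max_iter=100000):
    """Follow the negative gradient until it (numerically) vanishes."""
    for _ in range(max_iter):
        gx, gy = gradient(x, y)
        if math.hypot(gx, gy) < tol:
            break
        x, y = x - step * gx, y - step * gy
    return x, y

def move(x, y, distance, rng=random):
    """Pick a point uniformly at random on the circle of radius `distance`."""
    theta = rng.uniform(0.0, 2.0 * math.pi)
    return x + distance * math.cos(theta), y + distance * math.sin(theta)
```

Alternating `move` and `quench` from a starting point yields a simple exploration of the landscape: each quenched sample lands in one of the local minima, and the minimum reached identifies the catchment basin of the sample.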
The Rastrigin function is a classical non convex function used in optimization benchmarks.
Function value. Denoting $x = (x_1, \dots, x_d)$ the $d$-dimensional vector of coordinates, the function value is defined as follows:
$$f(x) = A d + \sum_{i=1}^{d} \left( x_i^2 - A \cos(2 \pi x_i) \right), \quad A = 10.$$
Gradient. Easily computed by hand.
Move set to generate a new sample. To generate a new conformation, one picks a random conformation at a predefined distance from the current sample.
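A minimal Python sketch of the function value and of its gradient follows, assuming the standard form of the Rastrigin function with the usual constant A = 10; the function names are hypothetical.

```python
import math

A = 10.0  # usual constant of the Rastrigin function

def rastrigin(x):
    """Function value for a d-dimensional point given as a list of floats."""
    return A * len(x) + sum(xi * xi - A * math.cos(2.0 * math.pi * xi)
                            for xi in x)

def rastrigin_gradient(x):
    """Gradient: each coordinate only depends on the matching x_i."""
    return [2.0 * xi + 2.0 * math.pi * A * math.sin(2.0 * math.pi * xi)
            for xi in x]
```

The global minimum is at the origin, where the function value is zero; the cosine term creates a regular grid of local minima around it, which is what makes the function a useful benchmark for exploration algorithms.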
The trigonometric terrain is a more complex function [99], challenging exploration algorithms in the 2D case, see Fig. figterminotrigoterrain .
The Trigonometric terrain function 
Function value. The function value is :
Gradient. Computed by hand.
Move set to generate a new sample. To generate a new conformation, one picks a random conformation at a predefined distance from the current sample.
To illustrate our sampling algorithms, we use a 69 residue BLN model protein [20], whose landscape has been extensively sampled [113], [91].
The BLN model represents each protein residue as one of 3 types of beads, namely hydrophobic (B), hydrophilic (L) and neutral (N).
Function value: potential energy. The potential energy of the BLN69 model is given by:
Note that the first three terms are bonded terms, while the fourth is the non bonded term (Lennard-Jones potential). Parameter definitions and values are as specified in [113].
Gradient. To compute the gradient of the potential energy above, we use the automatic differentiation tool [65].
Move sets to generate new conformations. A move set is an elementary operation by which a new conformation is generated from a given conformation, typically at a predefined distance called the step size, denoted $\delta$. Designing move sets for condensed matter in general and proteins in particular is a topic in itself, as one wishes to avoid useless conformations (e.g., steric clashes).
Three classical move sets, illustrated here for BLN69, are the following ones:
global move set: the new conformation is chosen uniformly at random on the sphere of radius $\delta$ (the step size) centered on the current conformation.
interpolation move set: the new conformation is chosen (uniformly at random) on the line segment joining two conformations.
For the sake of clarity, let us detail the atomic move set. Denoting the number of pseudoatoms of the BLN model. and let , with the number of pseudoatoms. Denoting the coordinates of the th atom, the new coordinates are generated uniformly at random on the unit sphere of radius centered . That is, with and uniform random numbers in and :
Note that in applying such a move set, the RMSD between the old and the new conformations is equal to .
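The move sets discussed above can be sketched in Python for conformations stored as flat lists of Cartesian coordinates (x_1, y_1, z_1, x_2, ...). The sampling schemes are standard ones chosen for this illustration: normalized Gaussian vectors for the global move, and a uniform height plus a uniform azimuth for each atomic move. The function names and the parameters `delta` and `radius` are hypothetical and do not reproduce the exact parameterization used in the SBL.

```python
import math
import random

def global_move(conf, delta, rng=random):
    """New conformation uniformly at random on the sphere of radius delta
    centered on `conf`, in the full conformational space."""
    g = [rng.gauss(0.0, 1.0) for _ in conf]
    norm = math.sqrt(sum(x * x for x in g))
    return [c + delta * x / norm for c, x in zip(conf, g)]

def interpolation_move(conf_a, conf_b, rng=random):
    """New conformation uniformly at random on the segment [conf_a, conf_b]."""
    t = rng.random()
    return [(1.0 - t) * a + t * b for a, b in zip(conf_a, conf_b)]

def atomic_move(conf, radius, rng=random):
    """Move each atom uniformly at random on the 3D sphere of the given
    radius centered on its current position."""
    new = []
    for i in range(0, len(conf), 3):
        z = 1.0 - 2.0 * rng.random()         # uniform height in [-1, 1]
        phi = 2.0 * math.pi * rng.random()   # uniform azimuth
        s = math.sqrt(max(0.0, 1.0 - z * z))
        new += [conf[i] + radius * s * math.cos(phi),
                conf[i + 1] + radius * s * math.sin(phi),
                conf[i + 2] + radius * z]
    return new

def rmsd(conf_a, conf_b):
    """Plain RMSD (no alignment) between two flat coordinate lists."""
    n_atoms = len(conf_a) // 3
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(conf_a, conf_b)) / n_atoms)
```

Since every atom is displaced by exactly `radius` in `atomic_move`, the plain RMSD between the old and the new conformation equals `radius`; the least RMSD, minimized over rigid motions, can only be smaller.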
Example conformation of the BLN69 model. The three types of beads are represented as follows: hydrophobic (B) in red, hydrophilic (L) in blue, and neutral (N) in green. Note the formation of a hydrophobic core clubbing the hydrophobic 
Loading structures and geometric models consists of converting structures and geometric models stored in a file into main memory data structures.
As seen from Fig. figterminologyloader, loading involves three ingredients, namely:
In the following, we detail the information contained in the different loadable file formats, and the contexts in which they can be used.
Data flow from files to data structures. The files (first row) containing the data are loaded using loaders (second row) into internal data structures. Then, builders transform these internal data structures into data structures (third row) that are usable by the different components of the SBL. Depending on the context, these data structures may be defined by different models (last row).
The Protein Data Bank is the reference resource for structures of macromolecules and their complexes. For general information on structures, one may consult: Introduction to Biological Assemblies and the PDB Archive
For information on the PDB file format, one may consult:
In the SBL, PDB files are loaded into C++ data structures using the class SBL::Models::T_PDB_file_loader: this class uses the ESBTL for parsing PDB files and uses parsimonious data structures to store the hierarchical information contained in a PDB file. See the documentation of the class SBL::Models::T_PDB_file_loader for a more detailed description of the available options.
A molecule loaded using the class SBL::Models::T_PDB_file_loader is represented by the ESBTL class ESBTL::Molecular_model. Two comments are in order.
Multiple models. First, if a molecule is represented by several models in the PDB file, one instance of ESBTL::Molecular_model is created for each model. These data structures can be accessed using the method SBL::Models::T_PDB_file_loader::get_geometric_model.
Conversions. Depending on the context, the PDB format may have to be converted to a different data format, using a builder. Such a class takes as input an instance of ESBTL::Molecular_model, and fills an output data structure dedicated to a particular context.
For example, in Work Package: Space Filling Models, the data structure can be a container of particles, each particle being represented by a 3D ball – see SBL::Models::T_Atom_with_flat_info_traits::Atom_with_flat_infos_builder . In Work Package: Conformational Analysis, the data structure can be a conformation, that is represented by a Ddimensional point – see SBL::Models::T_Conformation_as_d_point_traits::Conformation_as_d_point_builder .
In the SBL, the applications implement combinatorial, geometric and topological algorithms in a biophysical context. While loading a PDB file gives access to a great number of biophysical properties, one may also run the applications on molecules stored in much more basic formats. The simplest format for representing a particle is a 3D ball (or its bounding 3D sphere). A file listing 3D spheres as follows:
x_1 y_1 z_1 r_1 ...
can be loaded using the loader SBL::Models::T_Spheres_3_file_loader .
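To make the format concrete, here is a small Python reader for such a file. It is not the SBL loader: it assumes one sphere per line, given as four whitespace-separated numbers (center coordinates, then radius), and the function name is hypothetical.

```python
def parse_spheres(text):
    """Parse lines of the form 'x y z r' into a list of (x, y, z, r) tuples."""
    spheres = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 4:
            raise ValueError(f"line {line_number}: expected 4 values, got {len(fields)}")
        x, y, z, r = (float(f) for f in fields)
        spheres.append((x, y, z, r))
    return spheres
```

For instance, `parse_spheres("0 0 0 1.5\n1 0 0 1.7")` returns two spheres whose centers are one unit apart.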
The class SBL::Models::T_Geometric_particle_traits defines a particle reduced to its geometric representation. Then, it is possible to annotate these geometric particles using the annotators provided by the package ParticleAnnotator, as explained in section Decorating Models .
When dealing with conformations, the same issue as in section Loading 3D spheres arises. The simplest format for representing a conformation is a D-dimensional point, that is, the concatenation of all the coordinates of the particles of a molecule. A file listing D-dimensional points as follows:
6 x_1 y_1 z_1 x_2 y_2 z_2 ...
can be loaded using the loader SBL::Models::T_Points_d_file_loader .
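As for spheres, a small Python reader makes the format concrete. It is not the SBL loader: it assumes one point per line, where the first value is the dimension D and is followed by exactly D coordinates, as suggested by the example above; the function name is hypothetical.

```python
def parse_points_d(text):
    """Parse lines of the form 'D c_1 ... c_D' into lists of D floats."""
    points = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        dimension = int(fields[0])
        coordinates = [float(f) for f in fields[1:]]
        if len(coordinates) != dimension:
            raise ValueError(
                f"line {line_number}: expected {dimension} coordinates, "
                f"got {len(coordinates)}")
        points.append(coordinates)
    return points
```

A conformation of two particles is thus stored as a 6-dimensional point, and grouping the coordinates by three recovers the individual particle positions.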
The class SBL::Models::T_Geometric_conformation_traits defines a conformation reduced to its geometric representation.
When using the programs of the SBL, one may want to decorate the particles of the input molecule(s) to analyze the output as a function of some properties. For example, to compute the ratio of buried residues in a protein that are hydrophobic, one may use the program . The output provides the exposure of each atom, each atom being decorated by the residue containing it. However, since the hydrophobicity of a residue is not available from a PDB file, one has to add this information.
The SBL offers the possibility to dynamically annotate the particles of a molecule with user-defined properties. The term dynamical refers to the possibility to load new properties when starting the program, and to decorate each particle with these new properties. It is opposed to static, which refers to properties that already decorate the particles but can be modified (e.g., the atomic group radii). Therefore, an atom that is reported in an output file will be reported with all its annotations.
In all the programs of Space Filling Model, dynamical properties can be loaded using the option –annotationsfile <path/to/file>. A file describing such properties has to follow simple rules, as shown in the following example:
# First line: 3 keywords i.e. (i) annotation name (ii) key composition (iii) type of annotation
# Subsequent lines:
hydrophobicity RES_NAME char
ALA H
ARG C
GLU P
...
For more details on annotations, see the package ParticleAnnotator.
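The layout of such a file can be illustrated with a short Python reader. It is not the ParticleAnnotator implementation: it assumes that lines starting with # are comments, that the first remaining line holds the three keywords, and that on each subsequent line the last field is the value while the preceding fields form the key. The function name is hypothetical.

```python
def parse_annotations(text):
    """Return (name, key composition, type, {key: value}) from an annotation file."""
    rows = [line.split() for line in text.splitlines()
            if line.strip() and not line.lstrip().startswith("#")]
    name, key_composition, kind = rows[0]
    values = {" ".join(row[:-1]): row[-1] for row in rows[1:]}
    return name, key_composition, kind, values

example = """# annotation name, key composition, type of annotation
hydrophobicity RES_NAME char
ALA H
ARG C
GLU P
"""
name, key_composition, kind, values = parse_annotations(example)
```

Once loaded, such a table makes it possible, for example, to count how many buried residues carry the value `H`, which is the analysis mentioned at the beginning of this section.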
The SBL makes intensive use of Boost Serialization for saving and loading the data structures of the different programs. In Boost Serialization,
"the term serialization means the reversible deconstruction of an arbitrary set of C++ data structures to a sequence of bytes."
The file containing this sequence of bytes is called an archive. There are three main file formats for an archive: plain text, binary and XML.
In the SBL, the serialization is used for two different goals:
for saving a data structure in an archive, that will possibly be loaded by another program of the SBL,
Since input files for PALSE are XML files, all the archives in the SBL are XML files. In the following, two issues are discussed:
some data structures cannot be fully saved into an archive, and cannot be loaded again – see section Partial Serialization,
While the serialization is a powerful framework for saving and loading data structures, it is not always possible to do so due to the complexity of some data structures. In the following, we introduce the terminology to handle such cases.
In order to serialize an atom as in the previous example, one has to serialize the residue containing it. The process may be recursive, unwinding the hierarchy residue > chain > model > molecular structure.
A is hierarchical iff A has an attribute of type B such that A borrows B and B owns A.
A data structure is termed partially serializable when it cannot be serialized because it is hierarchical. In other words, when the data structure in memory is saved into a file, the pieces of information saved are not sufficient to reconstruct the data structure in main memory from that file. In our previous example, the atom will not be able to access the information on its residue.
Partially serializable data structures also use Boost Serialization for saving data: the only difference with serializable data structures is that they cannot be loaded from a file.
Another problem encountered in serializing data is the amount of information contained within an archive.
For example, an archive listing the atoms of a molecule contains all the properties of each atom, so that the atom can be reconstructed in memory. However, when analyzing the output of a program of the SBL with PALSE, the different analyses may not require all the pieces of information contained in the archive.
As a second example, consider an archive listing thousands of conformations of a molecule involving of the order of one thousand atoms: the corresponding archive contains millions of lines and is hard to navigate through.
A solution provided by the SBL consists in storing the information contained in a data structure in at least two different archives:
the main archive contains reduced information where the heavy part is replaced by a simple index,
Saving in multiple archives. The class SBL::IO::T_Multiple_archives_serialization_xml_oarchive provides functionalities to save a serializable data structure into multiple archives. There are two ways to store a serializable data structure:
by providing the paths to both the main and the secondary archives when constructing an instance of SBL::IO::T_Multiple_archives_serialization_xml_oarchive,
by providing only the path to the main archive, in which case the information that is not in the main archive is lost.
In the latter case, the data structure cannot be loaded since the information is lost.
Loading from multiple archives. The class SBL::IO::T_Multiple_archives_serialization_xml_iarchive provides functionalities to load a serializable data structure from multiple archives. The only way to load a serializable data structure is to specify the path to the main and secondary archives when constructing an instance of SBL::IO::T_Multiple_archives_serialization_xml_iarchive .
For more detailed information on multiple archives serialization, see the package Multiple_archives_serialization .