Structural Bioinformatics Library
Template C++ / Python API for developing structural bioinformatics applications.
Terminology and Concepts

In the following, we present the main concepts and the terminology used in the SBL.

We now introduce the general terminology used throughout the library.

Generalities: Molecular Structures, Geometry and Systems

Molecular structure

A molecular structure is a representation of one or more molecules made of particles, namely atoms, or pseudo-atoms for coarse-grain models. The description of a molecular structure has two components, the Molecular Geometric Model and the Molecular System, as described below.

In short:

Molecular structure = Molecular Geometric Model + Molecular System.

Molecular Geometric Model

By molecular geometric model, we refer to the three-dimensional arrangement of the particles constituting a molecule or a complex.

Practically, we consider two classes of representations for molecular geometries:

For the input / output operations on these molecular geometric models, see section Conformation Loaders .

In the SBL, models are typically used in applications gathered in eponymous parts. For example, space filling models, defined in section Work Package: Space Filling Models, are used in the part of the library using such models i.e. Space Filling Model.

Molecular Systems

A molecular system is a grouping of the particles of a molecular structure. Such groupings are used when the groups formed have specific properties, or when one wishes to investigate interactions between such groups.

Given a molecular structure, a set of labels $ L $, together with a mapping from the particles to the set $ L $, the molecular system associated to a label $ l $ is the set of all the particles having $ l $ as label.

The labels may have a hierarchical structure, in which case they are represented as (a forest of) trees. In that case, we distinguish between primitive labels and hierarchical labels. The set of all primitive labels is denoted $ P $, and given a label $ l $, the set of primitive labels in the sub-tree rooted at $ l $ is denoted $ P(l) $. Note that if $ l $ is a primitive label, then $ P(l)=\{l\} $.

Given a list of particles $ \cal A $, $ L $ induces a unique partition of $ \cal A $, so that the particle classifier associated with $ L $ is the mapping from $ \cal{A} $ to the set of all primitive labels $ P $: $ f_L: {\cal A} \rightarrow P $.

In this respect, the molecular system associated with a label $ l $, system( $ l $) for short, is the pre-image of $ P(l) $ under the particle classifier $ f_L $, i.e. $ f_L^{-1}(P(l)) = \cup_{x\in P(l)} f_L^{-1}(x) $.
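The definition of a molecular system as a pre-image can be illustrated with a minimal Python sketch. The label forest, the toy particles and the function names below are assumptions made for illustration; they are not part of the SBL API.

```python
# Hypothetical label forest: each label maps to its child labels.
# Labels without children are primitive.
LABEL_TREE = {
    "complex": ["antibody", "antigen"],
    "antibody": ["H", "L"],
    "antigen": ["A"],
    "H": [], "L": [], "A": [],
}

def primitive_labels(label, tree=LABEL_TREE):
    """Return P(label): the primitive labels in the sub-tree rooted at label."""
    children = tree[label]
    if not children:
        return {label}
    result = set()
    for child in children:
        result |= primitive_labels(child, tree)
    return result

def system(label, particles, classifier, tree=LABEL_TREE):
    """Return the pre-image of P(label) under the particle classifier."""
    targets = primitive_labels(label, tree)
    return [p for p in particles if classifier(p) in targets]

# Toy particles: (atom name, chain identifier); the classifier reads the chain.
particles = [("CA", "H"), ("CB", "H"), ("CA", "L"), ("CA", "A")]

def chain_of(particle):
    return particle[1]

antibody_system = system("antibody", particles, chain_of)
```

Here the system associated with the hierarchical label "antibody" gathers the particles of chains H and L, while the system of a primitive label such as "A" reduces to the particles of that chain.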

As a simple example, one may consider a set of labels composed of the names of the chains of a protein complex whose structure is stored in the PDB format. In that case, each polypeptide chain defines a molecular system. Such labels may be used to study the interface between the proteins defining the complex, using the program Space_filling_model_interface_finder from the application $ \text{\bifED} $.


Concept defining labels to identify particles in a molecular system associated to a molecular structure


Practically, a number of classifiers defining molecular systems are provided. These classifiers follow a generic pattern (a C++ concept to perform a hierarchical decomposition), as defined in the package MolecularSystemLabelsTraits.

Particle Annotations

In the SBL, each particle is annotated with properties. Depending on the context, the annotations may be compulsory or optional.

A compulsory annotation is one for which memory space is allocated within each particle at compile time. As the name suggests, such annotations are mandatory in a given context. See section Compulsory Annotations for more details.

As an example, one may consider the case of particles represented by 3D balls: annotating such particles with a radius is mandatory. See section Atomic Radii and Group Radii for more details.

Conversely, optional annotations are loaded on the fly, with no storage reserved at compile time. Such annotations are typically used to further analyze the results of an SBL program. Optional annotations are dynamic, i.e. they are loaded while running the SBL program, using the option --annotations-file. Note that any number of annotations can be loaded by repeating the option for each annotation file.

As an example of optional annotations, one may consider solvation parameters, on a per particle (atom) basis. See section Optional Annotations for more details.

The annotation system is described precisely in the package ParticleAnnotator.

The ESBTL Framework

The SBL relies on the Easy Structural Biology Template Library (ESBTL) to load the pieces of information contained in a PDB file into SBL data structures. The ESBTL library is highly adaptable when it comes to the input source format.

It defines a hierarchical representation of a protein, from the whole molecule to each atom of the protein. Each level of the hierarchy is represented by a data structure, which may be replaced by a custom one. When loading the data from a PDB file, it is possible to define which fields of a PDB file will be used, and how the different data structures will be filled with these fields.

The hierarchy of data structures in the ESBTL library, from bottom to top, is the following one:

  • the molecular atom : the bottom of the hierarchy, meant to represent an atom in a molecule; by default, all the fields stored in a PDB file are stored.

  • the molecular residue : a container of molecular atoms representing a residue. In particular, it is possible to visit all the atoms in the residue.

  • the molecular chain : a container of molecular residues representing a polypeptide chain. In particular, it is possible to visit all the residues in the chain.

  • the molecular model : a container of molecular chains representing a geometric model of a molecule or complex, a model being determined by a set of coordinates. For example, if the PDB file contains several NMR models, there will be as many ESBTL models. Given a model, an iterator to visit its chains is provided.

  • the molecular system : the root representation. In the simplest setting, the root corresponds to the whole molecule or complex. But as seen earlier with labels, the molecule or complex may have been split into pieces, in which case there are as many systems as pieces.

    Given a system, an iterator to visit its models is provided.

(Advanced) There are subtle differences between the SBL and the ESBTL definitions of models and systems. In particular, the SBL has a notion of hierarchy for labels, which is not the case for ESBTL systems. Also, models from the ESBTL are strongly coupled to NMR models, while those from the SBL are more general.


The ESBTL also offers the possibility to visit the secondary structure elements (SSE) when they are available from the input PDB file. The SSE are containers of residues accessible from the molecular chains. In particular, it is possible to query the type of an SSE according to the PDB nomenclature, or to visit all residues or atoms in the SSE.

Work Package: Space Filling Models

A space filling model is a molecular geometric model where each particle is represented by a 3D ball. Such models are of special interest to represent molecular surfaces and volumes, as well as interfaces. In the SBL, applications using space filling models are grouped in the Applications page Space Filling Model .

In the sequel, we review some basic facts about these models. For a discussion of such models from the biophysical standpoint, the reader may consult [61] as well as [60] and the references therein. For a treatment of selected aspects of such models, $ \alpha $-shapes in particular, the reader may consult [55].

Space Filling Models: Notations

A finite family of balls is denoted $ \balls $, the $ i $-th ball is denoted $ \balli $, and its bounding sphere $ \spherei $. The union of the balls in $ \balls $ is denoted $ \sfmballs $:

$ \sfmballs = \bigcup_\balls\balli $.

The boundary operator is denoted $ \boundary{} $. For example, $ \boundary{\balli} $ is exactly the sphere $ \spherei $. Also, the boundary of the space filling model of $ \balls $ is denoted $ \boundary{\sfmballs} $.

Atomic Radii and Group Radii

In a space filling model, the radius of a particle typically depends on the atomic type and on its covalent environment.

In dealing with crystal structures, the radii may be adapted, depending on the presence or not of hydrogen atoms. When not all H atoms have been reported, which is in general the case, a common strategy consists of using so-called group radii: the group radius of a heavy atom accounts for its own size together with that of the H atoms it is covalently bonded to, see [37].

In the SBL, the radii of particles are annotations attached to the particles, loaded from a file specifying the radius for each particle type. In particular, the class SBL::Models::T_Radius_annotator_for_particles_with_annotated_name allows:

For atoms, two sets of group radii are available: those from Chothia et al. in 1975 [38] (the default ones, available here), and those from Tsai et al. in 1999 [109] (available here). When using pseudo-atoms representing the residues, there is one radius per residue type. The radius of an amino acid is computed as the radius of the 3D ball whose volume is the average volume of the residue [98] (available here).
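The relation between the average volume of a residue and the radius of the corresponding pseudo-atom can be sketched as follows. The function name and the numerical value are illustrative assumptions; they are not taken from the SBL radius files.

```python
import math

def radius_from_volume(volume):
    """Radius of the 3D ball whose volume is `volume` (V = 4/3 * pi * r^3)."""
    return (3.0 * volume / (4.0 * math.pi)) ** (1.0 / 3.0)

# Illustrative value only (cubic angstroms), not an SBL parameter.
example_volume = 100.0
example_radius = radius_from_volume(example_volume)
```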

Solvent Accessible Model

The Solvent Accessible Model (SAM) is a molecular geometric model where the particle radii are expanded by the mean radius of a water molecule (circa $ 1.4\AA $). Doing so results in a space filling model in which particles that are nearby in 3D space, yet not covalently linked, intersect.

The Solvent Accessible Surface (SAS) of a SAM is the boundary of the union of the balls defining the SAM. The SAS consists of spherical polygons, circle arcs (found at the intersection of two spheres), and vertices (found at the intersection of three spheres), as defined in the package Union_of_balls_boundary_3. The area of the SAS is called the Solvent Accessible Surface Area (SASA).

Interfaces, Buried Surface Area, and Core-rim Models

Consider two partners $ A $ and $ B $ forming a complex $ C=A\cup B $. One classical way to identify the interface particles of this complex, using the SAM of $ C $, is the following: an interface particle is any particle that contributes to the SAS of its own subunit, part (possibly all) of this exposed surface being covered by the partner in the complex $ C $. The buried surface area (BSA) of the complex is defined by:

$ BSA(A \cup B) = \marea{SAS(A)}+\marea{SAS(B)} - \marea{SAS(A\cup B)} $

See also the Buried_surface_area package for more details.
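The BSA formula can be illustrated numerically. The sketch below estimates SAS areas by sampling points on each expanded sphere (a Shrake-Rupley style approximation) and is therefore not the exact computation performed by the packages Union_of_balls_boundary_3 and Buried_surface_area. The function names, the number of sample points, and the representation of balls as (x, y, z, radius) tuples are assumptions.

```python
import math

def sphere_points(n):
    """Quasi-uniform points on the unit sphere (golden spiral)."""
    golden = math.pi * (3.0 - math.sqrt(5.0))
    pts = []
    for k in range(n):
        z = 1.0 - (2.0 * k + 1.0) / n
        r = math.sqrt(1.0 - z * z)
        pts.append((r * math.cos(golden * k), r * math.sin(golden * k), z))
    return pts

def sasa(balls, probe=1.4, n=500):
    """Approximate SAS area of balls given as (x, y, z, radius) tuples."""
    unit = sphere_points(n)
    expanded = [(x, y, z, r + probe) for (x, y, z, r) in balls]
    total = 0.0
    for i, (xi, yi, zi, ri) in enumerate(expanded):
        exposed = 0
        for (ux, uy, uz) in unit:
            px, py, pz = xi + ri * ux, yi + ri * uy, zi + ri * uz
            # A sample point is buried if it lies inside another expanded ball.
            buried = any(
                (px - xj) ** 2 + (py - yj) ** 2 + (pz - zj) ** 2 < rj * rj
                for j, (xj, yj, zj, rj) in enumerate(expanded) if j != i
            )
            if not buried:
                exposed += 1
        total += 4.0 * math.pi * ri * ri * exposed / n
    return total

def buried_surface_area(a, b, probe=1.4, n=500):
    """BSA(A u B) = area(SAS(A)) + area(SAS(B)) - area(SAS(A u B))."""
    return sasa(a, probe, n) + sasa(b, probe, n) - sasa(a + b, probe, n)
```

With this sketch, two partners whose expanded balls do not intersect have a zero BSA, while intersecting partners have a positive one.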

Using this model, the core and the rim of an interface are easily defined [80]: the rim consists of the particles retaining solvent accessibility in the complex, while the core consists of the particles which are buried in the complex.

This model is related to Voronoi models of interfaces [27], [18], and [31], which may be seen as improvements in several respects, in particular:

  • selected particles which are buried in their own sub-unit can be found at the interface.

  • the binary partitioning into a core and a rim gets refined by a notion of depth (shelling order) which measures the distance to the boundary of the interface.

Work Package: Conformational Analysis

In the SBL, topics related to conformational analysis are studied in the more general setting of energy landscapes [113]. The SBL provides numerous tools to sample energy landscapes (EL) and to study the resulting sets of conformations, as detailed in [30] and [100].

The corresponding applications are gathered in the part Conformational Analysis. The corresponding C++ code hinges on the following concepts:

  • Conformations and their representations

  • Energy landscapes and their representations

  • Algorithms to explore energy landscapes

  • Algorithms to compare (sampled) energy landscapes

Conformations

Conformations and their representation in Cartesian or internal coordinates. We consider a macromolecular system $ m $ involving $ s $ atoms, the $ i $-th atom being denoted $ \atomi{m}{i} $. The conformational space of the system is denoted $ \calC $, and its dimension $ d $. A conformation, also called a sample, is a point of $ \calC $.

In Cartesian coordinates, each atom is assigned three coordinates (x, y, z). In internal coordinates (also known as a Z-matrix, see [94]), the atoms of a molecule are instead described using bond distances between two atoms, bond angles between three atoms, and torsion angles between four atoms; see Fig. fig-internal-coordinates. It is possible to switch from one coordinate system to the other by applying a transformation. However, each transformation requires information that is not encoded in the original coordinate system.

Internal coordinates representation: illustration for a system of four atoms
The four atoms are $ a_1, a_2, a_3, a_4 $; they are assumed to be non-coplanar, and the covalent bonds are represented by bold line segments. There are three bond lengths represented by bold solid line segments, two bond angles represented by solid circular arcs, and one torsion angle represented by dashed circular arcs. Note that the torsion angle is the dihedral angle between the plane defined by $ a_1, a_2, a_3 $ and the plane defined by $ a_2, a_3, a_4 $.

Moving from internal coordinates to Cartesian coordinates. This transformation requires fixing the Cartesian coordinates of the first three atoms:

  • The first atom corresponds to the origin of the coordinate system. Practically, its three Cartesian coordinates are set to zero.

  • The second atom is at a fixed distance from the first one (their bond distance), so that there are two degrees of freedom left. Practically, one may set the $ x $ coordinate to the bond distance, and the remaining two coordinates to zero.

  • Finally, the third atom is at a fixed distance from the second one, and makes a fixed angle with the first two (their bond angle). Practically, the $ x $ and $ y $ coordinates are inferred from these constraints, and $ z $ is set to zero.
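The three conventions above can be sketched as follows. This is a minimal illustration; the function name and argument names are assumptions and do not correspond to an SBL class. The bond angle is taken at the second atom and expressed in radians.

```python
import math

def place_first_three(d12, d23, angle_123):
    """Cartesian coordinates of atoms 1, 2, 3 from two bond lengths and the
    bond angle (radians) at atom 2."""
    a1 = (0.0, 0.0, 0.0)                # atom 1 at the origin
    a2 = (d12, 0.0, 0.0)                # atom 2 on the x axis
    a3 = (d12 - d23 * math.cos(angle_123),
          d23 * math.sin(angle_123),
          0.0)                          # atom 3 in the plane z = 0
    return a1, a2, a3
```

The remaining atoms are then placed one at a time from a bond length, a bond angle and a torsion angle relative to three atoms already placed.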

Moving from Cartesian coordinates to internal coordinates. Assume that the topology of the molecule (the bonds) is known. From this topology, internal coordinates can be computed for each bond (its bond length), for each pair of consecutive bonds (their bond angle), and for each triple of consecutive bonds (their torsion angle).

The class SBL::CSB::T_Molecular_internal_coordinates makes it possible to compute each bond length, bond angle and torsion angle individually.
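For illustration, the three kinds of internal coordinates can be computed from Cartesian coordinates with elementary vector algebra. The sketch below is independent of the class SBL::CSB::T_Molecular_internal_coordinates; the function names are assumptions, and the sign convention of the torsion angle is one choice among several in use.

```python
import math

def sub(p, q):
    return tuple(a - b for a, b in zip(p, q))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def norm(u):
    return math.sqrt(dot(u, u))

def bond_length(p1, p2):
    return norm(sub(p2, p1))

def bond_angle(p1, p2, p3):
    """Angle at p2 between the bonds p2-p1 and p2-p3, in radians."""
    u, v = sub(p1, p2), sub(p3, p2)
    c = dot(u, v) / (norm(u) * norm(v))
    return math.acos(max(-1.0, min(1.0, c)))

def torsion_angle(p1, p2, p3, p4):
    """Signed dihedral angle between the planes (p1, p2, p3) and
    (p2, p3, p4), in radians, in [-pi, pi]."""
    b1, b2, b3 = sub(p2, p1), sub(p3, p2), sub(p4, p3)
    n1, n2 = cross(b1, b2), cross(b2, b3)
    return math.atan2(norm(b2) * dot(b1, n2), dot(n1, n2))
```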

The previous description assumes an atomic model, as coarse-grain models require dedicated rules. For example, when the molecule is a protein represented by a coarse-grain model with only the $ C_{\alpha} $ carbons, the topology is trivial, as the only bonds are those defining a path connecting consecutive $ C_{\alpha} $ carbons.


Comparing conformations. In general the distance between two conformations is denoted $ \dCalC{\cdot}{\cdot} $. A default choice is the least root mean square deviation ( $ \lRMSD $), namely the square root of the average squared distance deviation in atom positions, minimized over rigid motions of the system. The $ \lRMSD $ with chirality may also be used.

Let $ C_1 $ and $ C_2 $ be two conformations of a molecule of $ N $ atoms. We denote $ x_{1, 3i}, x_{1, 3i+1}, x_{1, 3i+2} $ the $ x, y, z $ coordinates of the $ i $-th atom of the conformation $ C_1 $. The rigid registration of $ C_2 $ onto $ C_1 $ is denoted $ \hat{C}_2 $, and its coordinates $ \hat{x}_{2, i} $. The least-RMSD of $ C_1 $ and $ C_2 $ is the root mean square distance between the coordinates of $ C_1 $ and $ \hat{C}_2 $, the sum running over the $ 3N $ coordinates:
$ \sqrt{\frac{\sum_{i}(\hat{x}_{2, i} - x_{1,i})^2}{3N}} $


In investigating energy landscapes, a pitfall must be mentioned: two conformations may be separated by relatively small distances, yet separated by large or even insurmountable (enthalpic) barriers.


Conformational ensembles -- aka samplings

Sampling. In the context of the SBL, a sampling refers to a set of conformations. It may also be called a conformational ensemble, even though it does not carry any statistical property.

A conformational ensemble, also called sampling, $ C $ is a set of $ n $ conformations, that is $ C=\{ c_1, \dots, c_n\} $. Whenever a local energy minimizer is available, that is, when the conformations can be quenched, the associated set of quenched conformations is denoted $ C_q $, that is $ C_q = \{ q(c_1), \dots, q(c_n)\} $.

If Cartesian coordinates are used, the conformations in $ C $ can also be aligned on a reference conformation, say $ c_1 $. Aligning the i-th conformation onto $ c_1 $ results in the representation of that conformation denoted $ \alignedconf{c_i} $, and the set of such conformations is denoted:

$ \tilde{C} = \{ \alignedconf{c_1},\dots, \alignedconf{c_n}\}. $

Once a one-to-one correspondence between the atoms of $ c_i $ and $ c_1 $ has been set, aligning $ c_i $ onto $ c_1 $ requires computing the rigid motion yielding the least root mean square deviation between $ c_1 $ and $ c_i $.

Nearest neighbor graph. We define:

A nearest neighbor graph (NNG) of conformations is a graph whose vertices are conformations, with edges joining selected pairs.


That is, a NNG connects conformations in the configuration space $ \calC $ of the system studied. We use NNG in two guises. First, we build a NNG by connecting a sample to its $ k $ nearest neighbors. Second, we build a NNG by connecting a sample to all samples of the ensemble within a given distance $ r $.
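Both constructions can be sketched in a few lines. In this illustration, conformations are plain coordinate tuples and the Euclidean distance stands in for the distance between conformations (for example the least RMSD); the function names are assumptions, and the quadratic-time search is only meant for small ensembles.

```python
import math

def knn_graph(confs, k, dist=math.dist):
    """Edges (i, j), i < j, joining each conformation to its k nearest neighbors."""
    edges = set()
    for i, ci in enumerate(confs):
        others = sorted((dist(ci, cj), j) for j, cj in enumerate(confs) if j != i)
        for _, j in others[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges

def radius_graph(confs, r, dist=math.dist):
    """Edges (i, j), i < j, joining all pairs of conformations at distance at most r."""
    return {(i, j)
            for i in range(len(confs))
            for j in range(i + 1, len(confs))
            if dist(confs[i], confs[j]) <= r}

# Toy ensemble of four 2D "conformations".
confs = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (3.5, 0.0)]
```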

Energy landscapes

Landscape and potential energy landscape (PEL). We define an energy landscape [113], or landscape for short, as a triple:

  • a conformational space $ \calC $,

  • a height function $ h: \calC \rightarrow \Rnt $,

  • a distance function $ \calC \times \calC \rightarrow \Rnt^{+} $.

The height function $ h $ is a mapping from the conformational space $ \calC $ to the real numbers. Given a fixed elevation $ h_0 $, the portion of the landscape located below (resp. above) the elevation $ h_0 $ is called a sublevel set (resp. superlevel set).

A typical case is that where $ h $ represents the potential energy $ V $ of a molecular system. In that case, the landscape obtained is called the potential energy landscape (PEL). The potential energy of a conformation $ c $ is denoted $ V(c) $.


Critical points and their connexions. If the gradient of the height function vanishes at $ c $, the conformation is called a critical point or stationary point. Practically, we shall deal with two types of critical points: local minima, and index one saddles (saddles for short). The function value at a critical point is called the critical value.

For a given minimum, of particular interest is the lowest transition state that directly connects it to a minimum of lower energy:

Consider the index one saddles connected to a given local minimum, and for each such saddle, consider the elevation drop between the saddle and the local minimum. The saddle (if any) of lowest elevation drop leading to a local minimum of lower energy is called the key saddle.
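The selection of the key saddle can be sketched as follows, assuming that each index one saddle is stored together with the two local minima it connects. The data layout, the names and the heights below are made up for illustration.

```python
def key_saddle(minimum, energy, saddles):
    """Return (saddle, elevation drop) for the key saddle of `minimum`,
    or None if no saddle leads to a minimum of lower energy.

    energy  : dict mapping each critical point to its height
    saddles : dict mapping each index one saddle to its two connected minima
    """
    best = None
    for s, (m1, m2) in saddles.items():
        if minimum not in (m1, m2):
            continue
        other = m2 if minimum == m1 else m1
        if energy[other] >= energy[minimum]:
            continue  # the saddle must lead to a minimum of lower energy
        drop = energy[s] - energy[minimum]
        if best is None or drop < best[1]:
            best = (s, drop)
    return best

# Toy landscape: three minima, two saddles, made-up heights.
energy = {"m1": 0.0, "m2": 1.0, "m3": 2.0, "s12": 3.0, "s23": 2.5}
saddles = {"s12": ("m1", "m2"), "s23": ("m2", "m3")}
```

In this toy example the global minimum m1 has no key saddle, since no saddle leads from it to a minimum of lower energy.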


If $ c $ is not a critical point, quenching $ c $ consists of (numerically) following the negative gradient $ -\nabla h $ until a local minimum $ q(c) $ is found. In this case, there exists an integral curve of the gradient vector field $ -\nabla h $ joining $ c $ to $ q(c) $, and one says that $ c $ flows to $ q(c) $.
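As an illustration of quenching, the sketch below follows the negative gradient of the Himmelblau function $ f(x,y) = (x^2+y-11)^2 + (x+y^2-7)^2 $ with a fixed step size. The step size, the tolerance and the function names are assumptions; a practical minimizer would rather use a line search or a quasi-Newton method.

```python
import math

def himmelblau(x, y):
    return (x * x + y - 11.0) ** 2 + (x + y * y - 7.0) ** 2

def himmelblau_gradient(x, y):
    a = x * x + y - 11.0
    b = x + y * y - 7.0
    return (4.0 * x * a + 2.0 * b, 2.0 * a + 4.0 * y * b)

def quench(x, y, step=0.01, tol=1e-10, max_iter=100000):
    """Follow the negative gradient with a fixed step until the gradient
    norm drops below tol (or max_iter steps have been taken)."""
    for _ in range(max_iter):
        gx, gy = himmelblau_gradient(x, y)
        if math.hypot(gx, gy) < tol:
            break
        x, y = x - step * gx, y - step * gy
    return x, y

# The starting point (3.2, 1.8) flows to the local minimum located at (3, 2).
qx, qy = quench(3.2, 1.8)
```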

Our definition of landscape does not embed a move set (see definition bln-movesets), used to generate new conformations of the system. Move sets are typically used in conjunction with exploration algorithms, though. See package Landscape_explorer.


Discrete representations involving samples. Various constructions can be carried out by combining samplings and energies.

A lifted sample is a sample equipped with a real number, called its height. When this number represents the potential energy, a collection of such samples is called a sampled energy landscape.

Also, using the connectivity of a NNG to connect lifted samples results in a lifted NNG. For example, if the height is the potential energy, the lifted NNG defines a network on the PEL.

Discrete representations involving critical (i.e., stationary) points only. Of special interest on a landscape are the local minima and the transition paths connecting them. In the smooth setting, a transition between two local minima corresponds to the existence of an integral curve joining the saddle to the minimum. More specifically, in Morse theory [87], these curves define the so-called unstable manifold of the saddle. Generically, two such curves are found for each saddle – note however that they may end up in the same local minimum, a situation we refer to as a bump transition (Fig. fig-bump-middle-slope-bassin).

A transition graph (TG) associated with a landscape is a graph whose nodes are minima and saddles, with one edge between the minimum $ m_1 $ and the saddle $ \sigma $ if there exists a direct transition path $ (m_1, \sigma, m_2) $ with $ m_2 $ another local minimum.


A compressed transition graph is a TG where the two edges emanating from a saddle are merged.


Note that in a compressed TG, all vertices correspond to local minima, while every edge corresponds to two minima which are connected through a saddle in the TG. As we shall see, such compressed graphs are useful to derive a number of properties of EL, and also to compare them.
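The compression step can be sketched as follows, again assuming that each saddle is stored with the two minima it connects. The data layout and the names are illustrative assumptions; the saddle information is kept on the edges of the compressed graph.

```python
def compress(saddles, energy):
    """Merge the two edges of each saddle into one edge between minima.

    Returns a dict mapping each unordered pair of minima to the list of
    (saddle, height) pairs connecting them. A bump transition yields a
    pair of the form (m, m).
    """
    edges = {}
    for s, (m1, m2) in saddles.items():
        key = tuple(sorted((m1, m2)))
        edges.setdefault(key, []).append((s, energy[s]))
    return edges

# Toy transition graph with made-up heights; two saddles connect m1 and m2.
tg_energy = {"s12": 3.0, "s12b": 3.5, "s23": 2.5}
tg_saddles = {"s12": ("m1", "m2"), "s12b": ("m2", "m1"), "s23": ("m2", "m3")}
compressed = compress(tg_saddles, tg_energy)
```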

For a number of tasks related to the analysis of energy landscapes, it is useful to think of the TG as a bipartite graph:

Given a transition graph, consider the bipartite graph whose vertex sets are the local minima and the index one saddles, and whose edges correspond to the connexions minima-saddles. This bipartite graph is called the restricted Morse-Smale-Witten chain complex (MSW complex for short) of the landscape.


This definition calls for two comments:

  • In computational topology [11], the MSW complex is the mathematical object allowing one to efficiently compute the homology of sublevel sets of a manifold, from a function defined on that manifold – in our case the conformational space and the energy. The MSW complex involves critical points of all indices, but for energy landscapes, we shall mainly use local minima and index one saddles.

  • Practically, our MSW complexes shall in general be associated with molecular data generated by some simulation - exploration method. The MSW complex used might be incomplete if, for example, the exploration method missed saddles (transitions) between local minima.

Selected saddles associated with a transition graph can also be used to define the following:

The disconnectivity graph (DG) associated with a transition graph is a forest of trees such that:
  • the leaves correspond to local minima,
  • the internal nodes correspond to the key saddles.
  • there is one edge for each key saddle, connecting the basins associated with its two local minima.


The following comments are in order:

  • The forest associated with a DG has a single tree when the TG is connected.

  • Generically, in the context of smooth Morse theory, a key saddle is linked to two local minima, which may coincide. Practically, degeneracies where a key saddle is linked to more than two local minima may be encountered.

  • When DG are simplified using energy slices, several key saddles may be merged.

Several important operations can be carried out on landscapes, including:

A loop around a saddle $ s $ may be anchored in a single local minimum $ m $ – a.k.a bump transition
Note that the dotted path is located behind the bump of this fictitious 2D landscape.

The Himmelblau function: landscape
The function is defined by: $ f(x,y) = (x^2+y-11)^2 + (x+y^2-7)^2 $. It has four local minima, four index one saddles, and one local maximum.

Himmelblau: (Compressed) Transition graph
(A) The landscape of Himmelblau, decomposed into the catchment basins of the four local minima. (B) The transition graph, with one node for each critical (i.e., stationary) point. (C) The compressed transition graph, where the information associated with saddles is stored in the edges joining local minima.

Himmelblau: disconnectivity graph
The persistence of each local minimum (apart from the global minimum) is defined here as the lowest barrier height to a lower-energy minimum, i.e. the absolute value of the energy difference between the
The persistence of the global minimum is infinite, since this particular minimum does not possess a key saddle. Persistences, sometimes referred to as prominences, are labelled $ \ebarrier $ here.


Example Landscapes

This section briefly presents selected landscapes used to test our sampling algorithms.

Himmelblau

As a simple illustration, we use the Himmelblau function, see Fig. fig-himmelblau and also Himmelblau's function on Wikipedia.

The Himmelblau function

Function value. The function value is a degree four bivariate polynomial:

$ f(x,y) = (x^2 + y - 11)^2 + (x+y^2-7)^2. $

Gradient. Easily computed by hand.
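For reference, applying the chain rule to the two squared terms of the function value gives the following partial derivatives:

$ \frac{\partial f}{\partial x} = 4x(x^2+y-11) + 2(x+y^2-7), \qquad \frac{\partial f}{\partial y} = 2(x^2+y-11) + 4y(x+y^2-7). $

Both derivatives vanish at $ (3,2) $, where the two terms $ x^2+y-11 $ and $ x+y^2-7 $ are zero.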

Move set to generate a new sample. To generate a new conformation, one picks a random conformation at a predefined distance $ \delta $ from the current sample.

Rastrigin

The Rastrigin function is a classical non-convex function used in optimization benchmarks.

Function value. Denoting $ X $ the d-dimensional vector of coordinates, the function value is defined as follows:

$ f(X) = A d + \sum_{i=1,\dots,d} (x_i^2 - A \cos(2\pi x_i)). $

Gradient. Easily computed by hand.
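For reference, since each term of the sum involves a single coordinate, the $ i $-th partial derivative reads:

$ \frac{\partial f}{\partial x_i} = 2 x_i + 2\pi A \sin(2\pi x_i), \quad i=1,\dots,d. $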

Move set to generate a new sample. To generate a new conformation, one picks a random conformation at a predefined distance $ \delta $ from the current sample.

The trigonometric terrain

The trigonometric terrain is a more complex function [99], challenging exploration algorithms in the 2D case, see Fig. fig-termino-trigo-terrain.

The Trigonometric terrain function

Function value. The function value is:

$ f(x,y) = (x\sin(20y)+y\sin(20x))^2\cosh(\sin(10x)x)+(x\cos(10y)-y\sin(10x))^2\cosh(\cos(20y)y) $

Gradient. Computed by hand.

Move set to generate a new sample. To generate a new conformation, one picks a random conformation at a predefined distance $ \delta $ from the current sample.

Model protein used: BLN69

To illustrate our sampling algorithms, we use a 69 residue BLN model protein [20], whose landscape has been extensively sampled [113], [91].

The BLN model represents each protein residue as one of 3 types of beads, namely hydrophobic (B), hydrophilic (L) and neutral (N).

Function value: potential energy. The potential energy of the BLN69 model is given by:

\begin{align} V = \frac{1}{2} K_r \sum_{i}^{N-1} (R_{i,i+1} - R_\text{e})^2 + \frac{1}{2} K_\theta \sum_{i}^{N-2} (\theta_{i} - \theta_\text{e})^2 \\ + \epsilon \sum_{i}^{N-3} [A_i (1 + \cos \phi_{i}) + B_i(1 + \cos 3\phi_{i})] \\ + 4 \epsilon \sum_{i}^{N-2} \sum_{j=i+2}^{N} C_{i,j} [ (\frac{\sigma}{R_{i,j}})^{12} - D_{i,j} (\frac{\sigma}{R_{i,j}})^6 ]. \end{align}

Note that the first three terms are bonded terms, while the fourth is the non-bonded term (Lennard-Jones potential). Parameter definitions and values are as specified in [113].

Gradient. To compute the gradient of expression eq-bln-potential, we use the automatic differentiation tool $ \text{\tapenade} $ [65].

Move sets to generate new conformations. A move set is an elementary operation that generates a new conformation from a given conformation, typically at a predefined distance called the step size and denoted $ \delta $. Designing move sets for condensed matter in general and proteins in particular is a topic in itself, as one wishes to avoid useless conformations (e.g., steric clashes).

Three classical movesets, illustrated here for BLN69, are the following ones:

  • global moveset: the new conformation is chosen uniformly at random on the sphere of radius $ \delta $ centered on the current conformation.

  • interpolation moveset: the new conformation is chosen (uniformly at random) on the line-segment joining two conformations.

  • atomic moveset: each atom is moved to a point of a sphere centered on its current location.
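The global moveset can be sketched as follows. A point uniformly distributed on a sphere of any dimension is obtained by normalizing a vector of independent Gaussian coordinates; the function name and the representation of a conformation as a flat list of coordinates are assumptions.

```python
import math
import random

def global_move(conf, delta, rng=random):
    """New conformation uniformly distributed on the sphere of radius delta
    centered on conf, obtained by normalizing a Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in conf]
    n = math.sqrt(sum(v * v for v in g))
    return [c + delta * v / n for c, v in zip(conf, g)]

random.seed(0)
current = [0.0] * 6                 # toy conformation: two atoms in 3D
proposal = global_move(current, 0.5)
```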

For the sake of clarity, let us detail the atomic move set. Let $ N $ denote the number of pseudo-atoms of the BLN model, and let $ \eps=\delta/\sqrt{N} $. Denoting $ (x_i,y_i,z_i) $ the coordinates of the $ i $-th atom, the new coordinates are generated uniformly at random on the sphere of radius $ \eps $ centered at $ (x_i,y_i,z_i) $. That is, with $ u $ and $ z $ uniform random numbers in $ [0,1] $ and $ [-1,1] $ respectively:

\begin{equation} \begin{cases} x^{'}_i &= x_i + \eps \sqrt{1-z^2} \cos 2\pi u,\\ y^{'}_i &= y_i + \eps \sqrt{1-z^2} \sin 2\pi u,\\ z^{'}_i &= z_i + \eps z. \end{cases} \end{equation}

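Note that in applying such a move set, each pseudo-atom moves by $ \eps $, so that the Euclidean distance between the old and the new conformations, seen as points of the conformational space, is equal to $ \delta $.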
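A minimal sketch of the atomic move set is given below, drawing for each atom one random number in $ [0,1] $ and one in $ [-1,1] $. The function name and the representation of a conformation as a list of (x, y, z) tuples are assumptions; the variable w plays the role of the random number in $ [-1,1] $.

```python
import math
import random

def atomic_move(coords, delta, rng=random):
    """Move each atom to a uniformly random point of the sphere of radius
    eps = delta / sqrt(N) centered on its current position."""
    eps = delta / math.sqrt(len(coords))
    new = []
    for (x, y, z) in coords:
        u = rng.uniform(0.0, 1.0)
        w = rng.uniform(-1.0, 1.0)
        s = math.sqrt(1.0 - w * w)
        new.append((x + eps * s * math.cos(2.0 * math.pi * u),
                    y + eps * s * math.sin(2.0 * math.pi * u),
                    z + eps * w))
    return new

random.seed(1)
old = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
moved = atomic_move(old, 0.8)
```

In this example each of the four atoms is displaced by 0.4, and the Euclidean distance between the two conformations, seen as 12-dimensional points, is 0.8.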

Example conformation of the BLN69 model
The three types of beads are represented as follows: hydrophobic (B) in red, hydrophilic (L) in blue, and neutral (N) in green. Note the formation of a hydrophobic core clubbing the hydrophobic

Generalities: Loading Structures and Geometric Models

Loading structures and geometric models consists of converting structures and geometric models stored in a file into main memory data structures.

As seen from Fig. fig-terminology-loader, loading involves three ingredients, namely:

  • The input file
  • The loader
  • The targeted data structure(s).

In the following, we detail the information contained in the different loadable file formats, and the contexts in which they can be used.

Data flow from files to data structures
The files (first row) containing the data are loaded using loaders (second row) into internal data structures. Then, builders transform these internal data structures into data structures (third row) that are usable by the different components of the SBL. Depending on the context, these data structures may be defined by different models (last row).

Loading PDB Files

The Protein Data Bank is the reference resource for structures of macromolecules and their complexes. For general information on structures, one may consult: Introduction to Biological Assemblies and the PDB Archive.

For information on the PDB file format, one may consult:

In the SBL, PDB files are loaded into C++ data structures using the class SBL::Models::T_PDB_file_loader: this class uses the ESBTL to parse PDB files, and uses parsimonious data structures to store the hierarchical information contained in a PDB file. See the documentation of the class SBL::Models::T_PDB_file_loader for a more detailed description of the available options.

A molecule loaded using the class SBL::Models::T_PDB_file_loader is represented by the ESBTL class ESBTL::Molecular_model. Two comments are in order.

Multiple models. First, if a molecule is represented by several models in the PDB file, one instance of ESBTL::Molecular_model is created for each model. These data structures can be accessed using the method SBL::Models::T_PDB_file_loader::get_geometric_model.

Conversions. Depending on the context, the PDB format may have to be converted to a different data format, using a builder. Such a class takes as input an instance of ESBTL::Molecular_model, and fills an output data structure dedicated to a particular context.

For example, in Work Package: Space Filling Models, the data structure can be a container of particles, each particle being represented by a 3D ball; see SBL::Models::T_Atom_with_flat_info_traits::Atom_with_flat_infos_builder. In Work Package: Conformational Analysis, the data structure can be a conformation, that is represented by a D-dimensional point; see SBL::Models::T_Conformation_as_d_point_traits::Conformation_as_d_point_builder.

Loading 3D spheres

In the SBL, the applications implement combinatorial, geometric and topological algorithms in a biophysical context. While loading a PDB file gives access to a great number of biophysical properties, one may also run the applications on molecules stored in much more basic formats. The simplest format for representing a particle is a 3D ball (or its bounding 3D sphere). A file listing 3D spheres as follows:

x_1 y_1 z_1 r_1
...

can be loaded using the loader SBL::Models::T_Spheres_3_file_loader .
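To make the file layout concrete, here is a minimal Python sketch that reads such a listing. It is not the SBL loader: the names Sphere and parse_spheres are hypothetical, and skipping blank lines and rejecting negative radii are assumptions of this sketch.

```python
from typing import List, NamedTuple


class Sphere(NamedTuple):
    x: float
    y: float
    z: float
    r: float


def parse_spheres(text: str) -> List[Sphere]:
    """Parse one sphere per line, as four whitespace-separated numbers x y z r."""
    spheres = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # assumption: blank lines are ignored
        if len(fields) != 4:
            raise ValueError(f"line {line_no}: expected 4 values, got {len(fields)}")
        x, y, z, r = (float(value) for value in fields)
        if r < 0:
            raise ValueError(f"line {line_no}: negative radius {r}")
        spheres.append(Sphere(x, y, z, r))
    return spheres


example = "0.0 0.0 0.0 1.5\n1.0 2.0 3.0 1.8\n"
spheres = parse_spheres(example)
```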

The class SBL::Models::T_Geometric_particle_traits defines a particle reduced to its geometric representation. Then, it is possible to annotate these geometric particles using the annotators provided by the package ParticleAnnotator, as explained in section Decorating Models .

Loading D-dimensional points

When dealing with conformations, the same problem as in section Loading 3D spheres occurs. The simplest format for representing a conformation is a D-dimensional point, that is, the concatenation of the coordinates of all the particles of a molecule. A file listing D-dimensional points as follows:

6 x_1 y_1 z_1 x_2 y_2 z_2
...

can be loaded using the loader SBL::Models::T_Points_d_file_loader .
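A minimal Python sketch of a reader for this layout follows. It is not the SBL loader. It assumes, based on the example line above, that the first number of each line is the dimension D and that the D coordinates follow; the function names are hypothetical.

```python
from typing import List, Tuple


def parse_points_d(text: str) -> List[List[float]]:
    """Parse lines of the form: D c_1 ... c_D."""
    points = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # assumption: blank lines are ignored
        dimension = int(fields[0])
        coordinates = [float(value) for value in fields[1:]]
        if len(coordinates) != dimension:
            raise ValueError(
                f"line {line_no}: announced {dimension} coordinates, "
                f"found {len(coordinates)}"
            )
        points.append(coordinates)
    return points


def as_particles(point: List[float]) -> List[Tuple[float, float, float]]:
    """Split a conformation back into (x, y, z) triples, one per particle."""
    if len(point) % 3 != 0:
        raise ValueError("dimension is not a multiple of 3")
    return [tuple(point[i:i + 3]) for i in range(0, len(point), 3)]


points = parse_points_d("6 0 0 0 1 2 3\n")
particles = as_particles(points[0])
```

Here a point of dimension 6 corresponds to a molecule with two particles.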

The class SBL::Models::T_Geometric_conformation_traits defines a conformation reduced to its geometric representation.

Decorating Models

When using the programs of the SBL, one may want to decorate the particles of the input molecule(s), so as to analyze the output as a function of some properties. For example, to compute the ratio of buried residues in a protein that are hydrophobic, one may use the program $ \text{\vorlumeEP} $. The output provides the exposure of each atom, each atom being decorated by the residue containing it. However, since the hydrophobicity of a residue is not available from a PDB file, one has to add this information.

The SBL offers the possibility to dynamically annotate the particles of a molecule with user-defined properties. The term dynamical refers to the possibility to load new properties when starting the program, and to decorate each particle with these new properties. It is opposed to static, which refers to properties that already decorate the particles but can be modified (e.g., the atomic group radii). As a result, an atom reported in an output file is reported with all its annotations.

In all the programs of Space Filling Model, dynamical properties can be loaded using the option --annotations-file <path/to/file>. A file describing such properties has to follow simple rules, as shown in the following example:

# First line: 3 keywords i.e. (i) annotation name (ii) key composition (iii) type of annotation
# Subsequent lines: 
hydrophobicity RES_NAME char
ALA H
ARG C
GLU P
...

For more details on annotations, see the package ParticleAnnotator.
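As an illustration of the annotation file layout, the following Python sketch parses the example and uses it to decorate atoms by residue name. It is not the ParticleAnnotator implementation: the function name, the handling of lines starting with # as comments, and the atom tuples are assumptions of this sketch.

```python
from typing import Dict, List, Tuple


def parse_annotations(text: str) -> Tuple[str, str, str, Dict[str, str]]:
    """Return (annotation name, key composition, type, key -> value table)."""
    rows: List[List[str]] = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # assumption: '#' starts a comment line
        rows.append(stripped.split())
    if not rows or len(rows[0]) != 3:
        raise ValueError("first data line must hold exactly 3 keywords")
    name, key, value_type = rows[0]
    table: Dict[str, str] = {}
    for fields in rows[1:]:
        if len(fields) != 2:
            raise ValueError(f"expected 'key value', got {fields}")
        table[fields[0]] = fields[1]
    return name, key, value_type, table


example = """\
# header, then one line per key
hydrophobicity RES_NAME char
ALA H
ARG C
GLU P
"""
name, key, value_type, table = parse_annotations(example)

# Hypothetical atoms given as (atom name, residue name).
atoms = [("CA", "ALA"), ("CB", "GLU")]
decorated = [(atom, residue, table[residue]) for atom, residue in atoms]
```

Every atom of a residue receives the annotation attached to that residue name, which is what the key composition RES_NAME suggests.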

Generalities: IO operations, Serialization

The SBL makes intensive use of Boost Serialization for saving and loading the data structures of the different programs. In Boost Serialization,

"the term serialization means the reversible deconstruction of an arbitrary set of C++ data structures to a sequence of bytes."

The file containing this sequence of bytes is called an archive. There are three main file formats for an archive: plain text, binary and XML.

In the SBL, the serialization is used for two different goals:

  • to save a data structure in an archive, which may later be loaded by another program of the SBL,

  • to analyze the output of programs from the SBL and compute statistics, in particular using the package PALSE .

Since input files for PALSE are XML files, all the archives in the SBL are XML files. In the following, two issues are discussed: partial serialization and multiple archives serialization.

Partial Serialization

While the serialization is a powerful framework for saving and loading data structures, it is not always possible to do so due to the complexity of some data structures. In the following, we introduce the terminology to handle such cases.

Given an instance a of a data structure A, the lifetime of a is the interval of time between its creation and its destruction.


Given a data structure A:
  • A is flat iff the lifetimes of all the attributes of an instance a of A are included in the lifetime of a;
  • A is hierarchical iff A is not flat.


Consider the class SBL::Models::T_Atom_with_flat_info_traits::Particle_type representing an atom: an instance of this class contains only attributes that are initialized when the atom is constructed and destroyed when the atom is destroyed. This class is flat.


Consider the class SBL::Models::T_Atom_with_hierarchical_info_traits::Particle_type representing an atom: an instance of this class contains a reference to the residue containing this atom. The residue is created before the atom, and is destroyed after the atom is destroyed. This class is hierarchical.


In order to serialize an atom as in the previous example, one has to serialize the residue containing it. The process may be recursive, unwinding the hierarchy residue > chain > model > molecular structure.
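The recursion can be pictured with a small Python sketch. The Node class and the function name are hypothetical and unrelated to the SBL classes: each object simply holds a reference to the object containing it, and the function lists everything that must be serialized to fully serialize the starting object.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for an atom, residue, chain, model or structure."""
    kind: str
    container: Optional["Node"] = None  # reference to the enclosing object


def serialization_chain(node: Node) -> List[str]:
    """Follow the container references up to the root of the hierarchy."""
    kinds = []
    current: Optional[Node] = node
    while current is not None:
        kinds.append(current.kind)
        current = current.container
    return kinds


structure = Node("molecular structure")
model = Node("model", structure)
chain_node = Node("chain", model)
residue = Node("residue", chain_node)
atom = Node("atom", residue)

print(" > ".join(serialization_chain(atom)))
# prints: atom > residue > chain > model > molecular structure
```

The object at the root has no container, so its chain is reduced to itself.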

Given instances a and b of two data structures A and B:
  • A owns B iff A has an attribute of type B that is destroyed when A is destroyed;
  • A borrows B iff A has an attribute of type B that is not destroyed when A is destroyed.


Let A be a data structure:
  • A is hierarchical iff A has an attribute of type B such that A borrows B and B owns A.

  • A is hierarchical iff A owns B, B borrows A, and B has an attribute of type A.
  • A is flat iff A is not hierarchical.


Hierarchical data structures occur when a part of the data structure references the whole data structure itself, a dependence that hinders serialization.
For example, in the previous definition, let A be a residue and B be an atom: the residue has a list of atoms, and each atom references information on the residue. Thus, to serialize the atom, the whole residue must be serialized too.


A data structure is termed partially serializable when it cannot be fully serialized because it is hierarchical. In other words, when the data structure in memory is saved into a file, the pieces of information saved are not sufficient to reconstruct the data structure in main memory from that file. In our previous example, the atom will not be able to access the information on its residue.

Partially serializable data structures also use Boost Serialization for saving data: the only difference with serializable data structures is that they cannot be loaded from a file.
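The asymmetry between saving and loading can be illustrated with a minimal Python sketch. It uses the standard xml.etree.ElementTree module rather than Boost Serialization, and the classes, function names and XML layout are hypothetical. Since Python objects are garbage collected, the ownership comments only mimic the C++ semantics described above.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Residue:
    name: str
    atoms: List["Atom"] = field(default_factory=list)  # the residue owns its atoms


@dataclass
class Atom:
    name: str
    residue: Residue  # borrowed: the residue outlives the atom


def save_atom(atom: Atom) -> str:
    """Save the atom alone: only a flat summary of its residue is written."""
    element = ET.Element(
        "atom", {"name": atom.name, "residue_name": atom.residue.name}
    )
    return ET.tostring(element, encoding="unicode")


def read_saved_fields(xml_text: str) -> Dict[str, str]:
    """Read back what was saved; this is not enough to rebuild an Atom."""
    return dict(ET.fromstring(xml_text).attrib)


ala = Residue("ALA")
ca = Atom("CA", ala)
cb = Atom("CB", ala)
ala.atoms.extend([ca, cb])

saved = save_atom(ca)
fields = read_saved_fields(saved)
```

The saved record is perfectly usable for analysis, but the residue object and its list of atoms cannot be recovered from it, which is why such a data structure can be saved but not loaded.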

Note that if A is a hierarchical data structure because of a data structure B, there is no reason for B to be hierarchical. For example, an atom is hierarchical because it references the residue it belongs to, which references its polypeptidic chain, which references its geometric model, which references its molecular structure. The molecular structure is flat because it makes no reference to a data structure that contains it. In order to serialize an atom, one has to fully serialize the molecular structure that contains it.
The notion of hierarchical serialization should not be confused with that of multiple archives serialization, which merely consists of splitting the information into multiple archives.



Multiple Archives Serialization

Another problem encountered in serializing data is the amount of information contained within an archive.

For example, an archive listing the atoms of a molecule contains all the properties of each atom, so that the atom can be reconstructed in memory. However, when analyzing the output of a program of the SBL with PALSE, the different analyses may not require all the pieces of information contained in the archive.

As a second example, consider an archive listing thousands of conformations of a molecule involving on the order of one thousand atoms: the corresponding archive contains millions of lines and is hard to navigate.

A solution provided by the SBL consists in storing the information contained in a data structure in at least two different archives:

  • the main archive contains reduced information where the heavy part is replaced by a simple index,

  • the secondary archive contains all the heavy information, each piece being indexed, so as to be accessed from the main archive.

Consider the program $ \text{\vorlumeEP} $, which outputs a list of atoms, each endowed with information on its contribution to the surface area and volume of the molecule. The main archive consists of a list of indices together with the main contributions of each atom to the aforementioned surface area and volume. The secondary archive consists of the list of atoms (with the pieces of information contained, say, in the input PDB file), each with an associated index.
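The main / secondary split can be sketched in Python as follows. This is only an illustration with the standard xml.etree.ElementTree module: the element names, attribute names and record fields are hypothetical and do not reflect the XML layout written by the SBL classes.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple


def split_into_archives(records: List[Dict]) -> Tuple[ET.Element, ET.Element]:
    """Main archive: index + light data. Secondary archive: index + heavy data."""
    main = ET.Element("main_archive")
    secondary = ET.Element("secondary_archive")
    for index, record in enumerate(records):
        ET.SubElement(main, "contribution", {
            "atom_index": str(index),
            "area": repr(record["area"]),
            "volume": repr(record["volume"]),
        })
        ET.SubElement(secondary, "atom", {
            "index": str(index),
            "name": record["name"],
            "residue": record["residue"],
        })
    return main, secondary


def join_archives(main: ET.Element, secondary: ET.Element) -> List[Dict]:
    """Rebuild the full records by following indices into the secondary archive."""
    atoms = {atom.get("index"): atom for atom in secondary.findall("atom")}
    joined = []
    for item in main.findall("contribution"):
        atom = atoms[item.get("atom_index")]
        joined.append({
            "name": atom.get("name"),
            "residue": atom.get("residue"),
            "area": float(item.get("area")),
            "volume": float(item.get("volume")),
        })
    return joined


records = [
    {"name": "CA", "residue": "ALA", "area": 12.5, "volume": 20.0},
    {"name": "CB", "residue": "GLU", "area": 25.5, "volume": 31.25},
]
main, secondary = split_into_archives(records)

# Statistics that only need the light data never touch the secondary archive.
total_area = sum(float(c.get("area")) for c in main.findall("contribution"))
restored = join_archives(main, secondary)
```

Statistics on areas and volumes only read the main archive, while reconstructing the full records requires both archives.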


Usually, only the information in the main archive is used for further analysis, in particular with PALSE: for that reason, the main archive is generally in XML format. Consequently, there is no particular requirement for the type of the secondary archive. In fact, if the data structure is not to be loaded by another program of the SBL, there is no need for a secondary archive.


Some pieces of information stored in the secondary archive may be necessary to complete some analyses in conjunction with the information contained in the main archive. In that case, to avoid loading the secondary archive, we provide mechanisms to duplicate such pieces of information in the main archive.


Saving in multiple archives. The class SBL::IO::T_Multiple_archives_serialization_xml_oarchive provides functionalities to save a serializable data structure into multiple archives. There are two ways to store a serializable data structure:

  • by providing the paths to both the main and the secondary archives when constructing an instance of SBL::IO::T_Multiple_archives_serialization_xml_oarchive,

  • by providing only the path to the main archive, in which case the information that is not in the main archive is lost.

In the latter case, the data structure cannot be loaded back, since the information missing from the main archive is lost.

Loading from multiple archives. The class SBL::IO::T_Multiple_archives_serialization_xml_iarchive provides functionalities to load a serializable data structure from multiple archives. The only way to load a serializable data structure is to specify the path to the main and secondary archives when constructing an instance of SBL::IO::T_Multiple_archives_serialization_xml_iarchive .

For more detailed information on multiple archives serialization, see the package Multiple_archives_serialization .