Structural Bioinformatics Library
Template C++ / Python API for developing structural bioinformatics applications.
User Manual


Authors: F. Cazals and A. Chevallier and T. Dreyfus


Describing a molecule: topology, geometry, biochemistry

The description of a molecule encompasses several aspects:

  • topology, defined as a molecular graph: the vertices of the graph are the particles, namely atoms, or pseudo-atoms for coarse-grain models; the edges of the graph are the covalent bonds connecting the particles.

    Note that the graph may not be connected, which corresponds to several molecules.

    See also Biomolecules and the PDB format for a discussion of topology issues with respect to molecular data formats.

    The topological representation of a molecule allows one to:

    • iterate over molecules,
    • iterate over particles, bonds, bond angles and torsion angles,
    • report tuples (pairs, triples, quadruples) used to compute potential energies.

    The topology is handled by the class SBL::CSB::T_Molecular_covalent_structure.

  • geometry: the conformation, i.e. the embedding of the molecular graph in three dimensions. The geometric representation of a molecule allows one to convert back and forth between Cartesian and internal coordinate representations, and to compute partial derivatives of potential energies with respect to coordinates when performing energy minimization. The default representation uses Cartesian coordinates, with the class SBL::Models::T_Geometric_conformation_traits::Conformation. These coordinates can be converted into internal coordinates as explained in the package Molecular_coordinates.

  • biochemistry: the properties qualifying the particles, from which potential energies are computed (see the package Molecular_potential_energy). They are handled by the concept ParticleInfo, as detailed below.

Amongst these three aspects, this package solely handles the topological ones.

Covalent data structures: default and optimized

When the covalent structure is used to compute the potential energy and the gradient of the energy of a large number of conformations, it becomes necessary to optimize its traversal – and the calculation of internal coordinates. The base data structure uses a graph for representing it, so that it is easy to visit particles and bonds. However, the following comments are in order:

  • visiting bond angles and torsion angles requires some calculations, since they are not stored within the graph data structure;
  • supplemental filters from the parameters used for computing the energy may reduce the number of particles, bonds or angles to visit: this is particularly the case when 1-3 interactions are ignored, or when a force constant is considered null for a given bond or angle type.

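For these reasons, two covalent data structures are provided: a default one based on a graph representation, and an optimized one. Both are described in the implementation sections below.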
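To illustrate why bond angles and torsion angles require some computation when only particles and bonds are stored, here is a minimal sketch that enumerates them from an adjacency list. The type Molecular_graph_sketch and its members are hypothetical and are not part of the SBL; the enumeration order and the de-duplication rules (a < c for angles, b < c for torsions) are assumptions made for the example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal molecular graph (not the SBL data structure):
// particles are the indices 0..n-1, bonds are stored as an adjacency list.
struct Molecular_graph_sketch {
    typedef std::array<std::size_t, 3> Triple;
    typedef std::array<std::size_t, 4> Quadruple;

    std::vector<std::vector<std::size_t> > adj;

    explicit Molecular_graph_sketch(std::size_t n) : adj(n) {}

    void add_bond(std::size_t u, std::size_t v) {
        assert(u < adj.size() && v < adj.size() && u != v);
        adj[u].push_back(v);
        adj[v].push_back(u);
    }

    // Bond angles (a, b, c): b is bonded to both a and c; each reported once, with a < c.
    std::vector<Triple> bond_angles() const {
        std::vector<Triple> out;
        for (std::size_t b = 0; b < adj.size(); ++b)
            for (std::size_t a : adj[b])
                for (std::size_t c : adj[b])
                    if (a < c) out.push_back(Triple{a, b, c});
        return out;
    }

    // Proper torsions (a, b, c, d) along a path a-b-c-d; each reported once, with b < c.
    std::vector<Quadruple> torsions() const {
        std::vector<Quadruple> out;
        for (std::size_t b = 0; b < adj.size(); ++b)
            for (std::size_t c : adj[b]) {
                if (!(b < c)) continue;
                for (std::size_t a : adj[b]) {
                    if (a == c) continue;
                    for (std::size_t d : adj[c])
                        if (d != b && d != a) out.push_back(Quadruple{a, b, c, d});
                }
            }
        return out;
    }
};
```

For a linear chain of four particles, this sketch reports two bond angles and one torsion. Recomputing these lists for every conformation is the kind of cost that motivates storing them once.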

File formats

Practically, the following file formats are used:

  • The PDB format: the reference format for biomolecules whose structures have been resolved and deposited in the Protein Data Bank (PDB website).

  • For molecules other than peptides and proteins, we use the MOL format – see section Beyond the PDB format and MOL format.

Implementation: default covalent structure – with graph representation

The first implementation provided uses internally a graph data structure to code particles and bonds, which comes with the following pros and cons:

  • (pro) The graph comes with topological operations.
  • (con) Access to the data members (internal coordinates in particular) is not optimized.

Three important data structures are provided by this package:

ParticleInfo concept

This concept is used to represent a particle in the covalent structure and to attach information to it.

Depending on the context, the relevant information and the necessary operations on a particle vary.

For example, the particle could belong to a small molecule or to a polypeptidic chain.

In this latter case, one may want to retrieve a particular atom (e.g. the Calpha carbon) – an irrelevant request if the molecule is not a polypeptide chain.

For this reason, the class SBL::CSB::T_Particle_info_traits< ParticleInfo > provides all the necessary types and operations depending on the model of ParticleInfo. This mechanism relies on C++ template specialization: any structure can be used as a model of the concept ParticleInfo, provided that the class SBL::CSB::T_Particle_info_traits is specialized for this particular model. The required operations depend on the context in which it is used:

  • For any covalent structure, the requirements are given by the class SBL::CSB::T_Molecular_covalent_structure: SBL::CSB::T_Particle_info_traits must define two operators:

    • Less: comparator used to retrieve a vertex in the covalent structure from its info.

      bool operator()(const Particle_info& p, const Particle_info& q)const;

    • Printer: prints the information of the particle into a stream passed as argument.

      void operator()(std::ostream& out, const Particle_info& info)const;

  • For proteins, the requirements are given by the classes SBL::CSB::T_Molecular_covalent_structure_builder_for_proteins and SBL::IO::T_Molecular_covalent_structure_loader_from_PDB: SBL::CSB::T_Particle_info_traits must define three operators:

    • Finder: uses the covalent structure and biochemical data identifying a particle (chain identifier, residue sequence number, insertion code, atom name) to retrieve the corresponding vertex in the covalent structure;

      std::pair<bool, Particle_rep> operator()(const Covalent_structure& S, char chain_id, int res_id, std::string atom_name = "CA")const;

    • Builder: builds an info from the relevant biochemical data (chain identifier, residue sequence number, insertion code, atom name, or an ESBTL atom having all this data) and a terminal tag (positive if N-terminal, negative if C-terminal, null otherwise);

      Particle_info operator()(const Covalent_structure& S, char chain_id, int res_id, const std::string& res_name, const std::string& atom_name, int ter = 0)const;
      Particle_info operator()(const Covalent_structure& S, const Atom_type& a, const std::string& atom_name, int ter = 0)const;

    • Aliases: implements a dictionary of aliases for the atom names in a protein, in order to have a unique atomic naming convention (see atom-naming-convention);

      const std::string& operator()(const std::string& name)const;

  • For other molecules, the requirements are given by the class SBL::IO::T_Molecular_covalent_structure_loader_from_MOL: SBL::CSB::T_Particle_info_traits must define the operator Builder.
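As an illustration of the traits mechanism, the sketch below specializes a traits class for a particle info that is a plain std::string, and provides the two operators required for any covalent structure. The name Particle_info_traits_sketch is hypothetical: it stands in for SBL::CSB::T_Particle_info_traits so that the example compiles on its own, and the actual SBL specialization may differ.

```cpp
#include <iostream>
#include <string>

// Hypothetical stand-in for a traits class. The primary template is only declared,
// so that each model of ParticleInfo must come with its own specialization.
template <class ParticleInfo>
struct Particle_info_traits_sketch;

// Specialization for a particle info that is a plain string.
template <>
struct Particle_info_traits_sketch<std::string> {
    typedef std::string Particle_info;

    // Less: strict ordering used to retrieve a vertex from its info.
    struct Less {
        bool operator()(const Particle_info& p, const Particle_info& q) const {
            return p < q;
        }
    };

    // Printer: writes the info of the particle into the stream passed as argument.
    struct Printer {
        void operator()(std::ostream& out, const Particle_info& info) const {
            out << info;
        }
    };
};
```

A protein-specific model would add the Finder, Builder and Aliases operators in its own specialization, following the signatures listed above.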

Two models of ParticleInfo are provided:

  • As a simple structure, a std::string can be used for representing the information of a particle; this can be done by including the file "SBL/CSB/Particle_info_string.hpp".

Note that some atoms in the covalent structure may have no equivalent in the ESBTL data structure, e.g. because there is no representation of this atom in the loaded file representing the protein. This typically happens for missing hydrogen atoms, or for portions of long side chains unresolved in a crystal structure. To be able to identify an atom from the particle info data structure even without the ESBTL data structure, the chain identifier, residue sequence number, insertion code and atom name are also directly stored within the SBL::CSB::T_Particle_info_for_proteins structure.

Comment on the following case: a system containing a protein and a drug is loaded. Which ParticleInfo should we use? That case is not covered by remark 1.

The properties required for the class used are twofold:
  • comparable: this ensures that each instance of ParticleInfo uniquely identifies a particle;

  • streamable: this ensures that each particle in the covalent structure has an associated name, which is e.g. used when printing the covalent structure as a graph using Graphviz.

The concept can be represented by a simple string representing the name of the particle. When used for proteins, the name is built from the chain identifier, the residue sequence number, the insertion code, and the atom name. In doing so, it is possible to find particles in the covalent structure from biochemical information.
A more complete data structure is provided as SBL::CSB::Particle_info_for_proteins and defines three features:
  • it stores separately all the information required for identifying the particle; this allows the covalent structure to be used in the twin package Molecular_potential_energy, where access to the biochemical information is required for associating the force field parameters;

Representing covalent structures: main data structure

Basically, the class SBL::CSB::T_Molecular_covalent_structure is a wrapper for a boost graph with accessors, modifiers and iterators over the graph. More precisely, it uses the boost subgraph data structure to offer the possibility of clustering the covalent structure, e.g. by polypeptidic chains or by residues. There is a unique template parameter ParticleInfo representing the information attached to a particle, which is a simple string by default. An object of type ParticleInfo should be ordered, streamable and default constructible. Four main functionalities are available.

Building the covalent structure.

Adding a particle to the covalent structure is done using the method SBL::CSB::T_Molecular_covalent_structure::add_particle: this method takes a particle info representing that particle.

Adding a bond is done using the method SBL::CSB::T_Molecular_covalent_structure::add_bond: this method takes two vertices of the graph, and adds, if possible, an edge between the two vertices. It also optionally takes a bond type (single by default, or double) to represent the order of the bond.
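The two building operations can be pictured with the following self-contained sketch. It does not use the SBL or boost: Covalent_structure_sketch and Bond_type_sketch are hypothetical names, and the choices made here (returning the existing vertex when an info is inserted twice, refusing self-bonds and duplicate bonds) are assumptions about what "if possible" may mean, not documented SBL behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

enum class Bond_type_sketch { SINGLE, DOUBLE };

// Hypothetical covalent structure whose particle infos are plain strings.
class Covalent_structure_sketch {
public:
    typedef std::size_t Vertex;

    // Adds a particle; returns the existing vertex if the info is already known.
    Vertex add_particle(const std::string& info) {
        std::map<std::string, Vertex>::const_iterator it = index_.find(info);
        if (it != index_.end()) return it->second;
        const Vertex v = infos_.size();
        infos_.push_back(info);
        index_.emplace(info, v);
        return v;
    }

    // Adds a bond if both vertices exist, differ, and are not already bonded.
    bool add_bond(Vertex u, Vertex v, Bond_type_sketch type = Bond_type_sketch::SINGLE) {
        if (u >= infos_.size() || v >= infos_.size() || u == v) return false;
        const std::pair<Vertex, Vertex> key =
            (u < v) ? std::make_pair(u, v) : std::make_pair(v, u);
        return bonds_.emplace(key, type).second;
    }

    const std::string& info(Vertex v) const {
        assert(v < infos_.size());
        return infos_[v];
    }

    std::size_t number_of_particles() const { return infos_.size(); }
    std::size_t number_of_bonds() const { return bonds_.size(); }

private:
    std::vector<std::string> infos_;                               // info per vertex
    std::map<std::string, Vertex> index_;                          // info -> vertex
    std::map<std::pair<Vertex, Vertex>, Bond_type_sketch> bonds_;  // (min, max) -> type
};
```

The ordered map from info to vertex is where a comparator such as Less would come into play in a generic version.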

Establishing a correspondence with an embedded geometry specified by Cartesian coordinates.

It is possible to map a geometric model over the covalent structure by simply mapping the vertices of the graph to the geometric representation of the particles.

A vertex is termed embedded if it is endowed with Cartesian coordinates. A covalent structure is termed embedded if all its vertices are embedded, and partially embedded otherwise.

The notion of partial embedding is especially important when loading polypeptide chains from PDB files, which in general do not contain coordinates for all atoms (hydrogen atoms, heavy atoms located in flexible regions).

In the SBL, a geometric model is represented by a conformation, that is, a D-dimensional point consisting of the Cartesian coordinates of the particles of a molecule. By default, the ith inserted vertex is mapped to the ith particle in the conformation.

Given an embedded vertex, it is then possible to access its x, y and z coordinates from the covalent structure using the methods SBL::CSB::T_Molecular_covalent_structure::get_x, SBL::CSB::T_Molecular_covalent_structure::get_y and SBL::CSB::T_Molecular_covalent_structure::get_z.
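The notions of embedded vertex and partial embedding can be sketched as follows, assuming a conformation stored as a flat vector of 3N coordinates and an explicit vertex-to-particle map. Embedding_sketch, NOT_EMBEDDED_SKETCH and the member names are hypothetical and only mirror the vocabulary used above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Marker for a vertex without coordinates (hypothetical convention).
const int NOT_EMBEDDED_SKETCH = -1;

struct Embedding_sketch {
    std::vector<double> conformation;     // flat point: x0, y0, z0, x1, y1, z1, ...
    std::vector<int> particle_of_vertex;  // particle index per vertex, or NOT_EMBEDDED_SKETCH

    // A vertex is embedded if it is mapped to a particle of the conformation.
    bool is_embedded(std::size_t v) const {
        return particle_of_vertex[v] != NOT_EMBEDDED_SKETCH;
    }

    // The structure is embedded if all its vertices are, partially embedded otherwise.
    bool is_fully_embedded() const {
        for (std::size_t v = 0; v < particle_of_vertex.size(); ++v)
            if (!is_embedded(v)) return false;
        return true;
    }

    // Coordinate of an embedded vertex along axis 0 (x), 1 (y) or 2 (z).
    double coordinate(std::size_t v, std::size_t axis) const {
        assert(is_embedded(v) && axis < 3);
        return conformation[3 * static_cast<std::size_t>(particle_of_vertex[v]) + axis];
    }

    double get_x(std::size_t v) const { return coordinate(v, 0); }
    double get_y(std::size_t v) const { return coordinate(v, 1); }
    double get_z(std::size_t v) const { return coordinate(v, 2); }
};
```

With the default mapping described above, particle_of_vertex[i] would simply equal i for every embedded vertex.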

Iterating over the covalent structure.

It is possible to iterate over the particles, bonds, bond angles and torsion angles of the covalent structure. There are two modes of iteration: over all the entities, or over all the entities that are mapped with a geometric model. For example, if one wants to compute internal coordinates, it is only possible to do so from particles that are mapped. To switch between the modes, each iteration method takes an optional boolean tag as argument (by default, it iterates over mapped particles).

Consider a set of particles associated with one or several connected components of the covalent structure. The subgraph induced by these particles is returned by SBL::CSB::T_Molecular_covalent_structure::get_sub_structure. The type of this graph is SBL::CSB::T_Molecular_covalent_structure. Note that the stored information is not duplicated: modifying the sub-structure also modifies the parent covalent structure, and modifying the parent covalent structure modifies the sub-structure.

As an application, one can iterate on the individual molecules of a molecular system, as these correspond to the connected components of the covalent structure. To do so, one proceeds as follows:
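Independently of the SBL API, the underlying idea (one molecule per connected component) can be illustrated with a generic sketch on a plain adjacency list. The function molecule_labels_sketch is hypothetical; in the SBL, the particles of each component would instead be passed to get_sub_structure.

```cpp
#include <cstddef>
#include <vector>

// Labels each vertex with the index of its connected component, i.e. of its molecule.
// adj is a hypothetical adjacency list over the vertices 0..n-1.
std::vector<int> molecule_labels_sketch(const std::vector<std::vector<std::size_t> >& adj) {
    std::vector<int> label(adj.size(), -1);  // -1: not visited yet
    int next = 0;
    for (std::size_t s = 0; s < adj.size(); ++s) {
        if (label[s] != -1) continue;
        // Depth-first traversal of the component containing s.
        std::vector<std::size_t> stack(1, s);
        label[s] = next;
        while (!stack.empty()) {
            const std::size_t u = stack.back();
            stack.pop_back();
            for (std::size_t v : adj[u])
                if (label[v] == -1) {
                    label[v] = next;
                    stack.push_back(v);
                }
        }
        ++next;
    }
    return label;
}
```

Grouping the vertices by label yields, for each molecule, the set of particles from which a sub-structure can be requested.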

Displaying the covalent structure.

The covalent structure can be dumped to a file in the Graphviz format through the method SBL::CSB::T_Molecular_covalent_structure::print. For large systems (e.g. large proteins), Graphviz may take a long time to create the output, owing to the high number of particles. Also note that the neato command from Graphviz gives a better rendering than the more classical dot command in this case.
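For readers unfamiliar with the Graphviz dot language, the following sketch shows what such a dump may look like for an undirected graph with labeled vertices. The function to_dot_sketch is hypothetical, and the exact output of the SBL print method (graph name, attributes) may differ.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Returns an undirected graph in the Graphviz dot language.
// Assumes that the particle names contain no double quote.
std::string to_dot_sketch(const std::vector<std::string>& names,
                          const std::vector<std::pair<std::size_t, std::size_t> >& bonds) {
    std::ostringstream out;
    out << "graph covalent_structure {\n";
    for (std::size_t i = 0; i < names.size(); ++i)
        out << "  " << i << " [label=\"" << names[i] << "\"];\n";
    for (std::size_t k = 0; k < bonds.size(); ++k)
        out << "  " << bonds[k].first << " -- " << bonds[k].second << ";\n";
    out << "}\n";
    return out.str();
}
```

Saved as structure.dot, the result can be rendered with a command such as neato -Tpdf structure.dot -o structure.pdf.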

Building covalent structures

Building a covalent structure requires three pieces of information:

  • Attributes for vertices (type of vertices).

  • Attributes for edges (type of covalent bonds).

  • The topology itself. For example, when building the subgraph associated with an amino acid, the topology of this amino acid is required. This information can be implicit, as in the case of amino acids and polypeptide chains, or explicit, i.e. provided in the input file. This accounts for the two types of builders presented below.

In the context of potential energy calculations (see Molecular_potential_energy), molecular dynamics software (e.g. Gromacs) usually uses parameter files that include the topology of the molecules (e.g. the itp file format). We deliberately chose to separate the topology from those parameters, so that analyses and calculations on the covalent structure can be performed independently. In particular, the covalent structure is represented by a graph that can be used as input of combinatorial packages in CADS such as Betti_numbers or Earth_mover_distance.

Implicit builders for polypeptide chains. The class SBL::CSB::T_Molecular_covalent_structure_builder_for_proteins is a functor building the covalent structure from an arbitrary data structure SBL::CSB::T_Molecular_covalent_structure_builder_for_proteins::Structure defining the polypeptidic chains and their residues. More precisely, a protein is represented by a mapping from a chain identifier to the chain itself, each chain being represented by a list of pairs (residue name, residue id). The graph of the covalent structure can be constructed at different scales: (0) with all atoms from all residues, (1) with only the heavy atoms, (2) with only the alpha carbon of each residue, or (3) with pseudo-atoms from the Martini coarse-grain model. The scale is set when the builder is constructed and is 0 by default. The only template parameter Covalentstructure is the covalent structure type.
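The effect of the first three scales on the set of vertices can be sketched as a simple filter over atoms. Everything below is hypothetical (Scale_sketch, Atom_sketch, the use of an element field to detect hydrogens); scale 3 is left out because it requires a mapping from residues to Martini pseudo-atoms that is not described here.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Scales taken from the text: 0 all atoms, 1 heavy atoms, 2 alpha carbons only.
enum Scale_sketch { ALL_ATOMS = 0, HEAVY_ATOMS = 1, CALPHA_ONLY = 2 };

struct Atom_sketch {
    std::string name;     // PDB atom name, e.g. "CA" or "HA2"
    std::string element;  // chemical element, e.g. "C" or "H"
};

// True if the atom is kept as a vertex at the given scale.
bool keep_atom_sketch(const Atom_sketch& a, Scale_sketch scale) {
    switch (scale) {
        case ALL_ATOMS:   return true;
        case HEAVY_ATOMS: return a.element != "H";
        // The element test avoids confusing an alpha carbon with a calcium ion named "CA".
        case CALPHA_ONLY: return a.name == "CA" && a.element == "C";
    }
    return false;
}

std::vector<Atom_sketch> select_atoms_sketch(const std::vector<Atom_sketch>& atoms,
                                             Scale_sketch scale) {
    std::vector<Atom_sketch> kept;
    for (std::size_t i = 0; i < atoms.size(); ++i)
        if (keep_atom_sketch(atoms[i], scale)) kept.push_back(atoms[i]);
    return kept;
}
```

For a glycine residue with seven atoms (N, CA, C, O, H, HA2, HA3), the three scales keep seven, four and one atom respectively.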

The nomenclature for atoms in PDB files can be found here: IMGT / Amino acids: 3D model and atoms nomenclature. Should the PDB parser find an atom whose naming is unknown, a warning is issued.

For sequence alignment purposes, residues in PDB files are identified by their sequence number and their insertion code. In particular, sequence numbers are not necessarily successive, and several residues with the same sequence number and different insertion codes may coexist. For this reason, in the builder, residues are sorted by the atomic serial number of their first atom. In addition, a geometric predicate is used to check whether the atom C of a residue makes a bond with the atom N of the next residue in the sequence. Practically, it is checked that the distance between these atoms is less than 2 Å. If this is not the case, no bond is created and the covalent structure has multiple connected components.
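The geometric predicate amounts to a distance test between the C atom of a residue and the N atom of the next one. A minimal sketch, with hypothetical names (Point_sketch, makes_peptide_bond_sketch) and the 2 Å threshold stated above as default:

```cpp
#include <cmath>

struct Point_sketch { double x, y, z; };

// Euclidean distance between two points, in the unit of the coordinates (Å in PDB files).
double distance_sketch(const Point_sketch& p, const Point_sketch& q) {
    const double dx = p.x - q.x, dy = p.y - q.y, dz = p.z - q.z;
    return std::sqrt(dx * dx + dy * dy + dz * dz);
}

// True if atom C of a residue and atom N of the next residue are close enough to be bonded.
bool makes_peptide_bond_sketch(const Point_sketch& c, const Point_sketch& n,
                               double threshold = 2.0) {
    return distance_sketch(c, n) < threshold;
}
```

A peptide C-N bond is about 1.33 Å long, so consecutive residues pass the test, whereas residues on either side of an unresolved loop typically do not and end up in different connected components.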

Explicit builder for other molecules.

Since loading molecules from the MOL format raises no particular difficulty and requires no specific operation, building the covalent structure from a MOL file is done directly in the loader SBL::IO::T_Molecular_covalent_structure_loader_from_MOL.

Loading atomic resolution covalent structures

Loader from PDB files.

The class SBL::IO::T_Molecular_covalent_structure_loader_from_PDB is a loader as defined in the Module_base package: it loads an input PDB file, uses a builder to build a covalent structure, and maps the vertices of the covalent structure to the loaded atoms by order of their atomic sequence identifier.

The following comments are in order:

Specific atomic naming conventions.

  • Some atom names may follow different conventions. However, it is important for further calculations (e.g. potential energy calculations) to have a unique convention for each atom name. When required, we fixed the naming convention to be the one used within force field parameters. For example, hydrogen atoms bonded to the first carbon in the N-terminal cap called ACE, or in the C-terminal cap called NME, can be named XHH3 or HH3X, where X is replaced by 1, 2 or 3. In Amber force fields, they are named HH3X. To handle such cases, we fixed our convention on HH3X. This label is set by the operator Aliases defined in a specialization of the class SBL::CSB::T_Particle_info_traits.
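The alias mechanism can be pictured as a small dictionary lookup with the signature given earlier for the operator Aliases. The functor Aliases_sketch below is hypothetical and only contains the cap hydrogen example from the text; the actual SBL table is not reproduced here.

```cpp
#include <map>
#include <string>

// Hypothetical alias table mapping alternative atom names to a single convention.
struct Aliases_sketch {
    std::map<std::string, std::string> table;

    Aliases_sketch() {
        // Cap hydrogens (ACE, NME): XHH3 -> HH3X for X in {1, 2, 3}.
        table["1HH3"] = "HH31";
        table["2HH3"] = "HH32";
        table["3HH3"] = "HH33";
    }

    // Returns the conventional name, or the name itself when no alias is known.
    // In the latter case the returned reference is only valid as long as the argument is.
    const std::string& operator()(const std::string& name) const {
        std::map<std::string, std::string>::const_iterator it = table.find(name);
        return it == table.end() ? name : it->second;
    }
};
```

Applying such a functor when particle infos are built ensures that lookups by atom name use a single spelling, whichever convention the input file follows.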

Loader from MOL files.

The class SBL::IO::T_Molecular_covalent_structure_loader_from_MOL is a loader as defined in the Module_base package: it simply loads an input MOL file, directly builds the covalent structure from the loaded data, and maps the vertices to the loaded particles. It has a unique template parameter MolecularCovalentStructure, which is the representation of the covalent structure, by default SBL::CSB::T_Molecular_covalent_structure<> (see package Molecular_covalent_structure).

Loading coarse grain covalent structures

It is possible to set the coarse level used by the builder to create the covalent structure. This class has four template parameters, all having a default type:

Implementation: advanced features

Canonical representations

Manipulating molecular coordinates (package Molecular_coordinates) or computing potential energies (package Molecular_potential_energy) benefits from a canonical representation of internal coordinates.

The method SBL::CSB::T_Molecular_covalent_structure::get_canonical_rep returns the canonical representation of a bond, a bond angle or a dihedral angle.

Canonical representation of a bond. Such a representation is obtained by ordering its two vertices – that is the particle_rep used to represent the vertices.

Canonical representation of a valence angle. Such a representation is obtained by ordering the first and third vertices – since the middle one is imposed.

Canonical representation of a dihedral angle. This case covers two sub-cases (see also the package Molecular_coordinates):

  • Proper angle: second and third vertices sorted;
  • Improper angle: central atom first, remaining three sorted.
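The four rules above can be sketched on integer vertex identifiers as follows. The function names are hypothetical, and one point is an interpretation rather than documented behavior: for a proper dihedral, "second and third vertices sorted" is implemented by reversing the whole quadruple when needed, so that the path a-b-c-d is preserved.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <utility>

typedef std::size_t Rep_sketch;  // stands for the particle_rep of a vertex
typedef std::array<Rep_sketch, 3> Triple_sketch;
typedef std::array<Rep_sketch, 4> Quad_sketch;

// Bond: the two vertices are ordered.
std::pair<Rep_sketch, Rep_sketch> canonical_bond_sketch(Rep_sketch a, Rep_sketch b) {
    return (a < b) ? std::make_pair(a, b) : std::make_pair(b, a);
}

// Valence angle: the first and third vertices are ordered, the middle one is imposed.
Triple_sketch canonical_angle_sketch(Triple_sketch t) {
    if (t[2] < t[0]) std::swap(t[0], t[2]);
    return t;
}

// Proper dihedral: reversed if needed so that the second vertex is smaller than the third.
Quad_sketch canonical_proper_sketch(Quad_sketch q) {
    if (q[2] < q[1]) std::reverse(q.begin(), q.end());
    return q;
}

// Improper dihedral: the central atom is expected first; the remaining three are sorted.
Quad_sketch canonical_improper_sketch(Quad_sketch q) {
    std::sort(q.begin() + 1, q.end());
    return q;
}
```

With such representations, two tuples describing the same bond or angle compare equal, so they can serve as keys when associating parameters with internal coordinates.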

Implementation: optimized covalent structure

The class SBL::CSB::T_Molecular_covalent_structure_optimal< ParticleInfo > is an optimized version of the covalent structure data structure. In particular, all entities (particles, bonds and angles) that are used in the computation of the energy are stored in a vector, ensuring constant-time access.

In order to use this class, one has to first build a covalent structure with the class SBL::CSB::T_Molecular_covalent_structure, then use the class SBL::CSB::T_Molecular_potential_energy_structure_parameters_traits_optimal to build the optimized version of the covalent structure data structure. See the package Molecular_potential_energy for examples and more details.

The functionalities depending on the graph structure which are not used in the potential energy calculation are unavailable with the optimized data structure. This includes using the graph data structure itself, dynamically changing the graph (adding particles or bonds), and printing the covalent structure in the .dot format.


The following example shows how to load a PDB file and build the corresponding covalent structure at a given scale. It then outputs a dot file in the Graphviz format to check the covalent structure.