Structural Bioinformatics Library
Template C++ / Python API for developping structural bioinformatics applications.
User Manual

Molecular_covalent_structure

Authors: F. Cazals and A. Chevallier and T. Dreyfus

The description of molecular systems encompasses several aspects (geometry, topology, biophysics), see SBL: terminology and concepts.

This package solely handles the topology, a.k.a. the molecular covalent structure (MCS), and is organized as follows:

  • section mcs-inst: default instantiations of the MCS data structure,
  • section mcs-particle-info: discusses the models one may use to represent (pseudo-)atoms,
  • section mcs-with-graph: introduces the MCS data structure using a graph,
  • section mcs-builders: presents builders for MCS,
  • section mcs-loaders: presents loaders for MCS,
  • section mcs-adv-features: presents advance features.

Introduction and concepts

Design

In all generality, a PDB or mmCIF file contains proteins and nucleic acids, and possibly other molecular species. We note that proteins and nucleic acids are special cases of biopolymers–see package Linear_polymer_representation, namely molecules assembled by gluing units also called residues. We also note that depending on the biophysical application of interest, various pieces of information may be attached to a given atom.

To present the MCS data structure, we distinguish four levels:

  • Particle representation: used to represent (pseudo-) atoms and associated pieces of information,
  • Molecular covalent data structure: topological representation using a graph by default,
  • MCS builders: build MCS depending on the molecule type (protein, nucleic acid, other) and coarse grain level (atomic, coarse grain),
  • MCS loaders: bridge the gap between input files and MCS using builders.

More specifically:

  1. Atoms/particles: the type of atom/pseudo atom used, and the associated pieces of information. In fact, in the sequel, Atom is referred to as ParticleInfo.

  2. Molecular covalent structure (MCS): the graph used to represent the topology. This graph is typically built from the sequence of units of the molecule (a.a. for proteins, bases for nucleic acids). Therefore, we provide MCS for boths.

  3. The builder: the algorithm building and filling a specific MCS. Builder are therefore dedicated to proteins or nucleic acids which are linear biopolymers – see package Linear_polymer_representation.

  4. The loader: the class in charge of loading a PDB/mmCIF file so as to build the molecules present in it. Because a file may contain both proteins and nucleic acids, a loader is parameterized by a builder for proteins, and a builder for nucleic acids.
Three important data structures are provided by this package :


(Advanced/critical) Atoms and indices

The manipulation of atoms involves the following sets of indices:

  • PDB serial number. The integer used to index the ATOM in a PDB file. Note that entities with a serial number are: ATOM, HETATM, ANISOU, and TER.

  • Atom serial number/atomid. The index assigned to the atom generally by the parser loading the PDB/cif file. May not coincide with the PDB serial number, due in particular to serial numbers of TER tags used to separate chains.

    • Vanilla parser exploiting directly the PDB line: atom serial numbers may not be consecutive (TER tags, missing atoms, etc)
    • Parsing files with libcifpp, see below: atom serial numbers are consecutive and start at 1.

  • Atom linear position: atom position in the structure, once gaps have been removed. A plain linear ordering from 0 to n-1, if there are n atoms. This index is used to access the position i.e. the Cartesian coordinates of the atom in the conformation of the molecule – see below. In using libcifpp, the linear position is the atom serial number minus 1.

  • Particle representation, aka Vertex id: vertex id of the atom in the boost graph used to represent the molecular topology. This type is provided by the graph library used, boost graph in our case.

  • Particle info: the pieces of information associated to an atom.
The understand the distinction between serial number an atomid, consider libcifpp. One finds in src/pdb/pdb2cif.cpp:
class PDBFileParser:
int mAtomID = 0; (class member)
which is incremented per atom in PDBFileParser::ParseCoordinate, and written into atom_site.id:
src/pdb/pdb2cif.cpp: inside ParseCoordinate(int modelNr)
++mAtomID;
getCategory("atom_site")->emplace({ ..., { "id", mAtomID }, ... });
Which shows that the atom_site.id is assigned from an internal counter, not from the PDB serial number.


Comment on this

// Mapping particleInfo to graph vertexId/particleRep
ParticleInfo_to_particleRep_map_type m_particleInfo_to_particleRep;

the order in which particles are created while building the graphs does not in general correspond to the ordering of atoms in the file. Whence:

// Mapping the vertexId/particleRep (an int, actually), to the linear position used in the conformation
std::vector<int> m_particleRep_to_linearPosition;
// Mapping the atom serial number from the molecular system to the particleRep
AtomSerialNumber_to_particleRep_map_type m_atomSerialNumber_to_particleRep;
//std::map<int, Particle_rep> m_atomSerialNumber_to_particleRep;

MCS: creation and post-processing

The molecular covalent structure (MCS) of a biomolecule is a graph, possibly with several connected components. The MCS is created and then edited. Ultimately, a mapping between the topology and geometry (embedding) is realized.

Creating the MCS.

The CS data structure is represented as a boost subgraph. Such a graph contains vertices and edges. Each vertex (containing the information about a particle) is accessed via a so-called particle representation (particle rep), and an edge (modeling a covalent bond) via a bond representation or a pair of particle reps.

The MCS of a biomolecule is built upon the sequence of units (amino acids, bases) found in the input file.

  • PDB/mmCIF file with sequence specification: sequence specification is used as provided.
  • PDB/mmCIF file listing atoms only: the sequence specification is retrieved from resids associated with the atoms.

In both cases, a post-processing step is performed to check that consecutive a.a. in the sequence specification are consecutive by ensuring that the resids are so.

Missing amino acids (aka gaps) are allowed via the SBL::IO::T_Molecular_covalent_structure_loader::set_allow_incomplete_chains(bool) method. In that case, the number of connected components of the chain is 1 plus the number of gaps.


Atoms and embedded atoms.

The covalent structure represents the graph of a molecular system, linking residues for proteins or nucleic acids. However, in a classical structure obtained by say X ray crystallography, selected atoms may be missing – e.g. hydrogen atoms or atoms located in flexible regions. Such are atoms are found in the covalent structure, but their geometry is not accessible. This motivates the following

A vertex is termed embedded if it is endowed with Cartesian coordinates. A covalent structure is termed embedded if all its vertices are embedded, and partially embedded otherwise.


Editing the MCS.

Once created the MCS undergoes the following editing steps:

  • Ionisation and water molecules. Water molecules are added, and (de-)protonation is taken care of in line with the information provided in the PDB/mmCIF files.
  • SS bridges are retrieved from PDB/mmCIF files. Note that SS bonds between different polypeptide chains reduce the number of connected components, while SS within a chain increase the cycle basis size.
SS bonds can also be found geometrically, using the SBL::IO::T_Molecular_covalent_structure_loader::set_ss_bond_search(bool) method. This inference step computes the distances between pairs of S atoms using their coordinates in the PDB/mmCIF file.


Creating maps.

The following maps are then set to make it possible to connect the molecular graph with coordinates. Atomids may not be consecutive. To ease the processing, we define a consecutive version of atomids as follows:

The sorted_atom_serial_number_index is the index of an atom in the sorted list of atom serial numbers.


The maps of interest are created in two steps:

  • Iterate on all atoms of the molecular system (first model), and create the map containing the atomids.
    Note that iterating on the latter map yields a consecutive indexing which is precisely the sorted_atom_serial_number_index.

  • Iterate on all atoms of the molecular system; retrive the particle representation $p$; then, map $p$ to the atomid.

MCS: accessors

Accessors for graphs and subgraphs.

As a graph, the MCS can be accessed and visited using iterators. Broadly, there are two main settings:

  • complete graph: access the MCS as a whole.
  • induced subgraph: access the Induced subgraph defined by an atom set given as a container of particle reps. Note that this set of atoms may belong to one or several chains.
In the induced subgraph case, indices (particle reps) of the vertices belong to the range 0..num_vertices of the subgraph.For convenience, the particular case SBL::CSB::T_Molecular_covalent_structure::get_chain is also provided.


An important piece of information for certain tasks is the enumeration of loops in the MCS. This is e.g. the case when sampling conformations of multiloop structures [65] . This task is taken care of by the package Graph_cycles_basis.


Other accessors.

The MCS data structure offers a panel of iterators, including:

  • particles and bonds,
  • Cartesian coordinates,

Molecular Covalent Structure: instantiations, builders and loaders

Hierarchy of instantiations

To understand the various instantiations provided, let us start with the class used inside nodes of the graph data structure. This class is itself parameterized by the so-called particle traits:

template <class ParticleTraits> struct T_Particle_info_biomolecules;

Various particle traits are provided to accommodate annotation rich instantiations – file Molecular_covalent_structure_builder_instantiations.hpp :

Defines a generic serializable atom with annotations (default is name, radius and optional annotation...
Definition Atom_with_flat_info_and_annotations_traits.hpp:119
Traits class defining atoms traits with annotations (at least name and radius). Traits class defining...
Definition Atom_with_hierarchical_info_and_annotations_traits.hpp:131
Traits class defining atoms traits (biophysical and geometric properties). Traits class defining atom...
Definition Atom_with_hierarchical_info_traits.hpp:194
Traits class defining atoms traits (biophysical and geometric properties). Traits class defining atom...
Definition Atom_with_flat_info_traits.hpp:599

Using the previous yields four lineages of instantiations, referred to as FIT, FIAT, HIT, and HIAT. Here is the lineage of instantiations using the FIT:

// 1. Default atom
// 2. Associated covalent structure
// 3. Builder for proteins and nucleic acids, and then biomolecules
typedef SBL::CSB::T_Molecular_covalent_structure_builder_for_nucleic_acids<SBL::CSB::MCS_FIT> MCS_builder_NA_FIT;
typedef SBL::CSB::T_Molecular_covalent_structure_builder_for_biomolecules<MCS_builder_proteins_FIT, MCS_builder_NA_FIT> MCS_builder_biomols_FIT;
// 4. Loader parameterized by the builder
Builds a covalent structure from a given model.
Definition Molecular_covalent_structure_builder_proteins.hpp:29
Representation of the covalent structure of a molecular conformation.
Definition Molecular_covalent_structure.hpp:60
Loader for covalent structures from PDB / mmCIF files.
Definition Molecular_covalent_structure_loader.hpp:34
The following are example packages using the previous instantiations:


ParticleInfo concept and its use for biomolecules

Design

We have seen in the previous section that different applications use different types of atoms. This mechanism actually uses two levels:

  • ParticleTraits: the particle type used, namely atom or pseudo-atom.

We analyse these building blocks in the sequel.

The ParticleInfo concept

The class SBL::CSB::T_Particle_info_traits< ParticleTraits > provides all the necessary types and operations depending on the model of ParticleTraits . Such a mechanism uses the C++ Partial Template Specialization : any structure can be used as a model of the concept ParticleTraits provided that the class SBL::CSB::T_Particle_info_traits is redefined for this particular model. The required operations depend on the context where it is used :

  • For any covalent structure, the requirements are given by the class SBL::CSB::T_Molecular_covalent_structure : SBL::CSB::T_Particle_info_traits , which must define two operators:

    • Less : comparator used to retrieve a vertex in the covalent structure from its info.

      bool operator()(const Particle_info& p, const Particle_info& q)const;

    • Printer : prints the information of the particle into a stream passed as argument.

      void operator()(std::ostream& out, const Particle_info& info)const;
    Dummy case with strings. As a simple structure, a std::string can be used for representing the information of a particle; this can be done by including the file "SBL/CSB/Particle_info_string.hpp" .


ParticleInfo for biomolecules

For proteins and nucleic acids, the structure SBL::CSB::T_Particle_info_biomolecules wraps a Particle_type defined from the concept ParticleTraits. Usually, one can use the model SBL::Models::T_Atom_with_hierarchical_info_traits, that uses the Molecular_system atom type as Particle_type. This can be done by including the file "SBL/CSB/Particle_info_proteins.hpp".

SBL::CSB::T_Particle_info_traits must define three operators:

  • Finder : uses the covalent structure and biochemical data identifying a particle (chain identifier, residue sequence number, insertion code, atom name) to retrieve the corresponding vertex in the covalent structure;

    std::pair<bool, Particle_rep> operator()(const Covalent_structure& S, char chain_id, int res_id, std::string atom_name = "CA")const;

  • Builder : builds an info from the relevant biochemical data (chain identifier, residue sequence number, insertion code, atom name, or a Molecular_system atom having all this data) and a terminal tag (positive if N-terminal, negative if C-terminal, null otherwise);

    Particle_info operator()(const Covalent_structure& S, char chain_id, int res_id, const std::string& res_name, const std::string& atom_name, int ter = 0)const;
    Particle_info operator()(const Covalent_structure& S, const Atom_type& a, const std::string& atom_name, int ter = 0)const;

  • Aliases implements a dictionnary of aliases for the atom names in a protein, in order to have a unique atomic naming convention – see atom-naming-convention;

    const std::string& operator()(const std::string& name)const;

For other molecules, the requirements are given by the class SBL::IO::T_Molecular_covalent_structure_loader_from_MOL : SBL::CSB::T_Particle_info_traits must define the operator Builder ;

Note that some atoms in the covalent structure may have no equivalent in the Molecular_system data structure, e.g. because there is no representation of this atom in the loaded file representing the protein. This happens typically for missing hydrogen atoms, or portions of long side chains unresolved in a crystal structure. To be able to identify an atom from the particle info data structure even without the Molecular_system data structure, the chain identifier, residue sequence number, insertion code and atom name are also directly stored within the SBL::CSB::T_Particle_info_biomolecules . structure.


Molecular Covalent Structure with graph: representation and accessors

Data structure and internal maps

The default implementation provided uses internally a graph data structure to code particles and bonds, which comes with the following pros and cons:

  • (pro) The graph comes with topological operations
  • (cons) Access to the data members (internal coordinates in particular) are not optimized.

Basically, the class SBL::CSB::T_Molecular_covalent_structure is a wrapper for a boost graph with accessors, modifiers and iterators over the graph. More precisely, it uses the boost subgraph data structure to offer the possibility of clustering the covalent structure, e.g by polypeptidic chains or by residues. There is a unique template parameter ParticleInfo representing the information attached to a particle, that is a simple string by default. An object of type ParticleInfo should be ordered, streamable and default constructible. Four main functionnality are available.

Building the covalent structure.

Adding a particle to the covalent structure is done using the method SBL::CSB::T_Molecular_covalent_structure::add_particle : this method takes a particle info representing that particle.

Adding a bond is done using the method SBL::CSB::T_Molecular_covalent_structure::add_bond : this method takes two vertices of the graph, and adds, if possible, an edge between the two vertices. It also takes optionnaly a bond type (simple by default, or double) to represent the covalence of the bond.

Mapping particleInfo to graph particleRep.

To retrieve particle reps from particleInfos, we define the map

typedef std::map<Particle_info, Particle_rep, Particle_info_less> ParticleInfo_to_particleRep_map_type;
ParticleInfo_to_particleRep_map_type m_particleInfo_to_particleRep;
T_Molecular_covalent_structure<ParticleInfo>::Particle_rep
add_particle(const Particle_info& info){
...
this->m_particleInfo_to_particleRep.insert(std::make_pair(this->m_graph[p], p));
}

Mapping particleReps to atomic coordinates. In the SBL, a geometric model is represented by a conformation that is a D-dimensional point consisting of the cartesian coordinates of the particles of a molecule. By default, the ith inserted vertex is mapped to the ith particle in the conformation.

Given an embedded vertex, it is then possible to access x, y and z coordinates from the covalent structure from the methods SBL::CSB::T_Molecular_covalent_structure::get_x, SBL::CSB::T_Molecular_covalent_structure::get_y and SBL::CSB::T_Molecular_covalent_structure::get_z .

Since a particle reps are in 0..n-1, with position the index in the molecular system as defined by the loader in the function

// Mapping particleRep (an int, actually), to the linear position used in the conformation
std::vector<int> m_particleRep_to_linearPosition;
// used in the function
void set_particle_linear_position(Particle_rep p, int position) {this->m_particleRep_to_linearPosition[p] = position;}
// Called from
int T_Molecular_covalent_structure_loader<Molecular_covalent_structure_builder_>::
make_covalent_structure_conformation_mapping(const typename Molecular_system::Model& model, Molecular_covalent_structure& mcs) {
mcs.set_particle_linear_position(p.second, i);
}
typedef std::map<int, Particle_rep> AtomSerialNumber_to_particleRep_map_type;

Mapping atom serial number to particle rep.

To locate a corresponding covalent structure node from a Molecular_system::Atom:

typedef std::map<int, Particle_rep> AtomSerialNumber_to_particleRep_map_type;
AtomSerialNumber_to_particleRep_map_type m_atomSerialNumber_to_particleRep;
// Used in the function
void set_atom_serial_number_particle(int atom_serial_number, Particle_rep p) {
this->m_atomSerialNumber_to_particleRep[atom_serial_number] = p;}
// Called from
int T_Molecular_covalent_structure_loader<Molecular_covalent_structure_builder_>::
make_covalent_structure_conformation_mapping(const typename Molecular_system::Model& model, Molecular_covalent_structure& mcs) {
...
mcs.set_atom_serial_number_particle(it->first, p.second);
}
The other way is easy since each covalent structure node is shipped with a pointer to a Molecular_system::Atom. get_sub_structure() is also modified to construct the atomSerialNumber_to_particleRep map of the substructure's particle rep to the molecular system.


Accessors

Iterating over the covalent structure.

It is possible to iterate over the particles, bonds, bond angles and torsion angles of the covalent structure. There are two modes of iteration : over all the entities, or over all the entities that are mapped with a geometric model. For example, if ones want to compute internal coordinates, it is only possible to do so from particles that are mapped. To switch between the modes, each iteration method has as argument an optional boolean tag (by default, it iterates over mapped particles).

Consider a set of particles associated with one or several connected components of the covalent structure. The subgraph induced by these particles is returned by SBL::CSB::T_Molecular_covalent_structure::get_sub_structure . The type of this graph is SBL::CSB::T_Molecular_covalent_structure . Note that stored information are not duplicated : modifying the sub-structure modifies also the parent covalent structure, and modifying the parent covalent structure modifies the sub-structure.

As an application, one can iterate on the individual molecules of a molecular system, as these correspond to the connected components of the covalent structure. To do so, one proceeds as follows :

Displaying the covalent structure.

The covalent structure can be dumped in a file using Graphviz through the method SBL::CSB::T_Molecular_covalent_structure::print . For large systems (e.g. large proteins), the high number of particles, the Graphviz software may take a time to create the output. Also note that the neato command from Graphviz has a better rendering than the more classical dot command in this case.

Molecular Covalent Structure with graph: builders

The builder for biomolecules. The builder uses individual builders for proteins and nucleic acids, which themselves inherit common features from a generic builder for linear polymers.

Design


Building proteins and nucleic acids. To build the MCS of a biomolecule, one needs to know the sequences of residues: sequence of a.a. for a protein, and sequence of bases for a nucleic acid.

As noticed earlier, a given input file may contain a complex with both types of biomolecules. For this reason, the class SBL::CSB::T_Molecular_covalent_structure_builder_for_biomolecules is parameterized by two templates, namely one builder for proteins and one builder for nucleic acids.

The builder for biomolecules also contains the following enum to distinguish between both cases:

enum MCS_builder_type {BUILD_PROT, BUILD_NA};

Upon processing a sequence (vector) of residue, the flag is properly positioned, so that the function make_residue() call the appropriate builder (for proteins or nucleic acids). As indicated below, the flag is set using the number of chars of the residue name.

The following is an excerpt from the file Core/Molecular_covalent_structure/include/SBL/CSB/Molecular_covalent_structure_builder_biomolecules.hpp

bool T_Molecular_covalent_structure_builder_for_biomolecules<CovStructBuilderForProteins,CovStructBuilderForNucleicAcids>::
operator()(const std::map<std::string, std::vector<Residue_info>>& chain_residue_info, Molecular_covalent_structure& S) {
for(auto& residue_infos : chain_residue_info) {
// decide (protein, DNA, RNA) and call the proper constructor
unsigned num_chars = (residue_infos.second)[0].name.length();
std::cerr << (residue_infos.second)[0].name << " " << num_chars << "\n";
// Note: Further distinction between RNA and DNA is made in the nucleic_acid chain builder
bool b_tmp;
if (num_chars == 3)
m_build_type = BUILD_PROT;
else if (num_chars == 1 || num_chars == 2)
m_build_type = BUILD_NA;
}


Required pieces of information. Independently from the biomolecular type, building a covalent structure requires three pieces of information:

  • Attributes for vertices (type of vertices).
  • Attributes for edges (type of covalent bonds).
  • The topology itself. For example, when building the subgraph associated with an amino-acid, the topology of this amino-acid is required. this information can be implicit as in the case of amnino-acids and polypeptide chains, or can be explicit i.e. provided in the input file. This accounts for the two types of builders provided thereafter.
In the context of potential energy calculations – see Molecular_potential_energy, molecular dynamics software (e.g Gromacs) usually use parameter files including the topology of those molecules (e.g itp file format). We deliberately choose to separate the topology from those parameters, allowing to perform independently analysis and calculations over the covalent structure. In particular, the covalent structure is represented by a graph that can be used as input of combinatorial packages in CADS such as Betti_numbers or Earth_mover_distance


Generic builder for biomolecules


File SBL/CSB/Molecular_covalent_structure_builder_generic.hpp.

SBL::CSB::T_Molecular_covalent_structure_builder_genericMolecular_covalent_structure_, SpecializedCSBuilder> : a generic addition of nodes / edges to a covalent structure assuming a polymer of units also called residues.

The SpecializedCSBuilder must provide 3 operations: make_first_residue(), make_residue(), make_last_residue().

In doing so, nodes and edges are added to a covalent structure passed as argument.

Builders for polypeptide chains


File SBL/CSB/Molecular_covalent_structure_builder_proteins.hpp. Specialized builder for proteins.

the class T_Molecular_covalent_structure_builder_for_proteins<Molecular_covalent_structure_> implements make_backbone_unit() (proteins: N-CA-C), make_first_residue(), make_residue(), make_last_residue() to build a polymer of amino acids.

The class SBL::CSB::T_Molecular_covalent_structure_builder_for_proteins is a functor building the covalent structure from an arbitrary data structure SBL::CSB::T_Molecular_covalent_structure_builder_for_proteins::Structure defining the polypeptidic chains and their residues. More precisely, a protein is represented by a mapping from a chain identifier to the chain itself, each chain being represented by a list of pairs (residue name, residue id). The graph of the covalent structure can be constructed at different scales : (0) with all atoms from all residues, (1) with only the heavy atoms, (2) with only the carbon alpha of each residue, or (3) with pseudo-atoms from the Martini coarse grain model. The scale factor is determined at the construction of the builder and is 0 by default. The only template parameter Covalentstructure is the covalent structure type.

The nomemclature for atoms in PDB files can be found here: IMGT / Amino acids: 3D model and atoms nomenclature. Should the PDB/mmCIF parser find an atom whose naming is unknown, this atom is created in the covalent structure but is not embedded.


For sequence alignement purposes, residues in PDB files are identified by their sequence number and their insertion code. In particular, sequence numbers are not necessarily successive, and several residues with the same sequence number and different insertion codes may coexist. For this reason, in the builder, residues are sorted by the atomic serial number of their first atom. In addition, a geometric predicate is used to check if the atom C of a residue makes a bond with the atom N of the next residue in the sequence. Practically, it is checked that the distance between these atoms is less than 2 $\ang$. If it is not the case, no bond is created in the covalent structure and the covalent structure has multiple connected components.


Builders for nucleic acids


File SBL/CSB/Molecular_covalent_structure_builder_nucleic_acids.hpp . Specialized builder for nucleic acids

the class T_Molecular_covalent_structure_builder_for_nucleic_acids<Molecular_covalent_structure_> implements make_backbone_unit(), make_first_residue(), make_residue(), make_last_residue() to build a polymer of bases.

Builders for biomolecules


File SBL/CSB/Molecular_covalent_structure_builder_biomolecules.hpp. The class T_Molecular_covalent_structure_builder_for_biomolecules<CovStructBuilderForProteins,CovStructBuilderForNucleicAcids> is parameterized by two specialized builders for proteins and nucleic acids respectively.

It contains two data members corresponding to these builders.

It main operator

operator()(const std::map<std::string, std::vector<Residue_info>>& chain_residue_info, Molecular_covalent_structure& S)

processes a map, whose (key,value) pairs give access to a vector of residues for each chain. Upon processing a vector, the length of the name of the first residue is inspected, and used to decide which specialized builder must be called.

In practice, this class is instantiated with the two specialized builders, which are themselves instantiated with the same covalent structure:

typedef SBL::CSB::T_Molecular_covalent_structure_builder_for_biomolecules<Cov_struct_builder_proteins, Cov_struct_builder_nucleic_acids> Default_molecular_covalent_structure_builder;

This design makes it possible to create a graph containing different connected components for proteins and nucleic acids. It is also possible to edit this graph and create a covalent bond between both, if such a bond exists.

NB: Default_molecular_covalent_structure_builder is the builder to use in practice.

Builders for other molecular species

Since there is no particular difficulty nor operation for loading molecules from the MOL format, building the covalent structure from a MOL file is directly done in the loader SBL::CSB::T_Molecular_covalent_structure_loader_from_MOL .

MCS Loaders

Design

The class T_Molecular_covalent_structure_loader is the orchestra conductor.

To get into the mood, inspect: T_Molecular_covalent_structure_loader<Molecular_covalent_structure_builder_>:: load_molecular_covalent_structures(void) {

  • Step 1. creation of the covalent structure.

  • Step 2. to set the atomic coordinates, we iterate on the this function iterates on all atoms, and for each: grabs the corresponding graph node using Particle_info, and if the atom is found in the molecular system, its coordinates are set


Step 1: covalent graph creation. The constructor

Molecular_covalent_structure_builder mcs_builder(Self::m_coarse_level, Self::m_allow_incomplete_chains);

uses

return S.add_particle(this->m_info_builder(S, chain_name, res_id, ins_code, residue_name, atom_name, ter));

The add_particle comes from T_Molecular_covalent_structure<ParticleInfo>::add_particle(const Particle_info& info), which does 2 critical things:

  1. creation of a new graph node and setting of its attributes (Particle_info& info)
  2. creation of an entry <particle_info, Particle_rep> in the map m_particleInfo_to_particleRep NB: this is critical since when setting coordinates with the function include/SBL/IO/Molecular_covalent_structure_loader.hpp make_covalent_structure_conformation_mapping(model, mcs);

    NB2: make_covalent_structure_conformation_mapping() now adds entries to the map atomSerialNumber_to_particleRep for each atom in the molecular system


Step 2: setting coordinates.
The critical function is

int index = Self::make_covalent_structure_conformation_mapping(model, mcs);
???
std::pair<bool, typename Molecular_covalent_structure::Particle_rep> p = Self::m_info_finder(mcs, s1, s2, s3, s4);

since it returns the graph node to which the coords will be associated. this graph node must have been set by main step 1.

Main files and classes


File SBL/IO/Molecular_covalent_structure_loader.hpp.

the main class, which creates the covalent structures, and associates coordinates.

File formats

fc: comment

File loaded into a molecular system: Molecular_system

File formats

Practically, the following file formats are used:

  • The (legacy) PDB format: the reference format for biomolecules whose structure has been resolved and are deposited in the Protein Data Bank ( PDB website).

  • The mmCIF file format.

  • For molecules other than peptides and proteins, we use the MOL format – see section Beyond the PDB format and MOL format.

Loading atomic resolution covalent structures

As noted above, loaders for MCS depend on loaders for molecular systems from the package Molecular_system. Such loaders use the libcifpp library.

Loader from PDB/mmCIF files.

The class SBL::IO::T_Molecular_covalent_structure_loader is a loader as defined in the Module_base package : it loads an input PDB/mmCIF, uses a builder to build a covalent structure, and maps the vertices of the covalent structure to the loaded atoms by order of their atomic sequence identifier.

The following comments are in order:

  • Water molecules are not loaded by default. To change this behaviour, one can use the methods SBL::IO::T_Molecular_covalent_structure_loader::set_loaded_water. Then, water molecules are added as independent connected components in the covalent structure graph.

  • Hetero atoms are not loaded by default. In the future, the flag SBL::IO::T_Molecular_covalent_structure_loader::set_loaded_hetatoms will be used to load them. Note that for hetero atoms which are not monoatomic ions, the connectivity is unknown.

  • As note above, disulfide bonds are added to the built covalent structure, see the package Pointwise_interactions. This feature is available by default but can be turned off using the method SBL::IO::T_Molecular_covalent_structure_loader::set_without_disulfide_bonds .

  • It is possible to change the maximal authorized length of the bond between the C of a residue and the N of the next residue using the method SBL::IO::T_Molecular_covalent_structure_loader::set_max_bond_distance.

Specific atomic naming conventions.

  • Some atom names may have different conventions. However, it is important for further calculations (e.g potential energy calculation) to have a unique convention for each atom name. When required, we fixed the convention for the naming to be the one used within force field parameters. For example, hydrogen atoms bonded to the first carbon in the N-terminal cap called ACE, or the C-terminal cap called NME can be names XHH3 or HH3X, where X is replaced by 1, 2 or 3. In Amber force fields, it is named HH3X. To handle such cases, we fixed our convention on HH3X. This label is set by the operator Aliases defined in a specialization of the class SBL::CSB::T_Particle_info_traits .

Loader from MOL files.

The class SBL::IO::T_Molecular_covalent_structure_loader_from_MOL is a loader as defined in the Module_base package : it simply loads an input MOL file, and directly builds the covalent structure from the loaded data and maps the vertices to the laoded particles. It has a unique template parameter MolecularCovalentStructure , that is the representation of the covalent structure, by default SBL::CSB::T_Molecular_covalent_structure <> (see package Molecular_covalent_structure)

Loading coarse grain covalent structures

It is possible to set the coarse level used by the builder to create the covalent structure. This class has four template parameters, all having a default type :

Implementation: advanced features

Canonical representations

Manipulating molecular coordinates (package Molecular_coordinates) or computing potential energies (package Molecular_potential_energy) benefits from a canonical representation of internal coordinates.

The method SBL::CSB::T_Molecular_covalent_structure::get_canonical_rep returns the cannonized representation of a bond, a bond angle or a dihedral angle.


Canonical representation of a bond. Such a representation is obtained by ordering its two vertices – that is the particle_rep used to represent the vertices.


Canonical representation of a valence angle. Such a representation is obtained by ordering the first and third vertices – since the middle one is imposed.


Canonical representation of a dihedral angle. This case covers two sub-cases (see also the package Molecular_coordinates):

  • Proper angle: second and third vertices sorted;
  • Improper angle: central atom first, remaining three sorted.

Covalent structure with optimized access

So far, we have described a covalent structure based on a graph data structure.

When the MCS is used to compute the potential energy and the gradient of the energy of a large number of conformations, it becomes necessary to optimize its traversal – and the calculation of internal coordinates. The base data structure uses a graph for representing it, so that it is easy to visit particles and bonds. However, the following comments are in order:

  • visiting bond angles and torsion angles require some calculations, since they are not stored within the graph data structure;
  • supplemental filters from the parameters used for computing the energy may reduce the number of particles, bonds or angles to visit : this is particularly the case when 1-3 interactions are ignored, or when a force constant is considered null for a given bond or angle type.

For these reasons, two covalent data structures are provided:

  • a data structure with a rich representation (a graph), yet slow access. See section mcs-default .
  • a data structure with a simpler representation, yet providing more efficient access. See section mcs-optimized .

The class SBL::CSB::T_Molecular_covalent_structure_optimal< ParticleInfo > is an optimized version of the covalent structure data structure. In particular, all entities (particles, bonds and angles) that are used in the computation of the energy are stored in a vector, ensuring a constant time access each time.

In order to use this class, once has to first build a covalent structure with the class SBL::CSB::T_Molecular_covalent_structure, then use the class SBL::CSB::T_Molecular_potential_energy_structure_parameters_traits_optimal to build the optimized version of the covalent structure data structure. See the package Molecular_potential_energy for examples and more details.

The functionalities depending on the graph structure which are not used in the potential energy calculation are unavailable with the optimized data structure. This includes using the graph data structure itself, dynamically changing the graph (adding particles or bonds), and printing the covalent structure in the .dot format.


Examples

fc: example outdated; use instantiations

The following example shows how to load a PDB file and build the corresponding covalent structure at a given scale. It then output a dot file in the Graphviz format to check the covalent structure.

#include <iostream>
#include <sstream>
#include <iterator>
#include <algorithm>
/*
typedef SBL::Models::T_Atom_with_hierarchical_info_traits<> Particle_traits;
typedef SBL::CSB::T_Particle_info_for_proteins<Particle_traits> Particle_info;
typedef SBL::CSB::T_Particle_info_traits<Particle_info> Particle_info_traits;
typedef SBL::CSB::T_Molecular_covalent_structure<Particle_info> Covalent_structure;
typedef SBL::CSB::T_Molecular_covalent_structure_builder_for_proteins<Covalent_structure> Covalent_structure_builder;
typedef SBL::IO::T_Molecular_covalent_structure_loader<Covalent_structure_builder> Covalent_structure_loader;
*/
// The biomols builder, which handles both proteins and NA
#include <SBL/CSB/Molecular_covalent_structure_instantiations.hpp>
// Recall that the MCS_loader inherits from the Molecular_system_loader
typedef SBL::CSB::MCS_loader_biomols_HIT MCS_loader;
typedef typename MCS_loader::Molecular_covalent_structure_builder::Molecular_covalent_structure Covalent_structure;
int main(int argc, char *argv[])
{
if(argc < 2) {
return -1;
}
//Loads the covalent structure from a PDB file.
MCS_loader mcsl;
mcsl.set_loaded_file_paths({argv[1]});
mcsl.set_loaded_water(false);//(true);
if(argc > 2) {
mcsl.set_coarse_level(atoi(argv[2]));
}
mcsl.set_allow_incomplete_chains(true);
bool b = mcsl.load(true, std::cout);
if(!b) {
return -1;
}
Covalent_structure& S = mcsl.get_molecular_covalent_structures()[0];
std::ofstream out("structure.dot");
S.print(out);
out.close();
std::cout << "Number of bonds (theoretical vs traversal) : " << boost::num_edges(S.get_graph()) << " " << std::distance(S.bonds_begin(true), S.bonds_end(true)) << "/" << std::distance(S.bonds_begin(false), S.bonds_end(false)) << std::endl;
//Compute the theoretical number of bond angles.
unsigned nb = 0;
for(Covalent_structure::Particles_iterator it = S.particles_begin(false); it != S.particles_end(false); it++)
{
unsigned n = boost::degree(*it, S.get_graph());
if(n > 1)
nb += n*(n-1)/2;
}
std::cout << "Number of bond angles (theoretical vs traversal) : " << nb << " " << std::distance(S.bond_angles_begin(true), S.bond_angles_end(true)) << "/" << std::distance(S.bond_angles_begin(false), S.bond_angles_end(false)) << std::endl;
//Compute the theoretical number of proper dihedrals.
nb = 0;
for(Covalent_structure::Bonds_iterator it = S.bonds_begin(false); it != S.bonds_end(false); it++)
{
unsigned n1 = boost::degree(boost::source(*it, S.get_graph()), S.get_graph()) - 1;
unsigned n2 = boost::degree(boost::target(*it, S.get_graph()), S.get_graph()) - 1;
if(n1 > 0 && n2 > 0)
nb += n1*n2;
}
std::cout << "Number of proper dihedrals (theoretical vs traversal) : " << nb << " " << std::distance(S.dihedral_angles_begin(true), S.dihedral_angles_end(true)) << "/" << std::distance(S.dihedral_angles_begin(false), S.dihedral_angles_end(false)) << std::endl;
//Compute the theoretical number of improper angles.
nb = 0;
for(Covalent_structure::Particles_iterator it = S.particles_begin(false); it != S.particles_end(false); it++)
{
unsigned n = boost::degree(*it, S.get_graph());
if(n > 2)
nb += n*(n-1)*(n-2)/6;
}
std::vector<Covalent_structure::Particle_rep> ccs;
S.get_molecules(std::back_inserter(ccs));
std::cout << "Number of molecules : " << ccs.size() << std::endl;
for(unsigned i = 0; i < ccs.size(); i++)
{
Covalent_structure mol;
S.get_molecule(ccs[i], mol);
std::cout << "Molecule " << (i+1) << " (particles / bonds) : " << std::distance(mol.particles_begin(true), mol.particles_end(true)) << "/" << std::distance(mol.particles_begin(false), mol.particles_end(false)) << " / " << std::distance(mol.bonds_begin(true), mol.bonds_end(true)) << "/" << std::distance(mol.bonds_begin(false), mol.bonds_end(false)) << std::endl;
Covalent_structure::Particles_iterator jt = mol.particles_begin(false);
Covalent_structure::Particle_rep p = *jt;
std::cout << "First atom: " << mol.get_graph()[p].atom_name << " " << mol.get_graph()[p].res_name << " " << mol.get_graph()[p].res_id << mol.get_graph()[p].ins_code << " " << mol.get_graph()[p].chain_id << std::endl;
Covalent_structure::Particles_iterator next = jt;
next++;
while(next != mol.particles_end(false))
{
jt++;
next++;
}
p = *jt;
std::cout << "Last atom: " << mol.get_graph()[p].atom_name << " " << mol.get_graph()[p].res_name << " " << mol.get_graph()[p].res_id << mol.get_graph()[p].ins_code << " " << mol.get_graph()[p].chain_id << std::endl;
}
return 0;
}
bool load(unsigned verbose=false, std::ostream &out=std::cout) override
Load function.
Definition Molecular_covalent_structure_loader.hpp:634
const std::vector< Molecular_covalent_structure > & get_molecular_covalent_structures(void) const
Retrieves the molecular covalent structures loaded from PDB / mmCIF file(s).
Definition Molecular_covalent_structure_loader.hpp:349