Structural Bioinformatics Library
Template C++ / Python API for developing structural bioinformatics applications.
User Manual

Linear_polymer_representation

Authors: G. Carrière and F. Cazals and C. Robert

Introduction: linear polymers and associated iterators

The current package is used to represent the two main classes of biomolecules offered in the SBL: proteins and nucleic acids.

The main goal is to provide a coherent interface, in terms of iterators, to easily access all relevant pieces of information:

  • residues,
  • atoms,
  • atoms in residue,
  • backbone torsion angles,
  • side chain torsion angles.

Concepts and prerequisites

Representations for polymers and their atoms

Biomolecules: a mix of geometry, topology, biophysics, and biology. Biomolecules in general and polypeptide chains (PC in this package) in particular are complex objects. Their description indeed involves

  • geometric information, i.e. the coordinates, which may be Cartesian or internal – see also Molecular_coordinates,
  • topological information, i.e. the covalent structure – see also Molecular_covalent_structure,
  • biophysical annotations inherently associated to the PCs: the hierarchical organization of PCs (atoms, amino-acids, whole chain), and also selected annotations (e.g. secondary structures). (See also Molecular_system for data structures giving access to such pieces of information.)

It should be stressed that these three categories of information exist independently and may be used independently. For example:

  • The study of conformations, say in the context of the exploration of potential (and free) energy landscapes, requires geometric and topological information to compute energies. See e.g. the packages Molecular_potential_energy and Landscape_explorer.


Indices. As detailed in Section (Advanced/critical) Atoms and indices, the manipulation of atoms involves the following sets of indices:

  • Serial number: from the PDB/cif file,
  • Atomid: consecutive indices assigned by the file reader/molecular system loader,
  • Linear position: atom position in the structure, once gaps have been removed,
  • Particle representation, aka Vertex id: vertex id of the atom in the boost graph used to represent the molecular topology. This type is provided by the graph library used, boost graph in our case.

Finally, recall that an atom is termed embedded if it has Cartesian coordinates. In particular, all missing atoms in a PDB file are stored in the graph representing the molecule, but are not embedded.

As detailed in the package Molecular_system, we use the library libcifpp to parse PDB/mmCIF files. Doing so, atom ids are set as follows:
class PDBFileParser : defines  int mAtomID = 0; (class member)
Function PDBFileParser::ParseCoordinate(int modelNr) performs the increment: ++mAtomID;



Concepts. We briefly review the main concepts used in this package, and refer the user to the template parameters of the class SBL::CSB::T_Linear_polymer_representation for precise definitions. These concepts are:

  • residue: the basic building block of a linear polymer, that is amino acids (proteins) or nucleotides (nucleic acids).
  • backbone trace: the vector of atom names constituting the pattern, which when repeated, defines the backbone of the biopolymer. For proteins, the trace consists of the triple of atoms "N", "CA", "C".
  • molecular covalent structure (MCS): the covalent structure used to store the molecular graph. See the package Molecular_covalent_structure.
  • chain: instance(s) of the MCS.
  • Dihedral_angles_map_type: the map used to name the dihedral angles found on the backbone of the biopolymer. NB: this is a static data member which is initialized in the derived classes.
For proteins, one has:
        Base::s_da_map =  {
          {P_phi, std::make_tuple("C", "N", "CA", "C")},
          {P_psi, std::make_tuple("N", "CA", "C", "N")},
          {P_omega,std::make_tuple("CA", "C", "N", "CA")}
        };
And for nucleic acids:
        Base::s_da_map =  {
          {NT_alpha,   std::make_tuple("O3'", "P",   "O5'", "C5'")},  // O3' here is in residue i-1
          {NT_beta,    std::make_tuple("P",   "O5'", "C5'", "C4'")},
          {NT_gamma,   std::make_tuple("O5'", "C5'", "C4'", "C3'")},
          {NT_delta,   std::make_tuple("C5'", "C4'", "C3'", "O3'")},
          {NT_epsilon, std::make_tuple("C4'", "C3'", "O3'",   "P")},  // here P is in residue i+1
          {NT_zeta,    std::make_tuple("C3'", "O3'",   "P", "O5'")},  // here P and O5' are in residue i+1
          //The following are the orientation dihedral "Chi" for pyrimidine/purine bases
          {NT_pyr,     std::make_tuple("O4'", "C1'",  "N1",  "C2")},  // same as Chi (pyrimidine base)
          {NT_pur,     std::make_tuple("O4'", "C1'",  "N9",  "C4")}   // same as Chi (purine base)
        };
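As an illustration of how such a map can be queried, here is a minimal standalone sketch. It does not use the SBL classes: the enum Dihedral_type, the aliases Atom_names and Dihedral_angles_map, and the helper functions make_protein_map and atom_names_of are hypothetical names introduced for this example only.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <tuple>

// Hypothetical stand-ins for the angle identifiers listed above (proteins only).
enum Dihedral_type { P_phi, P_psi, P_omega };

// The four atom names defining one dihedral angle.
using Atom_names = std::tuple<std::string, std::string, std::string, std::string>;
using Dihedral_angles_map = std::map<Dihedral_type, Atom_names>;

// Builds a map with the same content as the protein map shown above.
inline Dihedral_angles_map make_protein_map() {
  return Dihedral_angles_map{
    {P_phi,   Atom_names{"C",  "N",  "CA", "C"}},
    {P_psi,   Atom_names{"N",  "CA", "C",  "N"}},
    {P_omega, Atom_names{"CA", "C",  "N",  "CA"}}
  };
}

// Returns the 4-tuple of atom names registered for a given angle type.
inline const Atom_names& atom_names_of(const Dihedral_angles_map& m, Dihedral_type t) {
  const auto it = m.find(t);
  assert(it != m.end());  // the type must have been registered in the map
  return it->second;
}
```

With this sketch, std::get<0>(atom_names_of(make_protein_map(), P_phi)) yields the name "C", i.e. the carbonyl carbon of the previous residue.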



Iterators. Iterators used to access a named piece of information are implemented using boost filtered iterators.

Let us take two examples:

  • Heavy atoms: the filter iterator uses a predicate targeting non-hydrogen atoms:
     typedef boost::filter_iterator<Is_Heavy, Atoms_iterator>                      Heavy_iterator;
    
  • Iterator providing torsion angles of a given type (cf. the types given in the map Dihedral_angles_map_type): the filter iterator uses a predicate has_type() indicating whether the instance's type matches the chosen type in the aforementioned Dihedral_angles_map_type:
typedef boost::filter_iterator<Is_Chosen_angle,  Dihedral_angle_const_iterator>   Chosen_angle_const_iterator;
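The principle of a filtered iterator can be sketched without Boost. The class below is a deliberately minimal, standard-library-only stand-in for boost::filter_iterator, not the SBL implementation; the types Atom, Is_heavy and Filter_iterator, and the function heavy_atom_names, are assumptions made for the example.

```cpp
#include <cassert>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical minimal atom: a name and a chemical element.
struct Atom { std::string name; std::string element; };

// Predicate selecting non-hydrogen atoms.
struct Is_heavy {
  bool operator()(const Atom& a) const { return a.element != "H"; }
};

// Minimal forward filter iterator: skips the elements rejected by the predicate.
template <class Predicate, class Iterator>
class Filter_iterator {
public:
  Filter_iterator(Predicate pred, Iterator it, Iterator end)
    : m_pred(pred), m_it(it), m_end(end) { skip(); }
  typename std::iterator_traits<Iterator>::reference operator*() const {
    assert(m_it != m_end);  // dereferencing the end iterator is an error
    return *m_it;
  }
  Filter_iterator& operator++() { ++m_it; skip(); return *this; }
  bool operator!=(const Filter_iterator& other) const { return m_it != other.m_it; }
private:
  // Advances to the next element accepted by the predicate (or to the end).
  void skip() { while (m_it != m_end && !m_pred(*m_it)) ++m_it; }
  Predicate m_pred;
  Iterator m_it;
  Iterator m_end;
};

// Collects the names of the heavy atoms, in their order of appearance.
inline std::vector<std::string> heavy_atom_names(const std::vector<Atom>& atoms) {
  using It = Filter_iterator<Is_heavy, std::vector<Atom>::const_iterator>;
  It it{Is_heavy{}, atoms.begin(), atoms.end()};
  const It end{Is_heavy{}, atoms.end(), atoms.end()};
  std::vector<std::string> names;
  for (; it != end; ++it) names.push_back((*it).name);
  return names;
}
```

The underlying range is left untouched: the predicate only decides which elements are visited, which is why one container can serve several named iterators.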

Torsion angles: generic algorithms

Backbone torsion angles


Notations. Since we deal with branched polymers, we assume the following quantities are well defined:

  • $\text{traceBB}$: the backbone trace, whose size is denoted $s$. Example: $[N, CA, C]$ for a protein, with $s=3$.
  • $\text{traceSC}$: the side chain trace, that is the list of heavy atoms on the side chain.

A dihedral angle is defined by a 4-tuple, corresponding in practice to four atoms of the molecular covalent structure (a boost graph). Given a covalent structure and a specific dihedral angle, the goal is therefore to have a generic procedure to identify these four atoms.
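Once the four atoms are identified, the angle value follows from their coordinates. The sketch below uses the standard atan2 formulation on plain arrays; it is not taken from the SBL code base, and the alias Point and the functions sub, cross, dot and dihedral_angle are names chosen for this example. It assumes a number type FT supported by the functions of <cmath>.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Hypothetical point type: three coordinates of number type FT.
template <class FT> using Point = std::array<FT, 3>;

template <class FT>
Point<FT> sub(const Point<FT>& a, const Point<FT>& b) {
  return Point<FT>{{a[0] - b[0], a[1] - b[1], a[2] - b[2]}};
}

template <class FT>
Point<FT> cross(const Point<FT>& a, const Point<FT>& b) {
  return Point<FT>{{a[1] * b[2] - a[2] * b[1],
                    a[2] * b[0] - a[0] * b[2],
                    a[0] * b[1] - a[1] * b[0]}};
}

template <class FT>
FT dot(const Point<FT>& a, const Point<FT>& b) {
  return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

// Signed dihedral angle, in radians in [-pi, pi], defined by the 4-tuple (p0, p1, p2, p3).
// The rotation axis is the central bond p1-p2.
template <class FT>
FT dihedral_angle(const Point<FT>& p0, const Point<FT>& p1,
                  const Point<FT>& p2, const Point<FT>& p3) {
  const Point<FT> b1 = sub(p1, p0), b2 = sub(p2, p1), b3 = sub(p3, p2);
  const Point<FT> n1 = cross(b1, b2), n2 = cross(b2, b3);  // normals of the two planes
  const FT len_b2 = std::sqrt(dot(b2, b2));
  assert(len_b2 > FT(0));  // the central bond must not be degenerate
  const FT y = len_b2 * dot(b1, n2);
  const FT x = dot(n1, n2);
  return std::atan2(y, x);
}
```

With this formula, a cis arrangement of the four points gives 0 and a trans arrangement gives an angle of magnitude $\pi$.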


BB torsion angles. For a backbone trace of size $s$ and omitting the first and last residues, there are exactly $s+1$ torsion angles involving the $s$ backbone atoms of a given residue (Fig. fig:bb-torsion-angles).

Indexing the atoms of the backbone trace with $j\in [0,s-1]$, these torsion angles are as follows:

  • The first one rotates around the covalent bond connecting the residue to the previous one. Its first atom has offset $-2$.
  • The last one rotates around the covalent bond connecting the residue to the next one. Its first atom has offset $s-2$.

In total, we therefore get $s-2+2+1=s+1$ angles. Note that for proteins, these two extreme torsion angles correspond to the $\omega$ angles with the a.a. preceding and following the one of interest.

Given an offset $o\in [-2,s-2]$, the first atom defining the 4-tuple of a torsion angle is $\text{traceBB}[(s+o) \bmod s]$.
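The offset arithmetic can be checked with a short standalone sketch. It assumes offsets ranging from $-2$ to $s-2$, as given by the two extreme angles described above, and returns for each of the four atoms its name and the shift of its residue relative to the residue of interest. The alias Shifted_atom and the function backbone_torsion_atoms are hypothetical and not part of the SBL.

```cpp
#include <array>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// One atom of a torsion angle: residue shift relative to residue i (-1, 0 or +1), atom name.
using Shifted_atom = std::pair<int, std::string>;

// Returns the four atoms of the backbone torsion angle whose first atom has the given offset.
inline std::array<Shifted_atom, 4>
backbone_torsion_atoms(const std::vector<std::string>& trace, int offset) {
  const int s = static_cast<int>(trace.size());
  assert(s >= 2 && offset >= -2 && offset <= s - 2);
  std::array<Shifted_atom, 4> atoms;
  for (int k = 0; k < 4; ++k) {
    const int p = offset + k;  // position relative to atom 0 of residue i
    // Floor division: negative positions fall in the previous residue.
    const int shift = (p >= 0) ? p / s : -((-p + s - 1) / s);
    const int index = p - shift * s;  // p mod s, in [0, s-1]
    atoms[k] = Shifted_atom(shift, trace[index]);
  }
  return atoms;
}
```

For the protein trace [N, CA, C], offset $-2$ yields CA and C in the previous residue followed by N and CA in the current one, i.e. the $\omega$ angle with the preceding amino acid; offset $0$ yields N, CA, C in the current residue and N in the next one, i.e. $\psi$.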


Assuming $s\geq 2$, finding the four atoms of a torsion angle involves three cases (only two for proteins):

  • Atoms in $res_{i-1}$ and $res_{i}$,
  • Atoms in $res_{i}$ only,
  • Atoms in $res_{i}$ and $res_{i+1}$.

Therefore it suffices to provide 3 functions:

  • Find_slice_in_previous_residue(offset_slice)
  • Find_slice_in_current_residue(offset_slice)
  • Find_slice_in_next_residue(offset_slice)

Backbone torsion angles with a backbone trace of size $s$: $s+1$ angles. Application: 4 torsion angles for a protein since its backbone trace is $[N, CA, C]$

Side chain torsion angles

The description above applies to a side chain whose atoms are linearly ordered.

In that case, we assume that the table $\text{traceSC}$ contains this linear ordering.

Implementation and functionalities

Generic functions and accessors


(Important) accessors. The implementation of the generic functions presented in the previous section uses the following generic accessors:

The following comments are in order:

  • Atoms iterators return const references, as one should not be able to modify atoms from Molecular_system.
  • Residue_atoms_iterator sanity checks that atoms are present in the covalent structure by fetching their corresponding nodes using their serial atom number.
  • Getting atoms by name through get_atom() and get_incident_atom() returns a std::pair<bool, const Atom*>, to cover the case where the atom is not found.
inline std::pair<bool, const Atom*> get_atom(const std::string& atom_name, const Residue& res) const;
inline std::pair<bool, const Atom*> get_incident_atom(const std::string& incident_atom_name, const Atom& atom) const;
inline std::pair<bool, FT> get_backbone_torsion_angle(int offset_starting_atom, const Residue& res) const;
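The calling pattern implied by these return types is to test the boolean before using the second member. The sketch below illustrates it on toy types: the structs Atom and Residue, the free function get_atom (the accessor above is a member function) and the helper name_or_default are simplified, hypothetical stand-ins.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy types standing in for the atom and residue types of the library.
struct Atom { std::string name; };
struct Residue { std::vector<Atom> atoms; };

// Free-function analogue of get_atom(): (true, pointer) if found, (false, nullptr) otherwise.
inline std::pair<bool, const Atom*> get_atom(const std::string& atom_name, const Residue& res) {
  for (const Atom& a : res.atoms)
    if (a.name == atom_name) return std::make_pair(true, &a);
  return std::make_pair(false, static_cast<const Atom*>(nullptr));
}

// Typical caller: check the flag first, and only then dereference the pointer.
inline std::string name_or_default(const Residue& res, const std::string& atom_name,
                                   const std::string& fallback) {
  const std::pair<bool, const Atom*> found = get_atom(atom_name, res);
  if (!found.first) return fallback;  // atom absent from the residue
  assert(found.second != nullptr);    // guaranteed by the contract above
  return found.second->name;
}
```

For a glycine, which has no CB atom, name_or_default(gly, "CB", "missing") returns "missing" instead of dereferencing a null pointer.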


Side chains and their torsion angles

The backbone trace mentioned above enables processing coherently proteins and nucleic acids.

Side chains, though, require specific processing. However, the following generic iterator is used to process all side chains coherently:

  //! \brief Generic dihedral angle iterator for an arbitrary ordered atom chain
      class Ordered_chain_iterator;

The ordered chain iterator is given an atom names list, corresponding to the side chain, to follow from a starting atom. Each consecutive 4-tuple of atoms encountered by the iterator is then returned as a dihedral angle of the side chain.

At each step the next atom is searched using its name among the incident atoms of the current atom. This process stops when the next atom in the list is not embedded or the end of the list is reached, at which point all available dihedral angles of the side chain have been covered and an iterator end is returned.
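The traversal just described amounts to sliding a window of four atoms along the ordered list. The following sketch reproduces this logic on atom names only: it replaces the search among incident atoms in the covalent graph by a lookup in a set of embedded atom names, which is a simplification. The alias Quadruple and the function ordered_chain_dihedrals are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Names of the four atoms defining one dihedral angle.
using Quadruple = std::array<std::string, 4>;

// names: ordered atom names to follow, starting atom first.
// embedded: names of the atoms which have Cartesian coordinates.
inline std::vector<Quadruple>
ordered_chain_dihedrals(const std::vector<std::string>& names,
                        const std::set<std::string>& embedded) {
  // Walk along the chain until the first atom which is not embedded, or the end of the list.
  std::size_t reachable = 0;
  while (reachable < names.size() && embedded.count(names[reachable]) > 0) ++reachable;
  assert(reachable <= names.size());
  // Each window of four consecutive reachable atoms defines one dihedral angle.
  std::vector<Quadruple> angles;
  for (std::size_t i = 0; i + 4 <= reachable; ++i)
    angles.push_back(Quadruple{{names[i], names[i + 1], names[i + 2], names[i + 3]}});
  return angles;
}
```

For a lysine followed from N, the list N, CA, CB, CG, CD, CE, NZ yields the four side chain angles $\chi_1$ to $\chi_4$ when all atoms are embedded, and only the first two when CE is missing.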

The reader is referred to the following classes for more details:

Construction

The reader is referred to the following resources for the construction of linear polymers:

Advanced

The $xyz$ coordinates are accessible in two guises:

  • Coordinates from the atom of the molecular system. These coordinates are static and do not change over time in the course of a simulation.
  • Coordinates from the Conformation type defined in the traits class. These are the coordinates which may change over time. Note that the float type used for these coordinates may be a type supporting arbitrary precision – as in the package Tripeptide_loop_closure.

This explains the following design:

// typedef typename SBL::CSB::EPIC_kernel_with_atom::FT FT; // Formerly: not correct. imposes a double type incompatible with the Conformation
typedef typename Conformation_traits::FT FT; // Float type from the ConformationTraits... != FT from MolecularSystem
typedef CGAL::Simple_cartesian<FT>::Point_3 Point_3;
// NB: design explanations
// typedef typename SBL::CSB::EPIC_kernel_with_atom::Point_3 Point_3;// Formerly: erroneous since this uses double for xyz coords
//typedef typename Conformation_type::Point_3 Point_3;// Not possible: the conformation does not have to define a Point_3
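The effect of deriving the point type from the number type of the traits class can be illustrated without CGAL. In the sketch below, std::array<FT, 3> stands in for CGAL::Simple_cartesian<FT>::Point_3; the structs Double_conformation_traits, Long_double_conformation_traits and Polymer_sketch are hypothetical names used for this example only.

```cpp
#include <array>
#include <cassert>
#include <type_traits>

// Hypothetical traits classes: each conformation model chooses its own number type.
struct Double_conformation_traits { typedef double FT; };
struct Long_double_conformation_traits { typedef long double FT; };

template <class Conformation_traits>
struct Polymer_sketch {
  // The number type comes from the traits, not from a kernel fixed to double.
  typedef typename Conformation_traits::FT FT;
  // Stand-in for CGAL::Simple_cartesian<FT>::Point_3: built from FT only.
  typedef std::array<FT, 3> Point_3;

  // Any geometric computation is then carried out in the number type of the conformation.
  static Point_3 midpoint(const Point_3& a, const Point_3& b) {
    Point_3 m;
    for (int k = 0; k < 3; ++k) m[k] = (a[k] + b[k]) / FT(2);
    return m;
  }
};
```

Switching the traits class changes FT and Point_3 together, so the conformation never has to define a Point_3 itself.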

Examples

Two examples are provided in the SBL: