Biomolecule_representation

Authors: F. Cazals and C. Robert

Introduction

As its title figure suggests, this package makes it possible to coherently handle proteins and nucleic acids. As such, it has strong connexions with the following packages:

Protein_representation : package providing classes to represent polypeptide chains and proteins.

Nucleic_acid_representation : package providing classes to represent polynucleotide chains in nucleic acids.

Linear_polymer_representation : generic package underlying both polypeptide chains and polynucleotide chains, which are both linear polymers.

Molecular_covalent_structure : package handling molecular covalent structures, which are used both for polypeptide and polynucleotide chains.

Molecular_system : package for loading the PDB/mmCIF file.

The documentation of this package is kept as concise as possible, and we only focus on the architecture used to handle both types of biomolecules.

Loading biomolecular representations: example

The following snippet (from sbl-biomolecule-info.cpp) shows how to load a file and access the corresponding biomolecules and their chains (protein: polypeptide chains; nucleic acid: polynucleotide chains).

File: sbl-biomolecule-info.cpp

int main(int argv, char** argc){
 
  Biomolecule_loader loader;
  parse_command_line(argv, argc, loader);
  bool b = loader.load(true, std::cout);
  if(!b)
    return -1;
 
  // Protein chains + nucleic acid chains
  //Protein_representation& prot = *(loader.get_protein_representations()[0]);
  std::vector<Protein_representation*>&    prots = loader.get_protein_representations();
  std::vector<Nucleic_acid_representation*>& nas = loader.get_nucleic_acid_representations();
  std::cerr << (boost::format("Build type checks wtih num prots and nas: %d %d\n") % prots.size() % nas.size()).str();
 
  for(auto prot : prots)
    for(Protein_representation::Polypeptide_chains_iterator    it = prot->chains_begin(); it != prot->chains_end(); it++)
      stats_polypeptide_chain(it->second);
 
  for(auto na : nas)
    for(Nucleic_acid_representation::Polynucleotide_chains_iterator  it = na->chains_begin(); it != na->chains_end(); it++)
      stats_polynucleotide_chain(it->second);
 
  return 0;
 
}

(Advanced) Implementation

This section presents the architecture for handling biomolecules.

To facilitate reading, recall that a builder builds a data structure, while a loader loads a PDB/mmCIF file and calls a builder.

Architecture

In short, the construction of biomolecules from a PDB/mmCIF file is handled as follows:

Step 0. The molecular system is loaded from the PDB/mmCIF file.
Step 1.
1. A unique molecular covalent structure is built, in two steps: first, the residues of the molecular systems are linked into polymer chains of amino acids or nucleotides; second, each polymer chain is added to the MCS.
  
  In general, polypeptide and polynucleotide chains are disconnected. Which means that upon building all polymers, the number of connected components of the MCS is exactly the number of chains.
2. The geometries, also referred to as the atomic coordinates or the molecular conformation(s), are stored.
Step 2. The protein and nucleic acid representations are created. To this end, a new MCS is created per polypeptide or polynucleotide chain, and is attributed to its biomolecule (protein or nucleic acid).

The property stating that the number of connected components matches the number of chains does not hold if one inserts extra bonds, e.g. a disulfide bond, or a bond connecting a protein and a nucleic acid.

We now detail these steps and the associated classes.

$\begin{algorithm}\begin{algorithmic}[1] \Procedure{BiomolecularRepresentationLoader}{PDB/mmCIF file(s)} \State Load file(s) and record sequence of residues and conformations \For{each file} \State Call MCS builder to build the global MCS of that file \State Extract the protein and nucleic acid representation \EndFor \EndProcedure \end{algorithmic} \end{algorithm}$

Details

We now detail the previous steps, referring to the C++ classes provided.

Hierarchy of MCS builders. As noticed above, a MCS builder creates a graph with as many subgraphs as there are chains in the biomolecule. Because our builders handle linear polymers, they are implemented in a generic fashion which factors out common functionalities – Fig. fig-builders-diamond.

We have seen in the Molecular_covalent_structure package that various instantiations of MCS are available, corresponding to the pieces of information associated with an atom. As detailed in section Molecular Covalent Structure: instantiations, the corresponding suffixes are FIT, FIAT, HIT, and HIAT.

Molecular covalent structure builders: the diamond pattern used for biomolecules. The generic builder handles operations common to all linear polymers. The two derived classes handle chains of amino acids and nucleotides, respectively. Finally, the builder for biomolecules uses the previous two as template parameters, to convert connected stretches of residues into a molecular covalent structure.

The MCS builder for biomolecules and the switch Protein vs Nucleic Acid. A MCS builder for biomolecules requires template parameters specifying a builder for proteins and one for nucleic acids. Given these template parameters, the sketch of the construction, using a switch to distinguish proteins from nucleic acids, is as follows:

File: Molecular_covalent_structure_builder_biomolecules.hpp

 template <class CovStructBuilderForProteins, class CovStructBuilderForNucleicAcids>
    template <class Residue_info>
    bool     T_Molecular_covalent_structure_builder_for_biomolecules<CovStructBuilderForProteins,CovStructBuilderForNucleicAcids>::
    operator()(const std::map<std::string, std::vector<Residue_info>>& chain_residue_info, Molecular_covalent_structure& S) {
 
        // Processes stretches of residues, each defining a chain
        for(auto& residue_infos : chain_residue_info) {
 
        if (is_sequence_of_amino_acids(residue_info))
          {m_build_type = BUILD_PROTEIN; S.record_type(chainid, BUILD_PROTEIN); std::cerr << "Building a protein\n"; }
        else if (is_sequence_of_nucleotides(residue_info))
          {m_build_type = BUILD_NA; S.record_type(chainid, BUILD_NA);  std::cerr << "Building a nucleic acid\n"; }
 
}
 
}

The following are critical observations:

In the previous snippet, the MCS denoted S contains all chains, be they polypeptide chains or polynucleotide chains.

The switch is_sequence_of_amino_acids(residue_info) versus is_sequence_of_amino_acids(residue_info) is determined by the number of letters coding the nature of the residue: 3 characters for an amino acid, 1 or 2 characters for a nucleotide.

In the sequel, builder refers to an instance of this builder for biomolecules.

Step 1: building the global MCS with the MCS loader. The MCS loader loads PDB files and call the builder building one MCS per PDB/mmCIF file processed.

File: Molecular_covalent_structure_loader.hpp

 template <class Molecular_covalent_structure_builder>
    void T_Molecular_covalent_structure_loader<Molecular_covalent_structure_builder_>::
    load_molecular_covalent_structures(void) {
 
    // Step 1: building the global MCS
    for (auto& fp : Base::m_loaded_file_paths) {
 
          // Residues associated to a given chain: map the chain name to its residues
         std::map<std::string, std::vector<Residue_info>> chain_residue_info;
 
        // Read the file and identify the residues of a chain
        std::filesystem::path file_path(fp);
        bool b = Self::get_residue_sequence(file_path, chain_residue_info);
 
        // Build the MCS for that file
        Molecular_covalent_structure mcs;
        bool b = mcs_builder(chain_residue_info, mcs);
        ...
  }
 
}

The code shown here is generic. When the builder passed is a builder for biomolecules, it can create subgraphs corresponding to polypeptide or polynucleotide chains. But it can also be passed a builder for proteins only (package Protein_representation), or for nucleic acids only (package Nucleic_acid_representation).

Step 2: building the protein and nucleic acid representations with the Biomolecule representation loader.

The biomolecule representation loader is parameterized by the two returned types (ProteinRepresentation and NucleicAcidRepresentation), and the class used to perform Step 1 (the MCSLoader).

File: Biomolecule_representation_loader.hpp

template <class ProteinRepresentation, class NucleicAcidRepresentation, class MCSLoader> // = Default_protein_representation>
  class T_Biomolecule_representation_loader : public Loader_base
  {
  ...
  }

The instantiation using the aforementioned lineage HIT of data structures is:

typedef SBL::IO::T_Biomolecule_representation_loader<Protein_representation_HIT, Nucleic_acid_representation_HIT, MCS_loader_biomols_HIT> Biomolecule_loader_HIT;

SBL::IO::T_Biomolecule_representation_loader

Loader for proteins from PDB and mmCIF files using SBL.

Definition Biomolecule_representation_loader.hpp:44

Then, the creation of protein and nucleic acid representations goes as follows:

Creation of tuples for all combinations of chains of the global MCS $\times$ geometries (.e.g PDB models).
Assignment of a given chain to its protein or nucleic acid representation.

File Biomolecule_representation_loader.hpp

template <class ProteinRepresentation, class NucleicAcidRepresentation, class MCSLoader>
  bool    T_Biomolecule_representation_loader <ProteinRepresentation,NucleicAcidRepresentation,MCSLoader>::
  load_protein_and_na_representations(unsigned verbose, std::ostream& out) {
 
  for(auto tup : tuples) {
        // Create the protein representation
        if (tmp->get_my_chain_type() == SBL::CSB::BUILD_PROTEIN){
          m_protein_representations.push_back(new Protein_representation());
          auto& c_rep = m_protein_representations.back();
          // See Protein_representation.hpp: void create_chain(char c, Chain& chain, Molecular_covalent_structure& structure, const Conformation_type& C)
          c_rep->create_chain(boost::get<0>(tup).second.chain_identifier(), boost::get<0>(tup).second, *(boost::get<1>(tup)), *(boost::get<2>(tup)));
        }
        // Create the nucleic acid representation
        else if (tmp->get_unique_type() == SBL::CSB::BUILD_NA){
          m_nucleic_acid_representations.push_back(new Nucleic_acid_representation());
          auto& c_rep = m_nucleic_acid_representations.back();
          // See Nucleic_acid_representation.hpp: void create_chain(char c, Chain& chain, Molecular_covalent_structure& structure, const Conformation_type& C);
          c_rep->create_chain(boost::get<0>(tup).second.chain_identifier(), boost::get<0>(tup).second, *(boost::get<1>(tup)), *(boost::get<2>(tup)));
        }
      } // tuples: each tuple yields a chain in a protein or nucleic acid representation
  ...
 }

NB: In the code above, get_my_chain_type() refers to the fact that the MCS used to create a protein representation necessarily hosts a single chain, as opposed to the global MCS used in Step 1.