Structural Bioinformatics Library
Template C++ / Python API for developing structural bioinformatics applications.
User Manual

Molecular_system

Authors: F. Cazals and S. Loriot and C. Le Breton

Introduction

Hierarchical organization


Hierarchy. This package loads and stores molecular systems from PDB and mmCIF files, which are described below.

A Molecular system, in the context of a PDB (Protein Data Bank) or mmCIF (macromolecular Crystallographic Information File) file, refers to a representation of biomolecules such as proteins, nucleic acids, or complexes thereof. It serves as a hierarchical data structure that organizes and stores the information extracted from these files in a standardized format.

A Molecular system represents a filtered version of the biomolecular entity described in the file: it stores only the subset of the file's information that is needed for structural analysis and modeling. This subset is organized so that the components of the biomolecule can be navigated systematically, following the hierarchy:

Molecular system > Molecular model > Molecular chain > Molecular residue > Molecular atom

Let us inspect these levels:

  • At the highest level of the hierarchy, the Molecular system acts as a container for one or more Molecular models. A Molecular model represents a particular instantiation of the biomolecule. Structures obtained by X-ray crystallography or Cryo-Electron Microscopy typically contain a single model, while Nuclear Magnetic Resonance (NMR) studies may yield multiple models, representing an ensemble of conformations.
  • Each Molecular model contains one or more Molecular chains. A Molecular chain represents a distinct polypeptide chain within the biomolecule (nucleotide chains are not yet supported). It serves as a container for Molecular residues.
  • Molecular residues are the structural units of the biomolecule, such as amino acids in proteins (nucleotides in nucleic acids are not yet supported). They are organized within Molecular chains and in turn contain Molecular atoms.
  • Molecular atoms classically represent individual atoms within the biomolecule, such as carbon, nitrogen, or oxygen. They can nonetheless represent larger components (assemblies of atoms, or particles in alternative computational models). They possess properties such as Cartesian coordinates, which define their positions in three-dimensional space.
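
The five levels above can be mirrored by a small set of nested containers. The following Python sketch is purely illustrative and is not the SBL API: the class names (System, Model, Chain, Residue, Atom), their fields, and the toy coordinates are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Atom:
    name: str
    xyz: Tuple[float, float, float]  # Cartesian coordinates


@dataclass
class Residue:
    name: str
    number: int
    atoms: List[Atom] = field(default_factory=list)


@dataclass
class Chain:
    chain_id: str
    residues: List[Residue] = field(default_factory=list)


@dataclass
class Model:
    model_number: int
    chains: List[Chain] = field(default_factory=list)


@dataclass
class System:
    name: str
    models: List[Model] = field(default_factory=list)


def count_atoms(system: System) -> int:
    """Walk system > model > chain > residue and count the atoms."""
    return sum(
        len(residue.atoms)
        for model in system.models
        for chain in model.chains
        for residue in chain.residues
    )


# Toy system (made-up coordinates): one model, one chain, one residue, two atoms.
gly = Residue("GLY", 1, [Atom("N", (0.0, 0.0, 0.0)), Atom("CA", (1.5, 0.0, 0.0))])
system = System("toy", [Model(1, [Chain("A", [gly])])])
print(count_atoms(system))
```

An NMR ensemble would simply correspond to several Model entries in the same System.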

Input files and filtering of atoms

File formats. Molecular systems are (generally) loaded from PDB/mmCIF files parsed using the libcifpp library. PDB/mmCIF files contain a variety of pieces of information (SSE, etc.); see the package Molecular_system for those that are loaded into SBL data structures. In practice, the following file formats are used:

  • The (legacy) PDB format: the reference format for biomolecules whose structures have been resolved and deposited in the Protein Data Bank (PDB website).
  • The mmCIF format: the macromolecular Crystallographic Information File format, which supersedes the legacy PDB format.

mmCIF and loaded pieces of information. A molecular system is based on the atomic information provided in PDB or mmCIF files.

The following pieces of information are retrieved from an mmCIF file: atomic information, secondary structure elements (SSE), and disulfide (SS) bonds.

Filtering of atoms. The class SBL::IO::T_Molecular_system_loader from the package Molecular_system implements the following default behaviors:

  • model selection (relevant for NMR input files only): all models are loaded.
  • chain selection: all chains are loaded.
  • hydrogen atoms: not loaded.
  • water molecules: not loaded.
  • hetero-atoms: not loaded.
  • occupancy mode: the entry with the maximum occupancy is kept by default. Ties are broken using the smallest B factor, and remaining ties by taking the first occurrence.
  • B factor limit: all values are allowed.
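
The default rules above can be sketched in Python as follows. This is a standalone illustration and not the interface of SBL::IO::T_Molecular_system_loader: the AtomRecord fields, the function names, the use of the residue name HOH to recognize water, and the grouping of competing entries by (chain, residue number, atom name) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class AtomRecord:
    # Hypothetical record layout, not an SBL or libcifpp type.
    chain: str
    resnum: int
    resname: str
    name: str
    element: str
    altloc: str
    occupancy: float
    bfactor: float
    hetero: bool = False


def keep_by_default(rec: AtomRecord) -> bool:
    """Default filters: no hydrogens, no waters, no hetero-atoms."""
    if rec.element == "H":
        return False
    if rec.resname == "HOH":  # assumption: water is named HOH
        return False
    if rec.hetero:
        return False
    return True


def select_by_occupancy(records: List[AtomRecord]) -> List[AtomRecord]:
    """Keep one record per atom: max occupancy, then min B factor, then first seen."""
    best: Dict[Tuple[str, int, str], Tuple[Tuple[float, float, int], AtomRecord]] = {}
    for index, rec in enumerate(records):
        key = (rec.chain, rec.resnum, rec.name)
        # Smaller rank wins: higher occupancy, lower B factor, earlier position.
        rank = (-rec.occupancy, rec.bfactor, index)
        if key not in best or rank < best[key][0]:
            best[key] = (rank, rec)
    return [rec for _, rec in best.values()]


records = [
    AtomRecord("A", 1, "SER", "CA", "C", "A", 0.5, 20.0),
    AtomRecord("A", 1, "SER", "CA", "C", "B", 0.5, 15.0),
    AtomRecord("A", 1, "SER", "HA", "H", "", 1.0, 10.0),
    AtomRecord("A", 2, "HOH", "O", "O", "", 1.0, 30.0, hetero=True),
]
kept = select_by_occupancy([r for r in records if keep_by_default(r)])
print([(r.name, r.altloc) for r in kept])
```

In this toy input the two CA entries have equal occupancy, so the one with the smaller B factor (alternate location B) is retained; the hydrogen and the water are discarded by the filters.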

Implementation and functionalities

Classes

  • SBL::CSB::Molecular_system is essentially a container for Molecular_model items. It provides nested iterators over the items it contains, up to the Molecular_atom.
  • SBL::CSB::Molecular_model is essentially a container for Molecular_chain items. It provides nested iterators over the items it contains, up to the Molecular_atom. Usually, a Molecular_system contains one model for X-ray crystallography and Cryo-Electron Microscopy, and several models for Nuclear Magnetic Resonance.
  • SBL::CSB::Molecular_chain is essentially a container for Molecular_residue items. It provides nested iterators over the items it contains, up to the Molecular_atom.
  • SBL::CSB::Molecular_residue is essentially a container for Molecular_atom items. It provides iterators for the Molecular_atom items it contains.
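
The idea of nested iterators, where any level can enumerate its direct children or descend all the way to the atoms, can be illustrated with Python generators. The sketch below is an analogue only: the Container class and its atoms() method are hypothetical names and do not reproduce the C++ iterator types of the SBL.

```python
class Container:
    """One level of the hierarchy: iterates over its direct children."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

    def __iter__(self):
        # Direct children only (e.g. the chains of a model).
        return iter(self.children)

    def atoms(self):
        # Nested iteration: descend through every level down to the atoms.
        for child in self:
            if isinstance(child, Container):
                yield from child.atoms()
            else:
                yield child  # leaves are atoms, represented here by their names


gly = Container("GLY 1", ["N", "CA"])
ala = Container("ALA 2", ["N", "CA", "CB"])
chain = Container("chain A", [gly, ala])
model = Container("model 1", [chain])
system = Container("system", [model])

print([child.label for child in system])  # direct children: the models
print(list(system.atoms()))               # all atoms, whatever the depth
print(list(chain.atoms()))                # the same call works at any level
```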

Defaults and variants

Several sets of these molecular items are provided in the SBL:

  • A default version of the Molecular_system items is provided as the Default_system_items struct.
  • System_items_with_coarse_grain is a more refined version that redefines Molecular_residue, replacing it with a Coarse_residue which inherits from the default Molecular_residue. It allows working with geometrical approximations by providing a coarse-grained definition of molecular models and is used, for instance, in Space_filling_model_coarse_graining.

Since each of the molecular items is templated by the Molecular_system, it is possible to determine the type of each item based on the others. For instance, with a reference to a Molecular_atom, the Molecular_residue, Molecular_chain, Molecular_model, and Molecular_system to which it belongs can be inferred and retrieved.
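
The retrieval of the enclosing items from an atom can be pictured with simple parent links. The Python sketch below only illustrates this navigation; it does not capture the compile-time type deduction offered by the C++ templates, and the Node class and its ancestors() method are hypothetical names.

```python
class Node:
    """An item of the hierarchy that remembers the item containing it."""

    def __init__(self, kind, label, parent=None):
        self.kind = kind
        self.label = label
        self.parent = parent

    def ancestors(self):
        # Walk upwards: residue, chain, model, system.
        node = self.parent
        while node is not None:
            yield node
            node = node.parent


system = Node("system", "toy")
model = Node("model", 1, parent=system)
chain = Node("chain", "A", parent=model)
residue = Node("residue", "GLY 1", parent=chain)
atom = Node("atom", "CA", parent=residue)

print([(n.kind, n.label) for n in atom.ancestors()])
```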

Connections to other important classes

Relationship to covalent structure Molecular_covalent_structure. A molecular covalent structure is a graph whose vertices have properties called particle info, which in the simplest case is a string. For a protein, SBL::CSB::Particle_info_for_protein contains a pointer to an instance of SBL::CSB::Molecular_atom. (NB: this pointer is null if the atom is absent from the input file, e.g. a hydrogen atom.)

For a polypeptide chain, the covalent structure can be constructed from the molecular system if all amino acids are present. If not, using the amino acid sequence is mandatory.

We also note in passing that disulfide bridges are defined from pairs of CYS residues within a prescribed distance threshold; see Pointwise_interactions. This strategy makes it possible to identify SS bonds not listed in the mmCIF file.
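
A distance-based detection of disulfide bridges can be sketched as follows. This is a minimal illustration, not the implementation found in Pointwise_interactions: the function name, the choice of measuring the distance between the SG (sulfur) atoms of the two cysteines, and the 2.5 Å default threshold are assumptions made for the example.

```python
import math
from itertools import combinations


def find_disulfides(sg_atoms, threshold=2.5):
    """Return the pairs of CYS residues whose SG atoms lie within `threshold`.

    sg_atoms: list of (residue_id, (x, y, z)) tuples, one per CYS SG atom.
    threshold: distance cutoff in angstroms (hypothetical default).
    """
    bonds = []
    for (id_a, xyz_a), (id_b, xyz_b) in combinations(sg_atoms, 2):
        if math.dist(xyz_a, xyz_b) <= threshold:
            bonds.append((id_a, id_b))
    return bonds


# Made-up coordinates: the first two SG atoms are 2.0 angstroms apart.
sg_atoms = [
    (("A", 22), (0.0, 0.0, 0.0)),
    (("A", 65), (2.0, 0.0, 0.0)),
    (("A", 90), (10.0, 0.0, 0.0)),
]
print(find_disulfides(sg_atoms))
```

Because the test relies on coordinates only, it reports a bridge whether or not the input file declares it.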

Relationship to molecular conformations Molecular_conformation. A molecular model corresponds to one conformation, setting aside alternate locations. A conformation can be built from a molecular model; see Molecular_conformation.

Relationship to protein representation Protein_representation. The classes SBL::CSB::Polypeptide_chain and SBL::CSB::Protein_representation bridge the gap between atoms in the covalent structure and coordinates in the conformation. The mapping between the two is done using the chain/resid/atom ids. Atoms in the covalent structure devoid of coordinates are termed not embedded.
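
The id-based mapping, and the notion of atoms that are not embedded, can be illustrated with a dictionary keyed by (chain, resid, atom) triples. The sketch below is an assumption-laden illustration: the function name and the data layout are hypothetical and do not correspond to SBL::CSB::Protein_representation.

```python
def split_embedded(covalent_atoms, coordinates):
    """Match covalent-structure atoms to coordinates through their ids.

    covalent_atoms: list of (chain, resid, atom_name) ids.
    coordinates: dict mapping such ids to (x, y, z) tuples.
    Returns (embedded, not_embedded).
    """
    embedded = {key: coordinates[key] for key in covalent_atoms if key in coordinates}
    not_embedded = [key for key in covalent_atoms if key not in coordinates]
    return embedded, not_embedded


# The covalent structure knows the hydrogen HA, but the input file gave it no coordinates.
covalent_atoms = [("A", 1, "N"), ("A", 1, "CA"), ("A", 1, "HA")]
coordinates = {
    ("A", 1, "N"): (0.0, 0.0, 0.0),
    ("A", 1, "CA"): (1.5, 0.0, 0.0),
}
embedded, not_embedded = split_embedded(covalent_atoms, coordinates)
print(sorted(embedded))
print(not_embedded)
```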

Labels. The previous hierarchy does not take into account the possible assignment of chains/residues/atoms to specific sub-systems or partners. This possibility is provided by classes from the package MolecularSystemLabelsTraits.