Structural Bioinformatics Library
Template C++ / Python API for developing structural bioinformatics applications.

Authors: J. Bernauer, F. Cazals, S. Loriot and S. Guo
Molecular systems are generally loaded from PDB or mmCIF files.
The former format is simple and easy to parse; the latter is much more complex and verbose, and requires more effort to parse.
We therefore use two different parsers: ESBTL for PDB files, and libcifpp for mmCIF files. The former is optimized for PDB files and about 20 times faster than the latter.
These parsers are integrated in a unified parser, which uses ESBTL for PDB files and libcifpp for mmCIF files (see details below). This unified parser delivers a molecular system, as detailed in the package Molecular_system.
We now review them in turn.
ESBTL (Easy Structural Biology Template Library) is a lightweight C++ header-only library for handling PDB data with a data structure suitable for geometric analysis and advanced constructions [138]. The parser and data model provided by this library allow adequate treatment of information that is usually discarded (insertion codes, atom occupancy, etc.) while still being able to detect badly formatted files.
This package was part of SBL until version 1.5.2, and was reintroduced in version 2.0+ in response to user demand for faster PDB loading. ESBTL parses PDB files directly, line by line, which is significantly faster than going through the generic CIF parser (libcifpp) for files in PDB format.
For more information about the original ESBTL library, see http://esbtl.sf.net.
The package provides the following main components:
ESBTL provides access to a hierarchical data structure representing molecular systems:
System
└── Model(s)
    └── Chain(s)
        └── Residue(s)
            └── Atom(s)
The types ESBTL::Default_system, ESBTL::Default_system::Model, ESBTL::Default_system::Chain, ESBTL::Default_system::Residue and ESBTL::Default_system::Atom provide access to all hierarchy levels.
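To make the nesting concrete, here is a minimal, self-contained sketch of such a hierarchy built from plain standard-library containers. The struct names, members and helper functions below are hypothetical and do not reproduce the ESBTL classes; they only illustrate how each level owns its children and how a traversal walks from the top level down to the atoms.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-ins for the five hierarchy levels (NOT the ESBTL types).
struct Atom    { std::string name; double x, y, z; };
struct Residue { std::string name; int seq_num; std::vector<Atom> atoms; };
struct Chain   { char id; std::vector<Residue> residues; };
struct Model   { int number; std::vector<Chain> chains; };
struct System  { std::string name; std::vector<Model> models; };

// Walk System -> Model -> Chain -> Residue and sum the atoms of each residue.
std::size_t count_atoms(const System& system) {
    std::size_t n = 0;
    for (const Model& model : system.models)
        for (const Chain& chain : model.chains)
            for (const Residue& residue : chain.residues)
                n += residue.atoms.size();
    return n;
}

// Build a toy system: one model, one chain, one alanine with two atoms.
System make_toy_system() {
    Residue ala{"ALA", 20, {{"CA", 10.123, 20.456, 30.789},
                            {"CB", 11.000, 21.000, 31.000}}};
    Chain chain{'A', {ala}};
    Model model{1, {chain}};
    return System{"toy", {model}};
}
```

Calling `count_atoms(make_toy_system())` visits every level exactly once, in the same parent-to-child order as the tree above.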
Reading a PDB file with ESBTL is performed in four stages:
Once a system has been built, iterators are provided to access the hierarchy information from any higher level. For 'Father' standing for either System, Model, Chain or Residue, and 'Son' standing for either Model, Chain, Residue or Atom, the following are provided:
Note that the hierarchy must be respected (e.g., if 'Father' is System, 'Son' can only be Model).
Line selectors allow filtering atoms during the parsing stage. Each selection can be stored within a different system. ESBTL provides ESBTL::Generic_line_selector for custom filtering based on atom properties.
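The idea of filtering at parse time, with one destination per selection, can be sketched without any library dependency. In the sketch below, a selector is simply a predicate on a raw PDB line, and each selector fills its own bucket of lines. The names `Line_predicate`, `is_water` and `select_lines` are hypothetical and are not the interface of ESBTL::Generic_line_selector; sending a line to the first selector that accepts it is also an assumption made for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// A selector is any predicate on a raw PDB line (hypothetical interface).
using Line_predicate = std::function<bool(const std::string&)>;

// True for coordinate records, i.e. lines starting with ATOM or HETATM.
bool is_coordinate_record(const std::string& line) {
    return line.compare(0, 6, "ATOM  ") == 0 || line.compare(0, 6, "HETATM") == 0;
}

// True if the residue name (columns 18-20) is HOH, i.e. a water molecule.
bool is_water(const std::string& line) {
    return line.size() >= 20 && line.compare(17, 3, "HOH") == 0;
}

// Read the stream once; each coordinate line goes to the bucket of the first
// selector that accepts it (assumption), and is dropped if none does.
std::vector<std::vector<std::string>>
select_lines(std::istream& in, const std::vector<Line_predicate>& selectors) {
    std::vector<std::vector<std::string>> buckets(selectors.size());
    std::string line;
    while (std::getline(in, line)) {
        if (!is_coordinate_record(line)) continue;
        for (std::size_t i = 0; i < selectors.size(); ++i) {
            if (selectors[i](line)) {
                buckets[i].push_back(line);
                break;
            }
        }
    }
    return buckets;
}
```

With two selectors, `is_water` and its negation, a single pass over the file separates the solvent from the rest of the structure, which mirrors storing each selection in a different system.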
The unified loader uses ESBTL for plain PDB files and the libcifpp C++ library for mmCIF files.
The unified loader dispatches to the appropriate parser based on file extension:
               ┌─────────────────────────────────────────────────────────┐
               │            T_Unified_molecular_system_loader            │
               │ (automatically selects parser based on file extension)  │
               └─────────────────────────────────────────────────────────┘
                                            │
               ┌────────────────────────────┴────────────────────────────┐
               │                                                         │
               ▼                                                         ▼
┌─────────────────────────────┐                           ┌─────────────────────────────┐
│    PDB File (.pdb, .ent)    │                           │   CIF File (.cif, .mmcif)   │
│              │              │                           │              │              │
│              ▼              │                           │              ▼              │
│      ESBTL Line_reader      │                           │     libcifpp cif::file      │
│              │              │                           │              │              │
│              ▼              │                           │              ▼              │
│   ESBTL::Molecular_system   │                           │     cif::mm::structure      │
└─────────────────────────────┘                           └─────────────────────────────┘
               │                                                         │
               └────────────────────────────┬────────────────────────────┘
                                            │
                                            ▼
                             ┌─────────────────────────────┐
                             │ SBL::CSB::Molecular_system  │
                             └─────────────────────────────┘
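The dispatch step can be illustrated with a small standalone function that maps a file name to a parser choice. This is only a sketch of the idea, not the code of T_Unified_molecular_system_loader: the enum and function names are hypothetical, and both the case-insensitive comparison and the exception thrown for an unknown extension are assumptions. The override parameter mimics the effect of the --use-esbtl and --use-cifpp options.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>

enum class Parser_kind { ESBTL, LIBCIFPP };
enum class Parser_override { NONE, FORCE_ESBTL, FORCE_LIBCIFPP };

// Return the extension (including the dot) in lower case, or "" if there is none.
std::string lowercase_extension(const std::string& filename) {
    const std::size_t slash = filename.find_last_of("/\\");
    const std::size_t dot = filename.find_last_of('.');
    // A dot located before the last path separator belongs to a directory name.
    if (dot == std::string::npos || (slash != std::string::npos && dot < slash))
        return "";
    std::string ext = filename.substr(dot);
    std::transform(ext.begin(), ext.end(), ext.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return ext;
}

// Choose a parser from the extension, unless a parser is forced.
Parser_kind select_parser(const std::string& filename,
                          Parser_override forced = Parser_override::NONE) {
    if (forced == Parser_override::FORCE_ESBTL) return Parser_kind::ESBTL;
    if (forced == Parser_override::FORCE_LIBCIFPP) return Parser_kind::LIBCIFPP;
    const std::string ext = lowercase_extension(filename);
    if (ext == ".pdb" || ext == ".ent") return Parser_kind::ESBTL;
    if (ext == ".cif" || ext == ".mmcif") return Parser_kind::LIBCIFPP;
    // ASSUMPTION: unknown extensions are rejected rather than guessed.
    throw std::invalid_argument("unsupported file extension: " + ext);
}
```

For instance, `select_parser("1abc.pdb")` selects the ESBTL branch of the diagram and `select_parser("1abc.cif")` the libcifpp branch.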
The T_Molecular_covalent_structure_loader now inherits from T_Unified_molecular_system_loader, enabling automatic ESBTL-based parsing for PDB files throughout the MCS loading pipeline:
T_Protein_representation_loader / T_Biomolecule_representation_loader
                    ▲
                    │
T_Molecular_covalent_structure_loader (builds MCS using appropriate builder)
                    ▲
                    │
T_Unified_molecular_system_loader (base class - uses ESBTL for PDB, libcifpp for CIF)
The unified molecular system loader provides the following command-line options:
Molecular system loader:
  -f [ --filename ] arg               Input file(s) - PDB (.pdb, .ent) or CIF (.cif, .mmcif)
  --water                             Load water molecules (default: false)
  --hetatoms                          Load hetero-atoms (default: false)
  --hydrogens                         Load hydrogen atoms (default: false)
  -a [ --alternate ] arg (= )         Alternate location to use (default: first found)
  -p [ --occupancy-policy ] arg (=3)  Occupancy policy: 1=all, 2=none, 3=max (default), 4=min
  -B [ --B-factor-limit ] arg         Temperature factor limit (default: no limit)
  --load-chains arg                   Chains to load (e.g., A,B,C)
  --load-models arg                   Models to load (e.g., 1,2,3)
  --use-esbtl                         Force ESBTL parser for all files (PDB format only)
  --use-cifpp                         Force libcifpp parser for all files
In a PDB file, the occupancy (columns 55-60) is a value between 0.0 and 1.0 giving the fraction of molecules in the crystal in which an atom occupies a particular position. When an atom has multiple possible positions (disorder), each position is recorded with:
             atom name
             |  alternate location
             |  |
             v  v
ATOM    145  CA AALA A  20      10.123  20.456  30.789  0.60 15.00           C
ATOM    146  CA BALA A  20      10.456  20.123  30.456  0.40 18.00           C
                                                         ^     ^
                                                  occupancy B-factor
Occupancy policies apply ONLY to atoms with multiple alternate positions. Atoms with a single position (no alternate location) are always kept, regardless of their occupancy value.
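Because PDB is a fixed-column format, the fields discussed here can be read directly from their column ranges. The sketch below extracts the atom name (columns 13-16), the alternate location indicator (column 17), the occupancy (columns 55-60) and the B-factor (columns 61-66) from one ATOM line. The struct and function names are hypothetical and this is not the ESBTL parser; the default values used when a field is blank or missing are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// The subset of an ATOM record relevant to occupancy policies (hypothetical type).
struct Atom_record {
    std::string name;   // columns 13-16, surrounding blanks removed
    char alt_loc;       // column 17, ' ' when there is no alternate location
    double occupancy;   // columns 55-60
    double b_factor;    // columns 61-66
};

// Extract the 1-based, inclusive column range and strip surrounding blanks.
std::string trimmed_field(const std::string& line, std::size_t first_col, std::size_t last_col) {
    assert(first_col >= 1 && first_col <= last_col);
    if (line.size() < first_col) return "";
    const std::string field = line.substr(first_col - 1, last_col - first_col + 1);
    const std::size_t begin = field.find_first_not_of(' ');
    if (begin == std::string::npos) return "";
    const std::size_t end = field.find_last_not_of(' ');
    return field.substr(begin, end - begin + 1);
}

// Parse one ATOM/HETATM line; numeric fields are expected to be well formed.
Atom_record parse_atom_record(const std::string& line) {
    Atom_record record;
    record.name = trimmed_field(line, 13, 16);
    record.alt_loc = line.size() >= 17 ? line[16] : ' ';
    const std::string occupancy = trimmed_field(line, 55, 60);
    const std::string b_factor = trimmed_field(line, 61, 66);
    // ASSUMPTION: a blank occupancy defaults to 1.0 and a blank B-factor to 0.0.
    record.occupancy = occupancy.empty() ? 1.0 : std::stod(occupancy);
    record.b_factor = b_factor.empty() ? 0.0 : std::stod(b_factor);
    return record;
}
```

Applied to the two CA records shown above, this yields alternate locations 'A' and 'B', which is the information an occupancy policy needs to decide which position to keep.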
Both loaders (ESBTL for PDB and libcifpp for CIF) use the same policy enumeration and apply policies consistently to atoms with multiple alternate positions:
| Policy | Value | Behavior |
|---|---|---|
| OP_ALL | 1 | Keep all alternate positions (no filtering) |
| OP_NONE | 2 | Discard all alternate positions |
| OP_MAX | 3 | Keep only the alternate with highest occupancy (B-factor tie-break, default) |
| OP_MIN | 4 | Keep only the alternate with lowest occupancy (B-factor tie-break) |
ESBTL implements occupancy policies as template classes that only act on atoms with multiple alternate positions:
SBL's CIF loader (T_Molecular_system_loader) implements the same policies by filtering atoms based on label_alt_id (alternate location identifier in CIF format).
Consider this structure with two alternate positions for CA:
ATOM    145  CA AALA A  20      10.123  20.456  30.789  0.60 15.00           C
ATOM    146  CA BALA A  20      10.456  20.123  30.456  0.40 18.00           C
With --occupancy-policy 1 (ALL), both PDB and CIF loaders will keep BOTH atoms (all alternate positions are preserved).
With --occupancy-policy 2 (NONE), both PDB and CIF loaders will discard BOTH atoms (atoms 145 and 146) because they have alternate locations. Only atoms with a single position (no alternate location indicator) are kept.
With --occupancy-policy 3 (MAX), both PDB and CIF loaders will keep only the atom with the highest occupancy: 0.60 (alternate A), discarding the one with 0.40 (alternate B).
With --occupancy-policy 4 (MIN), both PDB and CIF loaders will keep only the atom with the lowest occupancy: 0.40 (alternate B), discarding the one with 0.60 (alternate A).
However, for an atom with only one position:
ATOM    147  CB  ALA A  20      11.000  21.000  31.000  0.80 12.00           C
This atom is always kept regardless of the policy, even though its occupancy is 0.80 < 1.0. The occupancy policy only applies when choosing among multiple alternate positions.
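The four behaviours described above can be reproduced with a short standalone filter. This is a sketch under stated assumptions, not the ESBTL template classes nor the SBL CIF loader: the `Atom_site` struct, its `key` member (a string identifying the atom independently of its alternate location) and the function names are hypothetical. An atom is treated as having alternates when its alternate location indicator is not blank, and ties in occupancy are broken in favour of the lower B-factor, which is an assumption since the direction of the tie-break is not specified here.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

enum Occupancy_policy { OP_ALL = 1, OP_NONE = 2, OP_MAX = 3, OP_MIN = 4 };

// Hypothetical atom description used only for this illustration.
struct Atom_site {
    int serial;
    std::string key;   // same for all alternates of one atom, e.g. "A/20/CA"
    char alt_loc;      // ' ' when the atom has a single position
    double occupancy;
    double b_factor;
};

// True if alternate a should replace alternate b under OP_MAX or OP_MIN.
bool preferred(const Atom_site& a, const Atom_site& b, Occupancy_policy policy) {
    if (a.occupancy != b.occupancy)
        return policy == OP_MAX ? a.occupancy > b.occupancy : a.occupancy < b.occupancy;
    return a.b_factor < b.b_factor;  // ASSUMPTION: lower B-factor wins ties
}

std::vector<Atom_site> apply_policy(const std::vector<Atom_site>& atoms,
                                    Occupancy_policy policy) {
    // First pass (OP_MAX / OP_MIN only): index of the preferred alternate per atom.
    std::map<std::string, std::size_t> best;
    if (policy == OP_MAX || policy == OP_MIN) {
        for (std::size_t i = 0; i < atoms.size(); ++i) {
            if (atoms[i].alt_loc == ' ') continue;
            auto it = best.find(atoms[i].key);
            if (it == best.end()) best[atoms[i].key] = i;
            else if (preferred(atoms[i], atoms[it->second], policy)) it->second = i;
        }
    }
    // Second pass: emit atoms in input order.
    std::vector<Atom_site> kept;
    for (std::size_t i = 0; i < atoms.size(); ++i) {
        const Atom_site& atom = atoms[i];
        if (atom.alt_loc == ' ' || policy == OP_ALL) kept.push_back(atom);  // always kept
        else if (policy == OP_NONE) continue;                               // drop alternates
        else if (best.at(atom.key) == i) kept.push_back(atom);              // preferred one
    }
    return kept;
}
```

With atoms 145, 146 and 147 from the examples above as input, OP_ALL returns all three atoms, OP_NONE returns only atom 147, OP_MAX returns atoms 145 and 147, and OP_MIN returns atoms 146 and 147.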