ESBTL

Authors: J. Bernauer, F. Cazals, S. Loriot and S. Guo

Introduction

File formats: PDB and mmCIF

Molecular systems are (generally) loaded from PDB/mmCIF files:

The (legacy) PDB format: the reference format for biomolecules whose structures have been resolved and deposited in the Protein Data Bank (PDB website).

The mmCIF format, which is now the default format for the Protein Data Bank.

The former is simple and easy to parse; the latter is much more complex and verbose, and requires more effort to parse.

We therefore use two different parsers: ESBTL for PDB files, and libcifpp for mmCIF files. The former is optimized for PDB files and is about 20 times faster than the latter.

These parsers are integrated in a unified parser, which uses ESBTL for PDB files and libcifpp for mmCIF files (see details below). This unified parser delivers a molecular system, as detailed in the package Molecular_system.

We now review them in turn.

For molecules other than peptides and proteins, we use the MOL format – see section Beyond the PDB format and MOL format.

The ESBTL loader

Overview

ESBTL (Easy Structural Biology Template Library) is a lightweight C++ header-only library for handling PDB data with a data structure suitable for geometric analysis and advanced constructions [138]. The parser and data model provided by this library allow adequate treatment of information that is usually discarded (insertion codes, atom occupancy, etc.) while still being able to detect badly formatted files.

This package was part of SBL until version 1.5.2, and was reintroduced in version 2.0+ in response to user demand for faster PDB loading. ESBTL parses PDB files directly, line by line, which is significantly faster than running the generic CIF parser (libcifpp) on PDB format files.

For more information about the original ESBTL library, see http://esbtl.sf.net.

Architecture

The package provides the following main components:

T_ESBTL_molecular_system_loader: A fast PDB-specific loader using ESBTL's optimized line-by-line parser. Outputs SBL::CSB::Molecular_system objects compatible with the rest of SBL.

Hierarchical Data Structure

ESBTL provides access to a hierarchical data structure representing molecular systems:

System
  └── Model(s)
        └── Chain(s)
              └── Residue(s)
                    └── Atom(s)

The types ESBTL::Default_system, ESBTL::Default_system::Model, ESBTL::Default_system::Chain, ESBTL::Default_system::Residue and ESBTL::Default_system::Atom provide access to all hierarchy levels.

Reading a PDB File

Reading a PDB file with ESBTL is performed in four stages:

A system type and a container for the systems have to be defined.
A line selector is used to define what the systems will be made of.
A builder is used to fill the system container (see ESBTL::All_atom_system_builder).
An occupancy policy has to be chosen (see Occupancy policies).

Example using ESBTL directly

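#include <ESBTL/default.h>
#include <string>
#include <vector>

// Occupancy policy instantiated on ESBTL's default PDB line format
typedef ESBTL::Accept_none_occupancy_policy<ESBTL::PDB::Line_format<> > Accept_none_occupancy_policy;

// Create one system with all atoms
ESBTL::PDB_line_selector sel;
std::vector<ESBTL::Default_system> systems;

// Build the system from the PDB file (filename is the path to the file, e.g. a std::string)
ESBTL::All_atom_system_builder<ESBTL::Default_system> builder(systems, sel.max_nb_systems());
ESBTL::read_a_pdb_file(filename, sel, builder, Accept_none_occupancy_policy());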

Iterators

Once a system has been built, iterators are provided to access the hierarchy information from any higher level. For 'Father' standing for either System, Model, Chain or Residue, and 'Son' standing for either Model, Chain, Residue or Atom, the following are provided:

Iterator types: Father::Sons_const_iterator and Father::Sons_iterator
Functions: Father::sons_begin(), Father::sons_end()

Note that the hierarchy must be respected (e.g., if 'Father' is System, 'Son' can only be Model).
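
To make the Father/Son pattern concrete, here is a minimal, self-contained sketch of a level-by-level traversal. The structs below are toy stand-ins written for this illustration, not the ESBTL types; only the naming scheme (Sons_const_iterator, sons_begin(), sons_end()) follows the description above, and count_atoms is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy stand-ins for the hierarchy levels (NOT the ESBTL classes).
struct Atom { std::string name; };

struct Residue {
  std::string name;
  std::vector<Atom> atoms;
  using Atoms_const_iterator = std::vector<Atom>::const_iterator;
  Atoms_const_iterator atoms_begin() const { return atoms.begin(); }
  Atoms_const_iterator atoms_end() const { return atoms.end(); }
};

struct Chain {
  char id;
  std::vector<Residue> residues;
  using Residues_const_iterator = std::vector<Residue>::const_iterator;
  Residues_const_iterator residues_begin() const { return residues.begin(); }
  Residues_const_iterator residues_end() const { return residues.end(); }
};

struct Model {
  int number;
  std::vector<Chain> chains;
  using Chains_const_iterator = std::vector<Chain>::const_iterator;
  Chains_const_iterator chains_begin() const { return chains.begin(); }
  Chains_const_iterator chains_end() const { return chains.end(); }
};

struct System {
  std::vector<Model> models;
  using Models_const_iterator = std::vector<Model>::const_iterator;
  Models_const_iterator models_begin() const { return models.begin(); }
  Models_const_iterator models_end() const { return models.end(); }
};

// Walk the hierarchy one level at a time: a System only exposes its Models,
// a Model only its Chains, and so on down to the Atoms.
std::size_t count_atoms(const System& system) {
  std::size_t n = 0;
  for (auto m = system.models_begin(); m != system.models_end(); ++m)
    for (auto c = m->chains_begin(); c != m->chains_end(); ++c)
      for (auto r = c->residues_begin(); r != c->residues_end(); ++r)
        for (auto a = r->atoms_begin(); a != r->atoms_end(); ++a)
          ++n;
  return n;
}
```

Each loop only asks a 'Father' for its direct 'Sons', which is the constraint stated in the note above.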

Line Selectors

Line selectors allow filtering atoms during the parsing stage. Each selection can be stored within a different system. ESBTL provides ESBTL::Generic_line_selector for custom filtering based on atom properties.
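
The idea of routing each selection to its own system during parsing can be sketched as follows. This is an illustration under stated assumptions, not the ESBTL implementation: Parsed_line, Protein_water_selector and dispatch are hypothetical names, and returning a system index (or a discard value) from keep() is a convention chosen for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, already-parsed view of one ATOM/HETATM line (not an ESBTL type).
struct Parsed_line {
  bool is_hetatm;
  std::string residue_name;
  std::string atom_name;
};

// Toy selector: keep() returns the index of the system that receives the atom,
// or discard() when the atom is filtered out.
struct Protein_water_selector {
  static constexpr int discard() { return -1; }
  static constexpr int max_nb_systems() { return 2; }
  int keep(const Parsed_line& line) const {
    if (!line.is_hetatm) return 0;             // system 0: ATOM records
    if (line.residue_name == "HOH") return 1;  // system 1: water molecules
    return discard();                          // all other hetero-atoms
  }
};

// Route every line to its system in a single pass over the input.
template <class Selector>
std::vector<std::vector<Parsed_line>> dispatch(const std::vector<Parsed_line>& lines,
                                               const Selector& sel) {
  std::vector<std::vector<Parsed_line>> systems(
      static_cast<std::size_t>(sel.max_nb_systems()));
  for (const Parsed_line& line : lines) {
    const int index = sel.keep(line);
    if (index == Selector::discard()) continue;
    assert(index >= 0 && index < sel.max_nb_systems());
    systems[static_cast<std::size_t>(index)].push_back(line);
  }
  return systems;
}
```

Filtering at this stage means discarded atoms are never stored, rather than being removed from a fully built system afterwards.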

Unified loader

Architecture

The unified loader uses ESBTL for plain PDB files and the libcifpp C++ library for mmCIF files.

T_Unified_molecular_system_loader: A unified loader that automatically selects the best parser based on file extension:
- .pdb, .ent files → ESBTL (fast)
- .cif, .mmcif files → libcifpp

The unified loader dispatches to the appropriate parser based on file extension:

                    ┌─────────────────────────────────────────────────────────┐
                    │           T_Unified_molecular_system_loader             │
                    │  (automatically selects parser based on file extension) │
                    └─────────────────────────────────────────────────────────┘
                                           │
              ┌────────────────────────────┴────────────────────────────┐
              │                                                         │
              ▼                                                         ▼
┌─────────────────────────────┐                         ┌─────────────────────────────┐
│  PDB File (.pdb, .ent)      │                         │  CIF File (.cif, .mmcif)    │
│            │                │                         │            │                │
│            ▼                │                         │            ▼                │
│  ESBTL Line_reader          │                         │  libcifpp cif::file         │
│            │                │                         │            │                │
│            ▼                │                         │            ▼                │
│  ESBTL::Molecular_system    │                         │  cif::mm::structure         │
└─────────────────────────────┘                         └─────────────────────────────┘
              │                                                         │
              └────────────────────────────┬────────────────────────────┘
                                           │
                                           ▼
                            ┌─────────────────────────────┐
                            │  SBL::CSB::Molecular_system │
                            └─────────────────────────────┘
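
The dispatch step in the diagram above can be sketched as a small function. This is a hedged illustration only: Parser_kind and select_parser are hypothetical names, and the case-insensitive comparison is an assumption, since the text does not say how the actual loader treats upper-case extensions.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

enum class Parser_kind { Esbtl, Libcifpp, Unknown };

// Choose a parser from the file extension.
// Assumption: extensions are compared case-insensitively.
Parser_kind select_parser(const std::string& path) {
  const std::size_t dot = path.find_last_of('.');
  const std::size_t slash = path.find_last_of("/\\");
  // No dot, or the last dot belongs to a directory name: no extension.
  if (dot == std::string::npos || (slash != std::string::npos && dot < slash))
    return Parser_kind::Unknown;

  std::string ext = path.substr(dot);
  std::transform(ext.begin(), ext.end(), ext.begin(),
                 [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

  if (ext == ".pdb" || ext == ".ent") return Parser_kind::Esbtl;
  if (ext == ".cif" || ext == ".mmcif") return Parser_kind::Libcifpp;
  return Parser_kind::Unknown;
}
```

How unknown extensions are handled by the real loader is not specified here; the sketch simply reports them as Unknown.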

Integration with Molecular Covalent Structure

The T_Molecular_covalent_structure_loader now inherits from T_Unified_molecular_system_loader, enabling automatic ESBTL-based parsing for PDB files throughout the MCS loading pipeline:

T_Protein_representation_loader / T_Biomolecule_representation_loader
            ▲
            │
T_Molecular_covalent_structure_loader (builds MCS using appropriate builder)
            ▲
            │
T_Unified_molecular_system_loader (base class - uses ESBTL for PDB, libcifpp for CIF)

Loader Options

The unified molecular system loader provides the following command-line options:

Molecular system loader:
  -f [ --filename ] arg                 Input file(s) - PDB (.pdb, .ent) or CIF (.cif, .mmcif)
  --water                               Load water molecules (default: false)
  --hetatoms                            Load hetero-atoms (default: false)
  --hydrogens                           Load hydrogen atoms (default: false)
  -a [ --alternate ] arg (= )           Alternate location to use (default: first found)
  -p [ --occupancy-policy ] arg (=3)    Occupancy policy: 1=all, 2=none, 3=max (default), 4=min
  -B [ --B-factor-limit ] arg           Temperature factor limit (default: no limit)
  --load-chains arg                     Chains to load (e.g., A,B,C)
  --load-models arg                     Models to load (e.g., 1,2,3)
  --use-esbtl                           Force ESBTL parser for all files (PDB format only)
  --use-cifpp                           Force libcifpp parser for all files

Example using the Unified Loader

#include <SBL/IO/Unified_molecular_system_loader.hpp>
 
// The unified loader automatically uses ESBTL for .pdb files
SBL::IO::Unified_molecular_system_loader loader;
loader.set_loaded_file_paths({"protein.pdb"});
loader.load(true, std::cout);
 
// Access the loaded molecular systems
auto& systems = loader.get_molecular_systems();

Occupancy Policies

Background: What is Occupancy?

In a PDB file, occupancy (columns 55-60) is a value between 0.0 and 1.0 giving the fraction of molecules in the crystal in which an atom occupies a particular position. When an atom has multiple possible positions (disorder), each position is recorded with:

An alternate location identifier (column 17): a letter like 'A', 'B', 'C'
An occupancy value: the fraction for that position (all alternates should sum to 1.0)

             atom name
             |  alternate location
             |  |
             v  v
ATOM    145  CA AALA A  20      10.123  20.456  30.789  0.60 15.00           C
ATOM    146  CA BALA A  20      10.456  20.123  30.456  0.40 18.00           C
                                                         ^     ^
                                                   occupancy  B-factor
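
Because the PDB format is fixed-width, the fields shown above can be read directly from their columns. The following is a minimal sketch, not the ESBTL parser: Atom_fields, trim and parse_atom_fields are hypothetical names, and the column ranges for the atom name (13-16) and the B-factor (61-66) are taken from the PDB format specification rather than from the text above.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>

// Hypothetical record holding the fields annotated above.
struct Atom_fields {
  std::string atom_name;  // columns 13-16
  char alt_loc;           // column 17 (' ' when there is no alternate location)
  double occupancy;       // columns 55-60
  double b_factor;        // columns 61-66
};

// Remove leading and trailing blanks from a fixed-width field.
std::string trim(const std::string& s) {
  const std::size_t first = s.find_first_not_of(' ');
  if (first == std::string::npos) return "";
  const std::size_t last = s.find_last_not_of(' ');
  return s.substr(first, last - first + 1);
}

// Extract the fields from an ATOM/HETATM line. PDB columns are 1-based,
// so column c corresponds to string index c - 1.
Atom_fields parse_atom_fields(const std::string& line) {
  assert(line.size() >= 66);  // the line must reach the end of the B-factor field
  Atom_fields f;
  f.atom_name = trim(line.substr(12, 4));
  f.alt_loc = line[16];
  f.occupancy = std::stod(line.substr(54, 6));
  f.b_factor = std::stod(line.substr(60, 6));
  return f;
}
```

Applied to the first line of the example above, this yields atom name CA, alternate location A, occupancy 0.60 and B-factor 15.00.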

General Principle

Occupancy policies apply ONLY to atoms with multiple alternate positions. Atoms with a single position (no alternate location) are always kept, regardless of their occupancy value.

Available Policies

Both loaders (ESBTL for PDB and libcifpp for CIF) use the same policy enumeration and apply policies consistently to atoms with multiple alternate positions:

Policy    Value   Behavior
OP_ALL    1       Keep all alternate positions (no filtering)
OP_NONE   2       Discard all alternate positions
OP_MAX    3       Keep only the alternate with the highest occupancy (ties broken by B-factor); this is the default
OP_MIN    4       Keep only the alternate with the lowest occupancy (ties broken by B-factor)

Note: Atoms with only ONE position (no alternate location) are always kept, regardless of the policy or their occupancy value. The policies only affect the selection among multiple alternate positions of the same atom.

Implementation Details

ESBTL implements occupancy policies as template classes that only act on atoms with multiple alternate positions:

Accept_all_occupancy_policy → OP_ALL: keeps all alternates
Accept_none_occupancy_policy → OP_NONE: discards all alternates
Max_occupancy_policy → OP_MAX: keeps highest occupancy alternate
Min_occupancy_policy → OP_MIN: keeps lowest occupancy alternate

SBL's CIF loader (T_Molecular_system_loader) implements the same policies by filtering atoms based on label_alt_id (alternate location identifier in CIF format).

Example

Consider this structure with two alternate positions for CA:

ATOM    145  CA AALA A  20      10.123  20.456  30.789  0.60 15.00           C
ATOM    146  CA BALA A  20      10.456  20.123  30.456  0.40 18.00           C

With --occupancy-policy 1 (ALL), both PDB and CIF loaders will keep BOTH atoms (all alternate positions are preserved).

With --occupancy-policy 2 (NONE), both PDB and CIF loaders will discard BOTH atoms (atoms 145 and 146) because they have alternate locations. Only atoms with a single position (no alternate location indicator) are kept.

With --occupancy-policy 3 (MAX), both PDB and CIF loaders will keep only the atom with the highest occupancy: 0.60 (alternate A), discarding the one with 0.40 (alternate B).

With --occupancy-policy 4 (MIN), both PDB and CIF loaders will keep only the atom with the lowest occupancy: 0.40 (alternate B), discarding the one with 0.60 (alternate A).

Note: When multiple alternates have the same occupancy value, the B-factor is used as a tie-breaker (lowest B-factor wins) for both PDB and CIF loaders.

However, for an atom with only one position:

ATOM    147  CB  ALA A  20      11.000  21.000  31.000  0.80 12.00           C

This atom is always kept regardless of the policy, even though its occupancy is 0.80 < 1.0. The occupancy policy only applies when choosing among multiple alternate positions.
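
The cases walked through in this example can be reproduced with a short, self-contained sketch. It is not the code of either loader: Alternate and apply_policy are hypothetical names, the enumeration values are copied from the policy table, and the selection logic is simply one way to implement the behavior described in the text (including the lowest-B-factor tie-break).

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Values mirror the policy table; everything else in this sketch is hypothetical.
enum Occupancy_policy { OP_ALL = 1, OP_NONE = 2, OP_MAX = 3, OP_MIN = 4 };

struct Alternate {
  char alt_loc;
  double occupancy;
  double b_factor;
};

// Select among the recorded positions of ONE atom.
std::vector<Alternate> apply_policy(const std::vector<Alternate>& positions,
                                    Occupancy_policy policy) {
  // An atom with a single position is always kept, whatever the policy.
  if (positions.size() <= 1 || policy == OP_ALL) return positions;
  if (policy == OP_NONE) return {};

  // Rank by occupancy (highest first for OP_MAX, lowest first for OP_MIN);
  // on equal occupancy, the lowest B-factor ranks first.
  auto ranks_before = [policy](const Alternate& a, const Alternate& b) -> bool {
    if (a.occupancy != b.occupancy)
      return (policy == OP_MAX) ? (a.occupancy > b.occupancy)
                                : (a.occupancy < b.occupancy);
    return a.b_factor < b.b_factor;
  };
  return {*std::min_element(positions.begin(), positions.end(), ranks_before)};
}
```

With the two CA alternates above, OP_MAX keeps alternate A and OP_MIN keeps alternate B, while the single-position CB atom is returned unchanged under every policy.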