Structural Bioinformatics Library
Template C++ / Python API for developping structural bioinformatics applications.
User Manual

Kpax

Authors: F. Cazals and R. Tetley

Goals and algorithm overview

Goals

The $\text{\kpax}$ algorithm is a reference structural aligner developed by the late Dave Ritchie [151] . This package provides a generic implementation of this algorithm. This generic implementation is instantiated for the case of proteins, but other biomolecules could easily be handled.

Algorithm overview

In short, having computed an initial 3D superposition of the two structures based on so-called seeds, Kpax performs a number of iterations involving two steps [151] :

  • run a dynamic programming step to compute an alignment, based on a score $\ktotij$ measuring the similarity between two residues.
  • using the alignment obtained, computes a novel rigid motion to superimpose the two structures.

To get insights on the similarity measure used, we only recall the definition of the score $\ktotij$, based on the following three terms:

  • $\klocij$ : a local similarity for a pair of residues, calculated in a common coordinate frame. More precisely, this term is calculated as a Gaussian of the weighted sum of the squared distances between pairs of up-stream and down-stream $\Calpha$ atoms.
  • $\kcomij$: a term measuring the spatial similarity with respect to the relative position of the center of mass (COM) of a protein (also calculated in the same frame).
  • $\kbloij$: the Blosum62 amino acid similarity score

With these, the total Kpax score for two residues of the two structures to be aligned reads as

\begin{equation} \ktotij = w_l \klocij + w_s \kcomij + w_b \kbloij, w_l + w_s + w_b =1. \end{equation}

Because Kpax uses the previous score and not the lRMSD, for a given alignment length, the optimal alignment obtained may not be that minimizing the lRMSD.


An important implementation detail for Kpax is the following. The aforementioned iterative process starts from an initial alignment computed from scores encoding so-called canonized fragments. If this alignment is too small (current threshold: 4 a.a.), seeds to start the iterative process cannot be found. In that case, a predefined number of random seeds is computed. A random seed is a crossing free random alignment between the two sequences, whose length is set to the size of the smallest protein.


Using Kpax

Input: Specifications and File Types

The main input consists of two PDB/mmCIF files.

Output: Specifications and File Types

Example call:

sbl-kpax-iterative-alignment.exe -f data/1anx.pdb -f data/2ran.pdb --load-chains A --load-chains A -d output-directory --log

The output consists of the following files:

  • sbl-kpax__alignment.txt : the file listing the residues aligned (chain id, resid). Example from the alignment of an helix and a strand
    A 1 A 1
    A 2 A 2
    A 4 A 3
    A 5 A 4
    
  • sbl-kpax__alignment.dot : a dot file to visualize the alignment with graphviz.
  • sbl-kpax__statistics.txt :
  • sbl-kpax__log.txt : with option –log, the log file of the execution.
  • sbl-kpax__scores.txt : file containing two scores for each alignment encountered when running Kpax. The alignments represented as 2D points (lRMSD, GScore) to compute the Pareto envelope
  • sbl-kpax__statistics.txt : various statistics on the execution.
For the file sbl-kpax__alignment.dot : the function SBL::CSB::T_Alignment_engine<SequenceOrStructure, AlignerAlgorithm, FT>::print_alignment_dot() uses the get_name() function of the aligned residue, which is (currently) the one letter code of the amino acid.


Visualization, Plugins, GUIs

Use the option –alignment-viewer to obtain visualization files for VMD or pymol.

Algorithms and Methods

Implementation and workflow

The Kpax implementation is non trivial, as it requires manipulating sequences, structures, and alignments. Part of the complexity comes from the fact that the classes used for Kpax are also used to compute molecular distances (package Molecular_distances_flexible) and search structural motifs (package Structural_motifs).

The overall architecture is as follows:

  • the input structures are encoded into structures for Kpax providing the requires pieces of information,
  • the Kpax algorithm (aligner level) is encapsulated within an alignment engine (the engine level),
  • the main steps of the algorithm are handle by a workflow of modules.

More specifically:

  • The structure level, implemented in the package Kpax itself.
  • The Module/workflow level, which assemble all pieces. The main classes are SBL::CSB::T_Alignment_structures_module and SBL::CSB::T_Iterative_alignment_workflow .

Let us now detail these components.

The structure level: structure for Kpax

Consider the problem of aligning two structures structure_1 and structure_2, respectively containing $n_1$ and $n_2$ units (amino acids for proteins).


Structures.

  • Class T_Alignment_residue: contains the original residue; plus the canonized residue, the coordinates of the CA carbon, and the so-called virtual atoms used to compute the score $\kcomij$.
  • Class T_Structure_from_alignment_residue: inherits from a vector<T_Alignment_residue> and stores in its vector all residues which have a CA carbon. Provides in particular push_back(Alignment_residue(r)) which creates an Alignment_residue for a given residue NB: r.residue_sequence_number() to get the residue id
  • Class T_Structure_for_kpax : inherits from T_Structure_from_alignment_residue
  • Class T_Structure_for_kpax_builder : builds a Structure_for_kpax by collecting residues with a CA, see above.

As explained below, T_Structure_for_kpax is used to instantiate SBL::CSB::T_Iterative_aligner<class StructureType, class SeederType, class ScoreComputer>


Constructing structures from PDB/mmCIF files. To load a PDB/mmCIF files, one iterates on all chains of a protein, collects the $\Calpha$ carbons (from all chains), builds the canonized representations, and collect distances from each $\Calpha$ carbon to neighbors.

The resulting data structure is an instance of T_Structure_for_kpax, which, as explained above, is a vector of T_Alignment_residue.

Two comments are in order:

  • Because the loading step processes all chains, use option –load-chains to specify the subsets of chains to be loaded, one subset for each file.
  • The the alignment will record pairs of indices (in $R_1=[0, n_1-1]$ for structure_1, and in $R_2=[0, n_2-1]$ for structure_2).


Dumping alignments. As just explained, an alignment is recorded as a list of aligned residues $(i_1 \in R_1, i_2 \in R_2)$.

On the other hand, the alignment structure is a vector of alignment residues, such that each contains its original residue.

Therefore, the refs of each aligned residue are easily retrieved from the original residue. e.g., the resid (protein sequence ids) are directly as structure_1.unit_id() or structure_1.residue_sequence_number() .

ref modules below

The aligner level

Iterative_alignment-package: a generic framework for iterative aligners, providing in particular the generic implementation of Kpax. Provides in particular the classes SBL::CSB::T_Iterative_aligner and SBL::CSB::T_Kpax .

The alignment engine level

The engine level combines the algorithm (aligner) and all the calculation of all statistics of interest. Note in particular the following inheritances hierarchy:

SBL::CSB::T_Alignment_engine_structures_kpax inherits from SBL::CSB::T_Alignment_engine_structures inherits from SBL::CSB::T_Alignment_engine

Importantly, the class SBL::CSB::T_Alignment_engine has two data members of type StructureType, named m_sos_1 and m_sos_2 (for my sequence_or_structure).

Traits for Kpax

The traits class used to instantiate Kpax is provided by the package Kpax itself. It defines types that may be ascribed to two groups:

  • Types used to represent proteins, polypeptide chains, etc.
  • The alignment engine itself, namely:
    typedef SBL::CSB::T_Alignment_engine_structures_kpax<Structure>          Alignment_engine;
    

The module and workflow level

An application in the SBL is graph connecting modules, which inherit from the base class Module_base .

Since Kpax addresses an alignment problem, the generic workflow used is SBL::CSB::T_Iterative_alignment_workflow .

The module for Kpax is SBL::CSB::T_Alignment_structures_module, which implements the functions

  • initialize() : the main step consists of loading the input files and constructing the instances of T_Structure_for_kpax_builder .
  • run() : the main step consists of calling the T_Alignment_engine::align() function, which itself calls the operator() of the algorithm (the Kpax algorithm here).
Other important packags used are Protein_representation to represent polypeptide proteins and chains, Molecular_covalent_structure to handle the covalent structures, and Point_cloud_rigid_registration_3 to compute of lRMSD and associated rigid motions.


Jupyter demo

See the following jupyter demo:

  • Jupyter notebook file
  • Kpax