Kpax

Authors: F. Cazals and R. Tetley

Goals and algorithm overview

Goals

The $\text{\kpax}$ algorithm is a reference structural aligner developed by the late Dave Ritchie [151] . This package provides a generic implementation of this algorithm. This generic implementation is instantiated for the case of proteins, but other biomolecules could easily be handled.

Algorithm overview

In short, having computed an initial 3D superposition of the two structures based on so-called seeds, Kpax performs a number of iterations involving two steps [151] :

run a dynamic programming step to compute an alignment, based on a score $\ktotij$ measuring the similarity between two residues.

using the alignment obtained, computes a novel rigid motion to superimpose the two structures.

To get insights on the similarity measure used, we only recall the definition of the score $\ktotij$ , based on the following three terms:

$\klocij$ : a local similarity for a pair of residues, calculated in a common coordinate frame. More precisely, this term is calculated as a Gaussian of the weighted sum of the squared distances between pairs of up-stream and down-stream $\Calpha$ atoms.

$\kcomij$ : a term measuring the spatial similarity with respect to the relative position of the center of mass (COM) of a protein (also calculated in the same frame).

$\kbloij$ : the Blosum62 amino acid similarity score

With these, the total Kpax score for two residues of the two structures to be aligned reads as

$\begin{equation} \ktotij = w_l \klocij + w_s \kcomij + w_b \kbloij, w_l + w_s + w_b =1. \end{equation}$

Because Kpax uses the previous score and not the lRMSD, for a given alignment length, the optimal alignment obtained may not be that minimizing the lRMSD.

An important implementation detail for Kpax is the following. The aforementioned iterative process starts from an initial alignment computed from scores encoding so-called canonized fragments. If this alignment is too small (current threshold: 4 a.a.), seeds to start the iterative process cannot be found. In that case, a predefined number of random seeds is computed. A random seed is a crossing free random alignment between the two sequences, whose length is set to the size of the smallest protein.

Using Kpax

Input: Specifications and File Types

The main input consists of two PDB/mmCIF files.

Output: Specifications and File Types

Example call:

sbl-kpax-iterative-alignment.exe -f data/1anx.pdb -f data/2ran.pdb --load-chains A --load-chains A -d output-directory --log

The output consists of the following files:

sbl-kpax__alignment.txt : the file listing the residues aligned (chain id, resid). Example from the alignment of an helix and a strand
```
A 1 A 1
A 2 A 2
A 4 A 3
A 5 A 4
```

sbl-kpax__alignment.dot : a dot file to visualize the alignment with graphviz.

sbl-kpax__statistics.txt :

sbl-kpax__log.txt : with option –log, the log file of the execution.

sbl-kpax__scores.txt : file containing two scores for each alignment encountered when running Kpax. The alignments represented as 2D points (lRMSD, GScore) to compute the Pareto envelope

sbl-kpax__statistics.txt : various statistics on the execution.

For the file sbl-kpax__alignment.dot : the function SBL::CSB::T_Alignment_engine<SequenceOrStructure, AlignerAlgorithm, FT>::print_alignment_dot() uses the get_name() function of the aligned residue, which is (currently) the one letter code of the amino acid.

Visualization, Plugins, GUIs

Use the option –alignment-viewer to obtain visualization files for VMD or pymol.

Algorithms and Methods

Implementation and workflow

The Kpax implementation is non trivial, as it requires manipulating sequences, structures, and alignments. Part of the complexity comes from the fact that the classes used for Kpax are also used to compute molecular distances (package Molecular_distances_flexible) and search structural motifs (package Structural_motifs).

The overall architecture is as follows:

the input structures are encoded into structures for Kpax providing the requires pieces of information,
the Kpax algorithm (aligner level) is encapsulated within an alignment engine (the engine level),
the main steps of the algorithm are handle by a workflow of modules.

More specifically:

The structure level, implemented in the package Kpax itself.

The aligner level, package Iterative_alignment-package: a generic framework for iterative aligners, providing in particular the generic implementation of Kpax. Provides in particular the classes SBL::CSB::T_Iterative_aligner, SBL::CSB::T_Kpax , and SBL::CSB::T_Canonize_conformation (to place a given amino acid into a canonical conformation).

The engine level, package Alignment_engines : combines alignment algorithms and the associated statistics. Provides in particular the generic classes SBL::CSB::T_Alignment_engine (which contains the main method align()), SBL::CSB::T_Alignment_engine_structures, and SBL::CSB::T_Alignment_engine_structures_kpax.

The Module/workflow level, which assemble all pieces. The main classes are SBL::CSB::T_Alignment_structures_module and SBL::CSB::T_Iterative_alignment_workflow .

Let us now detail these components.

The structure level: structure for Kpax

Consider the problem of aligning two structures structure_1 and structure_2, respectively containing and units (amino acids for proteins).

Structures.

Class T_Alignment_residue: contains the original residue; plus the canonized residue, the coordinates of the CA carbon, and the so-called virtual atoms used to compute the score $\kcomij$ .

Class T_Structure_from_alignment_residue: inherits from a vector<T_Alignment_residue> and stores in its vector all residues which have a CA carbon. Provides in particular push_back(Alignment_residue(r)) which creates an Alignment_residue for a given residue NB: r.residue_sequence_number() to get the residue id

Class T_Structure_for_kpax : inherits from T_Structure_from_alignment_residue

Class T_Structure_for_kpax_builder : builds a Structure_for_kpax by collecting residues with a CA, see above.

As explained below, T_Structure_for_kpax is used to instantiate SBL::CSB::T_Iterative_aligner<class StructureType, class SeederType, class ScoreComputer>

Constructing structures from PDB/mmCIF files. To load a PDB/mmCIF files, one iterates on all chains of a protein, collects the $\Calpha$ carbons (from all chains), builds the canonized representations, and collect distances from each $\Calpha$ carbon to neighbors.

The resulting data structure is an instance of T_Structure_for_kpax, which, as explained above, is a vector of T_Alignment_residue.

Two comments are in order:

Because the loading step processes all chains, use option –load-chains to specify the subsets of chains to be loaded, one subset for each file.

The the alignment will record pairs of indices (in for structure_1, and in for structure_2).

Dumping alignments. As just explained, an alignment is recorded as a list of aligned residues $(i_1 \in R_1, i_2 \in R_2)$ .

On the other hand, the alignment structure is a vector of alignment residues, such that each contains its original residue.

Therefore, the refs of each aligned residue are easily retrieved from the original residue. e.g., the resid (protein sequence ids) are directly as structure_1.unit_id() or structure_1.residue_sequence_number() .

ref modules below

The aligner level

Iterative_alignment-package: a generic framework for iterative aligners, providing in particular the generic implementation of Kpax. Provides in particular the classes SBL::CSB::T_Iterative_aligner and SBL::CSB::T_Kpax .

The alignment engine level

The engine level combines the algorithm (aligner) and all the calculation of all statistics of interest. Note in particular the following inheritances hierarchy:

SBL::CSB::T_Alignment_engine_structures_kpax inherits from SBL::CSB::T_Alignment_engine_structures inherits from SBL::CSB::T_Alignment_engine

Importantly, the class SBL::CSB::T_Alignment_engine has two data members of type StructureType, named m_sos_1 and m_sos_2 (for my sequence_or_structure).

Traits for Kpax

The traits class used to instantiate Kpax is provided by the package Kpax itself. It defines types that may be ascribed to two groups:

Types used to represent proteins, polypeptide chains, etc.

The alignment engine itself, namely:

typedef SBL::CSB::T_Alignment_engine_structures_kpax<Structure>          Alignment_engine;

The module and workflow level

An application in the SBL is graph connecting modules, which inherit from the base class Module_base .

Since Kpax addresses an alignment problem, the generic workflow used is SBL::CSB::T_Iterative_alignment_workflow .

The module for Kpax is SBL::CSB::T_Alignment_structures_module, which implements the functions

initialize() : the main step consists of loading the input files and constructing the instances of T_Structure_for_kpax_builder .

run() : the main step consists of calling the T_Alignment_engine::align() function, which itself calls the operator() of the algorithm (the Kpax algorithm here).

report() : dumps the files mentioned in section Output: Specifications and File Types.

Other important packags used are Protein_representation to represent polypeptide proteins and chains, Molecular_covalent_structure to handle the covalent structures, and Point_cloud_rigid_registration_3 to compute of lRMSD and associated rigid motions.

Jupyter demo

See the following jupyter demo:

Jupyter notebook file
Kpax

Iterative_alignment with Kpax¶
Example : Aligning two polypeptidic chains¶
Options.¶
The main options of the align method in the next cell are:

pdb1: Path to the first PDB file

pdb2: Path to the second PDB file

chains1: Chains to load on the first PDB file (none to load all of them)

chains2: Chains to load on the second PDB file (none to load all of them)
In [3]:
import re #regular expressions import sys #misc system import os import pdb import shutil # python 3 only def align(pdb1, pdb2, chains1 = None, chains2 = None): odir = "tmp-results" if os.path.exists(odir): os.system("rm -rf %s" % odir) os.system( ("mkdir %s" % odir) ) # check executable exists and is visible exe = shutil.which("sbl-kpax-iterative-alignment.exe") if exe: print(("Using executable %s\n" % exe)) cmd = "%s -f %s -f %s -v -l -d %s" %\ (exe, pdb1, pdb2, odir) if chains1 != None: cmd += " --chains %s" % chains1 if chains2 != None: cmd += " --chains %s" % chains2 print(cmd) os.system(cmd) cmd = "ls %s" % odir ofnames = os.popen(cmd).readlines() print("All output files:",ofnames) #find the log file and display log file cmd = "find %s -name *log.txt" % odir lines = os.popen(cmd).readlines() if len(lines) > 0: lfname = lines[0].rstrip() print("Log file is:", lfname) log = open(lfname).readlines() for line in log: print(line.rstrip()) else: print("Executable not found")
In [4]:
# Simple example with 2 helices print("Marker : Calculation Started") align("data/theor-helix-5.pdb","data/theor-helix-5.pdb") print("Marker : Calculation Ended")
Marker : Calculation Started Using executable /user/fcazals/home/projects/proj-soft/sbl-install/bin/sbl-kpax-iterative-alignment.exe /user/fcazals/home/projects/proj-soft/sbl-install/bin/sbl-kpax-iterative-alignment.exe -f data/theor-helix-5.pdb -f data/theor-helix-5.pdb -v -l -d tmp-results All output files: ['sbl-kpax__alignment.dot\n', 'sbl-kpax__alignment.txt\n', 'sbl-kpax__log.txt\n'] Log file is: tmp-results/sbl-kpax__log.txt Running: /user/fcazals/home/projects/proj-soft/sbl-install/bin/sbl-kpax-iterative-alignment.exe -f data/theor-helix-5.pdb -f data/theor-helix-5.pdb -v -l -d tmp-results Protein representation loader Statistics: Number of loaded PDB files: 2 Details for each file: -- file 1 data/theor-helix-5.pdb: -- -- File exists: Yes -- -- File is valid: Yes -- -- Number of models: 1 -- -- Number of loaded atoms (of model with lowest index): 25 -- -- Number of loaded SS bonds: 0 -- -- Number of discarded atoms based on model id (all models): 0 -- -- Number of discarded atoms based on chain id (all models): 0 -- -- Number of discarded atoms based on b factor beeing too high (all models): 0 -- -- Number of discarded waters (all models): 0 -- -- Number of discarded hetatoms (all models): 0 -- -- Number of discarded hydrogens (all models): 0 -- -- Number of discarded alternate location atoms (all models): 0 -- -- Number of actually discarded atoms (all models): 0 -- file 2 data/theor-helix-5.pdb: -- -- File exists: Yes -- -- File is valid: Yes -- -- Number of models: 1 -- -- Number of loaded atoms (of model with lowest index): 25 -- -- Number of loaded SS bonds: 0 -- -- Number of discarded atoms based on model id (all models): 0 -- -- Number of discarded atoms based on chain id (all models): 0 -- -- Number of discarded atoms based on b factor beeing too high (all models): 0 -- -- Number of discarded waters (all models): 0 -- -- Number of discarded hetatoms (all models): 0 -- -- Number of discarded hydrogens (all models): 0 -- -- Number of discarded alternate location atoms (all models): 0 -- -- Number of actually discarded atoms (all models): 0 Covalent Structure File Loader statistics: -- -- Number of molecular covalent structures: 2 -- Molecular covalent structure 0: -- -- Number of particles: 52 -- -- Number of embedded particles: 25 -- -- Number of bonds: 51 -- -- Number of embedded bonds: 24 -- -- Number of disulfide bonds: 0 -- Molecular covalent structure 1: -- -- Number of particles: 52 -- -- Number of embedded particles: 25 -- -- Number of bonds: 51 -- -- Number of embedded bonds: 24 -- -- Number of disulfide bonds: 0 Protein Representation Loader statistics: Number of loaded proteins: 2 Details for each protein : -- structure 1: -- -- Number of chains: 1 -- -- Number of residues: 5 -- -- Number of particles: 52 -- -- Number of embedded particles: 25 -- -- Number of bonds: 51 -- -- Number of embedded bonds: 24 -- structure 2: -- -- Number of chains: 1 -- -- Number of residues: 5 -- -- Number of particles: 52 -- -- Number of embedded particles: 25 -- -- Number of bonds: 51 -- -- Number of embedded bonds: 24 Iterative alignment Starting module SBL::Modules::T_Alignment_structures_module... +++Seeder score: 5.000000 +++Performing iterative alignment with 1 seeds... (non id case)Iter 1 out of 1 with seed of size 5 (i, lRMSD, size) (0, 0.00000, 4) Converged to a 0-cycle after 2 iterations Decreasing lRMSDs true Statistics: Alignment length: 4 Alignment score : 4.000000 Alignment identity percentage: 1.000000 Alignment similarity percentage: 1.000000 Alignment lRMSD : 0.000000 Alignment dRMSD : 0.000000 Report... Ending module SBL::Modules::T_Alignment_structures_module... End Run General Statistics: Times elapsed for computations (in seconds): -- Iterative alignment: 0.001094 Total: 0.001094 Marker : Calculation Ended
Seeder_DP_score:alignment size:5
In [5]:
# Example from the molecular motions DB, see http://www0.molmovdb.org/cgi-bin/motion.cgi?ID=anxtrp # Motion in Annexin V (Trp motion) [anxtrp] # 1ANX, 1AVR, 2RAN print("Marker : Calculation Started") #align("data/theor-helix-5.pdb","data/theor-helix-5.pdb") align("data/1anx.pdb","data/2ran.pdb") print("Marker : Calculation Ended")
Marker : Calculation Started Using executable /user/fcazals/home/projects/proj-soft/sbl-install/bin/sbl-kpax-iterative-alignment.exe /user/fcazals/home/projects/proj-soft/sbl-install/bin/sbl-kpax-iterative-alignment.exe -f data/1anx.pdb -f data/2ran.pdb -v -l -d tmp-results
Seeder_DP_score:alignment size:326
All output files: ['sbl-kpax__alignment.dot\n', 'sbl-kpax__alignment.txt\n', 'sbl-kpax__log.txt\n'] Log file is: tmp-results/sbl-kpax__log.txt Running: /user/fcazals/home/projects/proj-soft/sbl-install/bin/sbl-kpax-iterative-alignment.exe -f data/1anx.pdb -f data/2ran.pdb -v -l -d tmp-results Protein representation loader Statistics: Number of loaded PDB files: 2 Details for each file: -- file 1 data/1anx.pdb: -- -- File exists: Yes -- -- File is valid: Yes -- -- Number of models: 1 -- -- Number of loaded atoms (of model with lowest index): 7437 -- -- Number of loaded SS bonds: 0 -- -- Number of discarded atoms based on model id (all models): 0 -- -- Number of discarded atoms based on chain id (all models): 0 -- -- Number of discarded atoms based on b factor beeing too high (all models): 0 -- -- Number of discarded waters (all models): 418 -- -- Number of discarded hetatoms (all models): 445 -- -- Number of discarded hydrogens (all models): 0 -- -- Number of discarded alternate location atoms (all models): 91 -- -- Number of actually discarded atoms (all models): 445 -- file 2 data/2ran.pdb: -- -- File exists: Yes -- -- File is valid: Yes -- -- Number of models: 1 -- -- Number of loaded atoms (of model with lowest index): 2487 -- -- Number of loaded SS bonds: 0 -- -- Number of discarded atoms based on model id (all models): 0 -- -- Number of discarded atoms based on chain id (all models): 0 -- -- Number of discarded atoms based on b factor beeing too high (all models): 0 -- -- Number of discarded waters (all models): 108 -- -- Number of discarded hetatoms (all models): 125 -- -- Number of discarded hydrogens (all models): 573 -- -- Number of discarded alternate location atoms (all models): 0 -- -- Number of actually discarded atoms (all models): 698 Covalent Structure File Loader statistics: -- -- Number of molecular covalent structures: 2 -- Molecular covalent structure 0: -- -- Number of particles: 15324 -- -- Number of embedded particles: 7437 -- -- Number of bonds: 15417 -- -- Number of embedded bonds: 7530 -- -- Number of disulfide bonds: 0 -- Molecular covalent structure 1: -- -- Number of particles: 5052 -- -- Number of embedded particles: 2487 -- -- Number of bonds: 5081 -- -- Number of embedded bonds: 2516 -- -- Number of disulfide bonds: 0 Protein Representation Loader statistics: Number of loaded proteins: 2 Details for each protein : -- structure 1: -- -- Number of chains: 3 -- -- Number of residues: 948 -- -- Number of particles: 15324 -- -- Number of embedded particles: 7437 -- -- Number of bonds: 15417 -- -- Number of embedded bonds: 7530 -- structure 2: -- -- Number of chains: 1 -- -- Number of residues: 316 -- -- Number of particles: 5052 -- -- Number of embedded particles: 2487 -- -- Number of bonds: 5081 -- -- Number of embedded bonds: 2516 Iterative alignment Starting module SBL::Modules::T_Alignment_structures_module... +++Seeder score: 286.733085 +++Performing iterative alignment with 2 seeds... (non id case)Iter 1 out of 2 with seed of size 312 (i, lRMSD, size) (0, 0.53412, 315) Converged to a 0-cycle after 2 iterations Decreasing lRMSDs true (non id case)Iter 2 out of 2 with seed of size 316 (i, lRMSD, size) (0, 0.53412, 315) Converged to a 0-cycle after 2 iterations Decreasing lRMSDs true Statistics: Alignment length: 315 Alignment score : 309.044049 Alignment identity percentage: 0.917460 Alignment similarity percentage: 0.955556 Alignment lRMSD : 0.534116 Alignment dRMSD : 0.449635 Report... Ending module SBL::Modules::T_Alignment_structures_module... End Run General Statistics: Times elapsed for computations (in seconds): -- Iterative alignment: 7.027162 Total: 7.027162 Marker : Calculation Ended
In [ ]:

Table of Contents