Structural Bioinformatics Library
Template C++ / Python API for developing structural bioinformatics applications.
User Manual

Alignment_engines

Authors: F. Cazals and T. Dreyfus and S. Marillet and R. Tetley

Introduction

An alignment engine is an alignment algorithm together with suitable statistics on its output. This package unifies the two main kinds of alignments used in bioinformatics, namely sequence alignments and structural alignments:

  • sequence alignments aim to match units (amino acids or nucleotides) of multiple sequences based only on their positions in the sequences and on some similarity between these units. Typically, a substitution score matrix determines the penalties for matched units that are not identical, and for creating or extending gaps (i.e., runs of units that do not match any unit in the other sequences).

  • structural alignments also consider the geometry: the positions of the atoms play a fundamental role in matching the units of the structures.

Defining a common framework for both allows factoring out common parts, in particular the statistics on the aligned sequences. Note also that this package does not provide any alignment algorithm per se: the sequence and structural aligners are borrowed from external libraries.

Implementation

Alignment engines are designed so that it is easy to wrap an existing sequence or structural alignment algorithm into the engines. In order to do so, we first establish the terminology used in the package, then explain the design choices.

Terminology

An alignment unit, or unit for short, is the smallest biophysical entity to be aligned. In general, a unit is an amino acid or a nucleotide.


A sequence is a succession of units, each unit being characterized by a name and a position in the sequence. The position of a unit in the sequence is also referred to as the index of the unit within the sequence. Indices follow the $0..n-1$ convention. If each unit is also equipped with Cartesian coordinates, it is geometrically embedded and the sequence may also be called a structure.


A SOS is an abstract entity representing either a Sequence Or a Structure.


An aligner is an algorithm aligning multiple SOS. An alignment is the output of an aligner, that is a list of tuples of indices of matching units. Gaps are represented by negative indices (-1 by default).


An alignment engine is a framework that uses an aligner for aligning multiple SOS, and provides useful statistics for analyzing the output of the aligner.


The alignment engines provided by this package handle only pairwise alignments, so that the tuples are pairs of indices.
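To make the terminology concrete, here is a minimal sketch of a pairwise alignment stored as a list of index pairs, with -1 marking a gap. The type and function names (Index_pair, Pairwise_alignment, GAP, count_matched_pairs, count_gap_columns) are hypothetical and chosen for illustration only; they are not the types used inside the SBL.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical representation: one pair of indices per alignment column.
// A negative index (here -1) marks a gap in the corresponding SOS.
using Index_pair = std::pair<int, int>;
using Pairwise_alignment = std::vector<Index_pair>;

constexpr int GAP = -1;

// Number of columns in which a unit of the first SOS faces a unit of the second.
std::size_t count_matched_pairs(const Pairwise_alignment& alignment)
{
    std::size_t matched = 0;
    for (const Index_pair& column : alignment)
        if (column.first >= 0 && column.second >= 0)
            ++matched;
    return matched;
}

// Number of columns containing a gap on either side.
std::size_t count_gap_columns(const Pairwise_alignment& alignment)
{
    return alignment.size() - count_matched_pairs(alignment);
}
```

For instance, aligning ACGT with AGT so that C faces a gap can be written as the four columns (0,0), (1,GAP), (2,1), (3,2): three matched pairs and one gap column.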


Design

The alignment engines are designed so as to be independent of the aligners, meaning that aligners are template parameters of the engines. As a result, the main difference between structural and sequence alignment engines lies in the analysis: structural alignment engines provide geometric analysis in addition to all the analysis provided by the sequence alignment engines.

For this reason, structural alignment engines are seen as specializations of sequence alignment engines. Thus, there are two main classes:

  • SBL::CSB::T_Alignment_engine< SequenceOrStructure , AlignerAlgorithm , FT > : the base class for sequence alignments. The parameter SequenceOrStructure is the representation of a SOS, AlignerAlgorithm is the aligner itself, and FT is the number type used for rendering the analysis.

  • SBL::CSB::T_Alignment_engine_structures< StructureType , AlignerAlgorithm , FT , MolecularDistance > : the base class for structural alignments. The parameter StructureType is the representation of a structure, AlignerAlgorithm is the structural aligner itself, FT is the number type used for rendering the analysis, and MolecularDistance is a distance between two structures for comparing the output alignments.

The different template parameters have a number of requirements :

  • SequenceOrStructure : defines the base types used in the engine (Alignment_unit, Alignment_unit_name for the name of the unit, and Alignment_unit_rep for the index of the unit), and provides an array operator for accessing any unit of the SOS at a given index. The type Alignment_unit must provide the method get_name() returning the name of the unit.

  • StructureType : in addition to the requirements for SequenceOrStructure , the type Alignment_unit has to also provide the methods x(), y() and z() returning the cartesian coordinates of the unit.

  • AlignerAlgorithm : a simple functor taking as argument references to the two SOS to align, and returning the score of the algorithm together with the alignment.

  • FT : there is no particular requirement, and the default type used is double.

  • MolecularDistance : see the package Molecular_distances for a complete description of this parameter. It is set by default to the class SBL::CSB::T_Least_RMSD_cartesian.
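The requirements listed above can be illustrated with toy models of the SequenceOrStructure and AlignerAlgorithm parameters. This is a sketch under stated assumptions: Toy_unit, Toy_sequence and Toy_diagonal_aligner are hypothetical names, the size() method is a convenience not listed in the requirements, and returning the score and the alignment as a std::pair is only one plausible reading of "returning the score of the algorithm together with the alignment". The exact interface expected by the engines is given in the reference manual.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical unit: a one-letter name, exposed through get_name().
struct Toy_unit {
    char name;
    char get_name() const { return name; }
};

// Hypothetical SOS: exposes the three base types and an array operator.
class Toy_sequence {
public:
    typedef Toy_unit Alignment_unit;
    typedef char     Alignment_unit_name;
    typedef int      Alignment_unit_rep;

    explicit Toy_sequence(const std::string& letters)
    {
        for (char c : letters)
            units_.push_back(Toy_unit{c});
    }

    const Alignment_unit& operator[](Alignment_unit_rep i) const
    {
        return units_[static_cast<std::size_t>(i)];
    }

    // Convenience accessor (assumption, not part of the listed requirements).
    std::size_t size() const { return units_.size(); }

private:
    std::vector<Toy_unit> units_;
};

// Hypothetical aligner functor: pairs units position by position, so that
// the trailing units of the longer SOS face gaps (-1). The score is the
// number of identical pairs. This is a placeholder, not a real aligner.
struct Toy_diagonal_aligner {
    typedef std::vector<std::pair<int, int> > Alignment;

    std::pair<double, Alignment> operator()(const Toy_sequence& a,
                                            const Toy_sequence& b) const
    {
        Alignment alignment;
        double score = 0.0;
        const std::size_t n = std::max(a.size(), b.size());
        for (std::size_t i = 0; i < n; ++i) {
            const int ia = i < a.size() ? static_cast<int>(i) : -1;
            const int ib = i < b.size() ? static_cast<int>(i) : -1;
            if (ia >= 0 && ib >= 0 && a[ia].get_name() == b[ib].get_name())
                score += 1.0;
            alignment.emplace_back(ia, ib);
        }
        return std::make_pair(score, alignment);
    }
};
```

Calling Toy_diagonal_aligner()(Toy_sequence("ACGT"), Toy_sequence("AGG")) yields four columns, the last one pairing index 3 with a gap. Keeping the aligner a plain functor is what lets an engine take it as a template parameter without knowing anything about the underlying algorithm.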

Both engines are generic and can handle a variety of aligners. For each alignment engine, a specialization with a given aligner is provided:

  • SBL::CSB::T_Alignment_engine_sequences_seqan< SequenceType , SeqanSequenceConverter , FreeEndsAlignment , ScoreType , SeqanUnitType , SeqanCustomMatrix , SeqanAlgorithm > : specialization of the sequence alignment engine with aligners from the Seqan library. The parameter SequenceType has exactly the same requirements as the parameter SequenceOrStructure described above. The parameter SeqanSequenceConverter is a functor converting the name of a unit in SequenceType to the name of a unit in Seqan (by default, it does nothing). All other parameters are specific to Seqan; we refer the user to the documentation of the Seqan library.

  • SBL::CSB::T_Alignment_engine_structures_apurva< StructureType , FT , MolecularDistance > : specialization of the structural alignment engine with aligners from Apurva [11] . All parameters are described above.

Functionality

There are three sets of functionality :

  • the generic alignment analysis : all engines provide at least the score of the aligner (depending on the aligner algorithm), the percentage of similarity (that is the score obtained by summing the substitution score of paired entities) and identity (that is the percentage of similarity for the identity substitution matrix) of an alignment given an input substitution matrix, and the possibility to view the alignment using Graphviz .

  • the structural alignment analysis : structural alignment engines provide in addition the Distance Difference Matrix (DDM), the dRMSD computed from the DDM, and the molecular distance between the two aligned structures.

  • the modules : structural and sequence alignment engines are embedded in two modules for a simple use within the Module framework of the SBL : SBL::Modules::T_Alignment_sequences_module and SBL::Modules::T_Alignment_structures_module
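As an illustration of the generic alignment analysis, the sketch below sums substitution scores over paired units and derives an identity percentage by plugging in the identity substitution matrix. All names are hypothetical, sequences are plain strings of one-letter codes, and normalizing by the number of gap-free columns is an assumption: the engines may normalize differently.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

using Pairwise_alignment = std::vector<std::pair<int, int> >;
using Substitution_score = std::function<double(char, char)>;

// Number of columns without a gap.
std::size_t matched_columns(const Pairwise_alignment& alignment)
{
    std::size_t matched = 0;
    for (const auto& column : alignment)
        if (column.first >= 0 && column.second >= 0)
            ++matched;
    return matched;
}

// Sum of the substitution scores over the columns without a gap.
double summed_substitution_score(const std::string& a, const std::string& b,
                                 const Pairwise_alignment& alignment,
                                 const Substitution_score& score)
{
    double total = 0.0;
    for (const auto& column : alignment)
        if (column.first >= 0 && column.second >= 0)
            total += score(a[static_cast<std::size_t>(column.first)],
                           b[static_cast<std::size_t>(column.second)]);
    return total;
}

// Identity substitution matrix: 1 for identical units, 0 otherwise.
double identity_score(char x, char y) { return x == y ? 1.0 : 0.0; }

// Identity percentage, normalized by the number of gap-free columns
// (assumption: the SBL may use another normalization).
double identity_percentage(const std::string& a, const std::string& b,
                           const Pairwise_alignment& alignment)
{
    const std::size_t matched = matched_columns(alignment);
    if (matched == 0)
        return 0.0;
    return 100.0 * summed_substitution_score(a, b, alignment, identity_score)
           / static_cast<double>(matched);
}
```

Passing a BLOSUM lookup instead of identity_score to summed_substitution_score would give the raw similarity sum in the same way.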

In addition to this functionality, the program $\text{sbl-match-PDB-residues-and-atoms.exe}$ provides the following:

  • 1-to-1 matching : given two atomic models of two polypeptide chains, the program matches the residues with a sequence alignment, then matches the atoms of matched residues; results are stored in two files, one with the residue ids of the matched residues, the other one with the atomic serial numbers of the matched atoms;

  • n-to-m matching : consider two molecular structures involving n and m polypeptide chains respectively; for each pair of chains out of the $n \times m$ possible pairs, the result of the previous calculation is reported (the number of matched residues and of matched atoms).

Examples

The following examples show how to use the two specializations of the alignment engines with Seqan and Apurva, and how to use the two provided modules.

Sequence alignments with Seqan

This example loads an input set of polypeptide chains from two PDB files and aligns their sequences. Then it dumps the aligner score from Seqan, together with the identity and similarity percentages. Note that a fifth argument is required corresponding to the occupancy policy, i.e., how atoms with an occupancy lower than 1 in the PDB file should be treated.

//Example call : ./example_alignment_engine_sequences.exe pdb1 chains1 pdb2 chains2 occupancy
#include <iostream>
//Representation of the atoms, with attached system's label
#include <SBL/Models/Atom_with_flat_info_and_annotations_traits.hpp>
//Load a PDB file
#include <SBL/Models/PDB_file_loader.hpp>
#include <SBL/CSB/Alignment_engine_sequences_seqan.hpp>
#include <SBL/CSB/Alignment_sequence.hpp>
int main(int argc, char *argv[])
{
if(argc < 6)
return -1;
//Loads a PDB file.
Molecular_geometry_loader loader;
loader.set_loaded_water(false);
loader.set_occupancy_factor(atoi(argv[5]));
loader.add_input_file_name(argv[1]);
loader.set_loaded_chains(0, argv[2]);
loader.add_input_file_name(argv[3]);
loader.set_loaded_chains(1, argv[4]);
loader.load(true, std::cout);
Sequence seq_1
(loader.get_geometric_model(0).residues_begin(),
loader.get_geometric_model(0).residues_end());
Sequence seq_2
(loader.get_geometric_model(1).residues_begin(),
loader.get_geometric_model(1).residues_end());
Alignment_engine engine(seq_1, seq_2);
//These matrices contain 26x26 values from Seqan 2.0.0
// engine.set_blosum_30();
// engine.set_blosum_45();
// engine.set_blosum_62();
engine.set_blosum_80();
engine.align();
std::cout << "Algorithm score : " << engine.get_score() << std::endl;
std::cout << "Identity percentage : " << engine.get_identity_percentage() << std::endl;
std::cout << "Similarity percentage : " << engine.get_similarity_percentage() << std::endl;
return 0;
}

See the reference manual of SBL::CSB::T_Alignment_sequence for the definition of the Sequence data structure.

The file "SBL/CSB/residue_to_one_letter_code.hpp" defines the function residue_to_one_letter_code(), which converts the 3-letter code of a residue to its 1-letter version.
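For readers unfamiliar with this conversion, a minimal stand-in could look as follows. The function name three_to_one_letter and the 'X' fallback for unknown residues are assumptions made for this sketch; the behavior of residue_to_one_letter_code() itself is described in the reference manual.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in: maps the 20 standard amino acids from their
// 3-letter code to their 1-letter code. Returns 'X' for anything else
// (assumption made for this sketch).
char three_to_one_letter(const std::string& residue_name)
{
    static const std::map<std::string, char> table = {
        {"ALA", 'A'}, {"ARG", 'R'}, {"ASN", 'N'}, {"ASP", 'D'}, {"CYS", 'C'},
        {"GLN", 'Q'}, {"GLU", 'E'}, {"GLY", 'G'}, {"HIS", 'H'}, {"ILE", 'I'},
        {"LEU", 'L'}, {"LYS", 'K'}, {"MET", 'M'}, {"PHE", 'F'}, {"PRO", 'P'},
        {"SER", 'S'}, {"THR", 'T'}, {"TRP", 'W'}, {"TYR", 'Y'}, {"VAL", 'V'}};
    const std::map<std::string, char>::const_iterator it = table.find(residue_name);
    return it != table.end() ? it->second : 'X';
}
```

The lookup is case-sensitive and expects upper-case names, as found in PDB files.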


Structural alignments with Apurva

This example loads an input set of polypeptide chains from two PDB files and aligns their structures. The coordinates of a residue are the coordinates of its $\alpha$-carbon. Then it dumps the aligner score from Apurva, the identity and similarity percentages, and the dRMSD and the lRMSD of the output alignment. Note that a fifth argument is required corresponding to the occupancy policy, i.e., how atoms with an occupancy lower than 1 in the PDB file should be treated.

//Example call : ./example_alignment_engine_sequences.exe pdb1 chains1 pdb2 chains2 occupancy
#include <iostream>
//Representation of the atoms, with attached system's label
#include <SBL/Models/Atom_with_flat_info_and_annotations_traits.hpp>
//Load a PDB file
#include <SBL/Models/PDB_file_loader.hpp>
#include <SBL/CSB/Alignment_engine_structures_apurva.hpp>
#include <SBL/CSB/Alignment_sequence.hpp>
int main(int argc, char *argv[])
{
if(argc < 6)
return -1;
//Loads a PDB file.
Molecular_geometry_loader loader;
loader.set_loaded_water(false);
loader.set_occupancy_factor(atoi(argv[5]));
loader.add_input_file_name(argv[1]);
loader.set_loaded_chains(0, argv[2]);
loader.add_input_file_name(argv[3]);
loader.set_loaded_chains(1, argv[4]);
loader.load(true, std::cout);
Structure struct_1
(loader.get_geometric_model(0).residues_begin(),
loader.get_geometric_model(0).residues_end());
Structure struct_2
(loader.get_geometric_model(1).residues_begin(),
loader.get_geometric_model(1).residues_end());
Alignment_engine engine(struct_1, struct_2);
//These matrices contain 26x26 values from Seqan 2.0.0
// engine.set_blosum_30();
// engine.set_blosum_45();
// engine.set_blosum_62();
engine.set_blosum_80();
engine.align();
std::cout << "Algorithm score : " << engine.get_score() << std::endl;
std::cout << "Identity percentage : " << engine.get_identity_percentage() << std::endl;
std::cout << "Similarity percentage : " << engine.get_similarity_percentage() << std::endl;
std::cout << "dRMSD : " << engine.get_dRMSD() << std::endl;
std::cout << "lRMSD : " << engine.get_lRMSD() << std::endl;
return 0;
}
The definition of the Structure data structure is the same as the Sequence data structure.


Note that Apurva can filter the alignment with Secondary Structure Elements (SSE), if these are provided. The Sequence data structure is designed so that Apurva uses the SSE information in the PDB file to filter the alignment.
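As a rough illustration of the dRMSD reported by the structural engines, the sketch below builds a Distance Difference Matrix over the matched units and derives a dRMSD from it. It assumes the common definition, namely the root mean square of the differences between corresponding intra-structure distances over all pairs of matched units; the normalization used by the SBL may differ. All type and function names are hypothetical.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

using Point = std::array<double, 3>;
using Matrix = std::vector<std::vector<double> >;
using Pairwise_alignment = std::vector<std::pair<int, int> >;

double euclidean_distance(const Point& p, const Point& q)
{
    const double dx = p[0] - q[0], dy = p[1] - q[1], dz = p[2] - q[2];
    return std::sqrt(dx * dx + dy * dy + dz * dz);
}

// DDM restricted to the gap-free columns of the alignment:
// entry (k,l) = | d_A(i_k, i_l) - d_B(j_k, j_l) |.
Matrix distance_difference_matrix(const std::vector<Point>& a,
                                  const std::vector<Point>& b,
                                  const Pairwise_alignment& alignment)
{
    std::vector<std::pair<std::size_t, std::size_t> > matched;
    for (const auto& column : alignment)
        if (column.first >= 0 && column.second >= 0)
            matched.emplace_back(static_cast<std::size_t>(column.first),
                                 static_cast<std::size_t>(column.second));

    const std::size_t n = matched.size();
    Matrix ddm(n, std::vector<double>(n, 0.0));
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t l = 0; l < n; ++l)
            ddm[k][l] = std::fabs(
                euclidean_distance(a[matched[k].first], a[matched[l].first]) -
                euclidean_distance(b[matched[k].second], b[matched[l].second]));
    return ddm;
}

// dRMSD: root mean square of the upper-triangular entries of the DDM.
double drmsd_from_ddm(const Matrix& ddm)
{
    const std::size_t n = ddm.size();
    if (n < 2)
        return 0.0;
    double sum = 0.0;
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t l = k + 1; l < n; ++l)
            sum += ddm[k][l] * ddm[k][l];
    const double number_of_pairs =
        0.5 * static_cast<double>(n) * static_cast<double>(n - 1);
    return std::sqrt(sum / number_of_pairs);
}
```

Unlike the lRMSD, this quantity only depends on internal distances, so it needs no superposition of the two structures.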

Module for sequence alignments

This example is similar to the example in section Sequence alignments with Seqan, except that it uses a module to perform the work. In particular, the alignment and the analysis are done within the module. It also offers the possibility to report the alignment into a Graphviz file.

//Example call : ./example_alignment_engine_sequences.exe pdb1 chains1 pdb2 chains2 occupancy
#include <iostream>
//Representation of the atoms, with attached system's label
#include <SBL/Models/Atom_with_flat_info_and_annotations_traits.hpp>
//Load a PDB file
#include <SBL/Models/PDB_file_loader.hpp>
#include <SBL/Modules/Alignment_sequences_module.hpp>
#include <SBL/CSB/Alignment_sequence.hpp>
//Module Traits
class Module_traits
{
public:
typedef double FT;
};//end class Module_traits;
int main(int argc, char *argv[])
{
if(argc < 6)
return -1;
//Loads a PDB file.
Molecular_geometry_loader loader;
loader.set_loaded_water(false);
loader.set_occupancy_factor(atoi(argv[5]));
loader.add_input_file_name(argv[1]);
loader.set_loaded_chains(0, argv[2]);
loader.add_input_file_name(argv[3]);
loader.set_loaded_chains(1, argv[4]);
loader.load(true, std::cout);
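//Runs the alignment and dumps statistics.
Alignment_sequences_module alignment_module;
//The module stores pointers to the two sequences, set through get_first_sequence()
//and get_second_sequence(). Assumption: the nested type is named Sequence.
alignment_module.get_first_sequence() = new Alignment_sequences_module::Sequence
(loader.get_geometric_model(0).residues_begin(),
loader.get_geometric_model(0).residues_end());
alignment_module.get_second_sequence() = new Alignment_sequences_module::Sequence
(loader.get_geometric_model(1).residues_begin(),
loader.get_geometric_model(1).residues_end());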
alignment_module.run(true, std::cout);
alignment_module.statistics(std::cout);
alignment_module.report("alignment_module_");
delete alignment_module.get_first_sequence();
delete alignment_module.get_second_sequence();
}

Module for structural alignments

This example is similar to the example in section Structural alignments with Apurva, except that it uses a module to perform the work. In particular, the alignment and the analysis are done within the module. It also offers the possibility to report the alignment into a Graphviz file.

//Example call : ./example_alignment_engine_sequences.exe pdb1 chains1 pdb2 chains2 occupancy
#include <iostream>
//Representation of the atoms, with attached system's label
#include <SBL/Models/Atom_with_flat_info_and_annotations_traits.hpp>
//Load a PDB file
#include <SBL/Models/PDB_file_loader.hpp>
#include <SBL/Modules/Alignment_structures_module.hpp>
#include <SBL/CSB/Alignment_sequence.hpp>
//Module Traits
class Module_traits
{
public:
typedef double FT;
};//end class Module_traits;
int main(int argc, char *argv[])
{
if(argc < 6)
return -1;
//Loads a PDB file.
Molecular_geometry_loader loader;
loader.set_loaded_water(false);
loader.set_occupancy_factor(atoi(argv[5]));
loader.add_input_file_name(argv[1]);
loader.set_loaded_chains(0, argv[2]);
loader.add_input_file_name(argv[3]);
loader.set_loaded_chains(1, argv[4]);
loader.load(true, std::cout);
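//Runs the alignment and dumps statistics.
Alignment_structures_module alignment_module;
//The module stores pointers to the two structures, set through get_first_structure()
//and get_second_structure(). Assumption: the nested type is named Structure.
alignment_module.get_first_structure() = new Alignment_structures_module::Structure
(loader.get_geometric_model(0).residues_begin(),
loader.get_geometric_model(0).residues_end());
alignment_module.get_second_structure() = new Alignment_structures_module::Structure
(loader.get_geometric_model(1).residues_begin(),
loader.get_geometric_model(1).residues_end());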
alignment_module.run(true, std::cout);
alignment_module.statistics(std::cout);
alignment_module.report("alignment_module_");
}