Structural Bioinformatics Library
Template C++ / Python API for developping structural bioinformatics applications.
User Manual

Alignment_engines

Authors: F. Cazals and T. Dreyfus and S. Marillet and R. Tetley

Introduction

Define an alignment engine as an alignment algorithms producing suitable statistics. This package unifies the two main ways alignments are done in bioinformatics, namely sequence alignments and structural alignments:

  • sequence alignments aim to match units (amino acids or nucleotids) of multiple sequences only regarding their positions in the sequences and some similarity between these units. Typically, a substitution score matrix determines the penalties for matched units that are not identical, or for creating or prolonging gaps (i.e, sequences of units that do not match any unit in the other sequences).

  • structural alignments consider also the geometry, meaning that the position of the atoms will play a fundamental role in the matching of the units of the structure.

Defining a common framework for both allows factoring out common bits, in particular statistics on the aligned sequences. Note also that this package does not provide any algorithm, per se, as sequence and structural alignments are borrowed from external libraries.

Implementation

Alignment engines are designed so that it is easy to wrap an existing sequence or structural alignment algorithm into the engines. In order to do so, we first establish the terminology used in the package, then explain the design choices.

Terminology

An alignment unit , or unit for short, is the smallest biophysical entity to be aligned. In general, a unit is an amino-acid or a nucleotide.


A sequence is a succession of units, each unit being characterized by a name and a position in the sequence. The position of a unit in the sequence is also referred as the index of the unit within the sequence. Indices follow the $0..n-1$ convention. If each unit is also equipped with cartesian coordinates, it is geometrically embedded and the sequence may also be called a structure.


A SOS is an abstract entity representing either a Sequence Or a Structure.


An aligner is an algorithm aligning multiple SOS. An alignment is the output of an aligner, that is a list of tuples of indices of matching units. Gaps are represented by negative indices (-1 by default).


An alignment engine is a framework that uses an aligner for aligning multiple SOS, and provide useful statistics for analyzing the output of the aligner.


The alignment engines provided by this package handle only pairwise alignments, so that the tuples are pairs of indices.


Design

The alignment engines are designed so has to be independent from the aligners, meaning that aligners are template parameters of the engines. For this reason, the main difference between structural and sequence alignment engines are the analysis : structural sequence alignment engines provide geometric analysis in addition to all the analysis provided by the sequence alignment engines.

For this reason, structural alignment engines are seen as specializations of sequence alignment engines. Thus, there are two main classes :

  • SBL::CSB::T_Alignment_engine< SequenceOrStructure , AlignerAlgorithm , FT > : the base class for sequence alignments. The parameter SequenceOrStructure is the representation of a SOS, AlignerAlgorithm is the aligner itself, and FT is the number type used for rendering the analysis.

  • SBL::CSB::T_Alignment_engine_structures< StructureType , AlignerAlgorithm , FT , MolecularDistance > : the base class for structural alignments. The parameter StructureType is the representation of a structure, AlignerAlgorithm is the structural aligner itself, FT is the number type used for rendering the analysis, and MolecularDistance is a distance between two structures for comparing the output alignments.

The different template parameters have a number of requirements :

  • SequenceOrStructure : defines the base types used in the engine (Alignment_unit, Alignment_unit_name for the name of the unit, and Alignment_unit_rep for the index of the unit), and provides an array operator for accessing to any unit of the SOS at a given index. The type Alignment_unit must provide the method get_name() returning the name of the unit.

  • StructureType : in addition to the requirements for SequenceOrStructure , the type Alignment_unit has to also provide the methods x(), y() and z() returning the cartesian coordinates of the unit.

  • AlignerAlgorithm : a simple functor taking as argument references to the two SOS to align, and returning the score of the algorithm together with the alignment.

  • FT : there is no particular requirement, and the default type used is double.

  • MolecularDistance : see the package Molecular_distances for a complete description of this parameter. It is set by default to the class SBL::CSB::T_Least_RMSD_cartesian.

Both engines are generic and can handle a variety of aligners. For each alignment engine, a specialization with a given aligner is provided:

  • SBL::CSB::T_Alignment_engine_sequences_seqan< SequenceType , SeqanSequenceConverter , FreeEndsAlignment , ScoreType , SeqanUnitType , SeqanCustomMatrix , SeqanAlgorithm > : specialization of the sequence alignment engine with aligners from the Seqan library. The parameter SequenceType has the exact same requirements as the previous parameter SequenceOrStructure . The parameter SeqanSequenceConverter is a functor converting the name of a unit in SequenceType to the name of a unit in Seqan (by default, it does nothing). All other parameters are specific to Seqan, and we recommend the user to visit the documentation of the Seqan library.

  • SBL::CSB::T_Alignment_engine_structures_apurva< StructureType , FT , MolecularDistance > : specialization of the structural alignment engine with aligners from Apurva [11] . All parameters have been already described previously.

Functionality

There are three sets of functionality :

  • the generic alignment analysis : all engines provide at least the score of the aligner (depending on the aligner algorithm), the percentage of similarity (that is the score obtained by summing the substitution score of paired entities) and identity (that is the percentage of similarity for the identity substitution matrix) of an alignment given an input substitution matrix, and the possibility to view the alignment using Graphviz .

  • the structural alignment analysis : structural alignment engines provide in addition the Distance Difference Matrix (DDM), the dRMSD computed from the DDM, and the molecular distance between the two aligned structures.

  • the modules : structural and sequence alignment engines are embedded in two modules for a simple use within the Module framework of the SBL : SBL::Modules::T_Alignment_sequences_module and SBL::Modules::T_Alignment_structures_module

In addition of these functionality, the program $\text{sbl-match-PDB-residues-and-atoms.exe}$ provides the following functionality:

  • 1-to-1 matching : given two atomic models of two polypeptide chains, the program matches the residues with a sequence alignment, then matches the atoms of matched residues; results are stored in two files, one with the residue ids of the matched residues, the other one with the atomic serial numbers of the matched atoms;

  • n-to-m matching : consider two molecular structures involving and n and m polypetide chains respectively, for each pair out of (n,m), the result of the previous calculation is reported (the number of matched residues and of matched atoms).

Multiple Sequence Alignments

We consider a set of sequences and possibly associated structures; $n_i$ stands for the number of residues in the structure, and $s_i$ the length of the associated primary sequence.

Data structures for MSA

The class SBL::CSB::T_Multiple_sequence_alignment_DS to store MSA provided FASTA files.

In working with MSA, one must handle - symbols which account for unaligned units. This is done using two mechanisms:

A MSA position is valid provided there are at most $t$ dashes in the corresponding MSA column. The number of valid positions is denoted $N_v\leq N$, with $N$ the length (number of columns) of the MSA.


If $t=0$, a column in the MSA is valid iff no sequence contains - at this position.

In using MSA, three types of indices are of interest:

  • the alignment index in $0..N-1$,
  • the sequence/chain index for each sequence, in $0.. s_i-1$
  • the valid alignment index, in $0..N_v-1$.

Consider the following types:

typedef std::vector<unsigned> Alignment_index_to_chain_index;
typedef std::vector<unsigned> Chain_index_to_alignment_index;
typedef std::vector<Alignment_index_to_chain_index> Alignment_index_to_chain_index_table;
typedef std::vector<Chain_index_to_alignment_index> Chain_index_to_alignment_index_table;

Using them, the following data structures are provided to navigate between alignment indices, sequence indices, and valid alignment indices:

Alignment_index_to_chain_index_table m_alignment_idx_to_chain_idx;
Chain_index_to_alignment_index_table m_chain_idx_to_alignment_idx;
Alignment_index_to_chain_index_table m_valid_alignment_idx_to_chain_idx;
Chain_index_to_alignment_index_table m_chain_idx_to_valid_alignment_idx;

Practically, one calls the function m_msa->get_valid_length() to get the number $N_v$ of valid positions. Then, with $i\in 0..N_v-1$, the sequence index is obtained via

m_msa->valid_alignment_index_to_sequence_index(chain_index,i)

Contiguous indexers

An important case is that where the sequences of the MSA are accompanied by structures.

The class SBL::CSB::T_Chains_residues_contiguous_indexer is instantiated from a set of structures – e.g. polypeptide chains of the package Protein_representation. For each sequence/structure, it contains two pieces of information.


Structural information. Assume the i-th chain has gaps–some residues have not been reported. In that case, $n_i<s_i$. The contiguous index of a residue is an integer in the range $0..n_i-1$.

For each sequence/structure, the class SBL::CSB::T_Chains_residues_contiguous_indexer records a map from $0..s_i-1$ to $0..n_i-1$ : it maps a resid (minus one to be precise) to a contiguous index in $0..n_i-1 \cup {\infty }$. This map can be queried to know whether a particular residue is present in the structure.


Sequence information. A sequence can be retrieved from the seqres field of the structure file. Or it can be built while parsing the atoms of the structure. In this latter case, if $m_i$ is the largest residue sequence number in the i-th structure, the corresponding sequence has length $m_i$. Absent residues are denoted with the letter X in the sequence.

Data structures for filtered MSA

Def. def-valid-msa focuses solely on sequences in the MSA. We enrich it to account for the presence of residues in structures:

Consider a MSA and the associated structures. A position of the MSA is valid if it is valid if the sense of Def. def-valid-msa-bis&ndash;with no gap tolerated, and if the residue associated to a MSA position is present in every structure.


The class SBL::CSB::T_Multiple_sequence_alignment_DS_filtered specializes SBL::CSB::T_Multiple_sequence_alignment_DS. In particular, the function compute_nb_missing_residue_per_position() which is used to identify the valid position, incorporates the call to a functor providing the is_missing() function:

// parse all sequences and count missing residues at position j of the MSA
for(i=0; i<Base::m_MSA.size(); i++){
bool missing_in_seq = Base::m_MSA[i][j] == Base::m_alignment_blank_unit;
bool missing_in_struct = (m_unit_filter != NULL && m_unit_filter->is_missing(i, Base::m_alignment_idx_to_chain_idx[i][j]));
missing_at_j += missing_in_seq || missing_in_struct;
}

See the section below for a toy example using such a functor.

A non trivial example is provided in the package Spectrus-package, where valid positions in the MSA are invalidated if one (or several) structures miss a particular residue, using an instance of the class SBL::CSB::T_Chains_residues_contiguous_indexer .

Examples

The following examples show how to use the two specializations of the alignment engines with Seqan and Apurva, and how to use the two provided modules.

Sequence alignments with Seqan

This example loads an input set of polypeptidic chains from two PDB files and align their sequences. Then it dumps the aligner score from Seqan, together with the identity and similarity percentages. Note that a fifth argument is required corresponding to the occupancy policy, i.e how atoms with an occupancy lower than 1 in the PDB file should be treated.

See the reference manual of SBL::CSB::T_Alignment_sequence for the definition of the Sequence data structure.

The file "SBL/CSB/residue_to_one_letter_code.hpp" defines the method residue_to_one_letter_code(), which converts the 3 letters code of a residue to its 1 letter version.


Structural alignments with Apurva

This example loads an input set of polypeptidic chains from two PDB files and align their structures. The coordinates of the residues are the coordinates of the carbon $\alpha$ of the residue. Then it dumps the aligner score from Apurva, the identity and similarity percentages, and the dRMSD and the lRMSD of the output alignment. Note that a fifth argument is required corresponding to the occupancy policy, i.e how atoms with an occupancy lower than 1 in the PDB file should be treated.

The definition of the Structure data structure is the same as the Sequence data structure.


Note that Apurva can filter the alignment with Secondary Structure Elements (SSE), if these are provided. The Sequence data structure is designed so that Apurva uses the SSE information in the PDB file to filter the alignment.

Module for sequence alignments

This example is similar to the example in section Sequence alignments with Seqan, except that it uses a module to perform the work. In particular, the alignment and the analysis are done within the module. It also offers the possibility to report the alignment into a Graphviz file.

Module for structural alignments

This example is similar to the example in section Structural alignments with Apurva, except that it uses a module to perform the work. In particular, the alignment and the analysis are done within the module. It also offers the possibility to report the alignment into a Graphviz file.

MSA: filtered versus non filtered

We provide a toy example to compare the data structures SBL::CSB::T_Multiple_sequence_alignment_DS and SBL::CSB::T_Multiple_sequence_alignment_DS_filtered .


Toy example. To describe the processing done in sbl-test-msa-ds.cpp (see snippet below), consider the following alignment between two chains of length 7:

> cat  ../../../data/fictitious-2prots.fasta
>A
AAA-AAA
>B
A-AAAAA

Also consider the following example filter, which retains even columns of the MSA

// Test: even residues missing
class AA_even_index_filter{
public:
bool is_missing(unsigned seq_index, unsigned chain_index){return chain_index <= 2;}
};
// No filter
// Previous filter
Definition: Multiple_sequence_alignment_DS_filtered.hpp:59
Definition: Multiple_sequence_alignment_DS.hpp:75

Upon building instances of the classes MSA_one_letter_code and MSA_one_letter_code_filtered on our toy alignment with two sequences, we get:

=== MSA
Chain and size:A 6
AAA-AAA
Chain and size:B 6
A-AAAAA
File chain is A; 7 indices in the table m_alignment_index_to_chain_index
0 1 2 4294967295 3 4 5 
File chain is B; 7 indices in the table m_alignment_index_to_chain_index
0 4294967295 1 2 3 4 5 
File chain is A; 5 indices in the table m_valid_alignment_idx_to_chain_idx
0 2 3 4 5 
File chain is B; 5 indices in the table m_valid_alignment_idx_to_chain_idx
0 1 3 4 5 


=== MSA-filtered
Chain and size:A 6
AAA-AAA
Chain and size:B 6
A-AAAAA
File chain is A; 7 indices in the table m_alignment_index_to_chain_index
0 1 2 4294967295 3 4 5 
File chain is B; 7 indices in the table m_alignment_index_to_chain_index
0 4294967295 1 2 3 4 5 
File chain is A; 3 indices in the table m_valid_alignment_idx_to_chain_idx
3 4 5 
File chain is B; 3 indices in the table m_valid_alignment_idx_to_chain_idx
3 4 5 

Note in particular that if one filters out - and position for which sequences indices are less than 2, sequences indices 3, 4 and 5 only remain valid. (NB: sequence indices are sequence specific and run from 0 to 5 for each of our two toy sequences.)

See the snippet