Structural Bioinformatics Library
Template C++ / Python API for developping structural bioinformatics applications.

Authors: F. Cazals and R. Tetley
Structural motifs. Structural alignment methods for polypeptide chains aim at identifying conserved structural motifs between structures.
Global structural alignment methods often fail to exploit local properties, i.e. locally conserved distances. Phrased differently, for similarity measures such as (see Molecular_distances and Molecular_distances_flexible), large values between two structures possibly obliterates regions of smaller size and with a much better structural conservation.
This package provides a method identifying locally conserved structural motifs shared by two polypeptide chains whose structure is known. In a nutshell, a structural motif consists in two sets of a.a., one on each structure, such that the of the motif is significantly smaller than that associated with a global alignment between the two structures:
Applications involving structural motifs. Motifs may be used to compute combined as done in Molecular_distances_flexible , and/or investigate the homology between proteins using profile hidden Markov models as done in FunChaT .
In the sequel, we recall the main concepts and the four steps used to identify motifs, referring the reader to [30] for the details.
For the sake of conciseness, we use structure and polypeptide chain interchangeably; likewise, structural motif and motif are used interchangeably.
Consider two polypeptide chains and of length and . Also consider a seed alignment between them; let be its length, namely the number of a.a. matched.
The alignment consists of pairs of a.a. mapped with one another, and can be computed using various methods. In this package, we use the following two aligners:
Denoting the distance between the carbons of indices i and j on chain , and likewise on chain , an assessment of the matching is provided by the distance difference matrix (DDM), a symmetric matrix whose entries, also called scores, are given by:
Note that these distances qualify the conservation of distances between between the two structures.
In mathematics, a filtration is a (finite) sequence of nested spaces [62] . Our method uses a filtration which may be the conserved distances (CD) filtration or the space filling diagram (SFD) filtration.
Conserved distances (CD) filtration. For each molecule, consider a graph whose vertices represent a.a., and edges are associated with all pairs of a.a (complete graph). Each edge is weighted by the score . Assume that edges have been sorted in ascending order.
To each edge weight–say , corresponds a graph containing only those edges whose weight is . In processing edges by increasing value, these graphs are nested. We call this sequence of graphs the conserved distances (CD) filtration.
Space filling diagram (SFD) filtration. Molecular models defined by union of balls are called space filling diagrams (SFD). Geometric and topological properties of such diagrams are encoded in the socalled complex (see Alpha_complex_of_molecular_model , as well as Space_filling_model_surface_volume ).
Upon sorting the scores by ascending value, define the absolute rank of a given a.a. as the smallest index of a score involving that . Finally, let the rank of a given a.a. be the index of its absolute rank (see details in [30] ).
For each molecule, to each index–say , corresponds a SFD containing only those a.a. whose index is . In processing indices by increasing value, these SFD are nested. We call this sequence of SFD the SFD filtration.
Persistence diagram. For either filtration, let us focus on connected components, which appear and disappear when inserting edges (CD filtration) or a.a. (SFD filtration). A unionfind data structure can be used to maintain these c.c. [46], and track in particular the birth date and the death date of each component.
The collection of all these points defines a 2D diagram called the persistence diagram [62] . Note that all points lie above the diagonal ; moreover, the distance of a point to the diagonal is the lifetime of the c.c. associated to that point.
To compute persistence diagrams, we use the Morse_theory_based_analyzer package.
As previously stated, for each structure we compute a persistence diagram, respectively denoted and .
Two points located in the same region of their respective PD have comparable birth and death dates. We identify such pairs of points. Denote and two such points. We say that and are comparable if the Euclidean distance between the two points (computed by using their respective birth and death dates as coordinates) is below a given threshold .
When two points are comparable, we identify motifs as follows:
As detailed in [30] , motifs are filtered in two ways:
As detailed in [30], motifs can be used to compute the edgeweighted (see Molecular_distances_flexible). Additionally, structural motifs can be used to seed an iterative aligner (see the package Iterative_alignment).
As explained above, the detection of motifs hinges on two ingredients:
Combining these yields four methods:
These four methods are implemented within the following executables: each giving access to the CD and SFD filtrations:
We illustrate how to run the program on one executable. The other executables are used in the same manner.
The main inputs of any of the aforementioned programs are as follows:
A basic calculation is launched as follows:
> sblstructuralmotifsconformations.exe pdbfile data/1URZ.pdb chains A pdbfile data/1SVB.pdb chains A allowincompletechains p 3 motifsize 20 lrmsdthreshold 0.5 d results/conformationssf v l loaddomains data/1URZ.dom loaddomains data/1SVB.dom
File Name  Description 
PDB file  A PDB file (TBEV postfusion glycoprotein) 
Domains specification file  A .dom domain specification file) 
The output provides statistics for all the structural motifs that were found along with a PyMol script that can be used to visualize them.
Rigid blocks: Two main blocks found by the executable. Images generated using the outputted pymol scripts 
File Name  Description 
Plain text file  Run log. 
Plain text file  The initial alignment (for rerunning purposes). 
XML archive file  XML Archive of all the found structural motifs. 
XML archive file  XML Archive of the socalled motif graph (see Molecular_distances_flexible. 
GraphViz file  Hasse diagram of motifs. Each motif is identified with a unique index 
To further investigate each structural motif, the executable outputs PyMol scripts. In order to run the script corresponding to, say motif 8, in the previous run, the user should follow these steps:
load data/1URZ.pdb load data/1SVB.pdb
run results/conformationssf/sblstructuralmotifsconformations__block_8_script.py
A screenshot of the PyMol window after running one of the outputted pymol scripts 
To be added.
The implementation is straightforward: each step described in the previous section is computed in a module, which are all linked together in the following workflow.
The package Structural_motifs uses two novel data structures, the filtration graphs. These were implemented in the classes:
These two classes allow to switch between the different types of filtration.
The package Structural_motifs is centered around two main new modules:
The package also uses the following modules: