Structural Bioinformatics Library
Template C++ / Python API for developping structural bioinformatics applications.
User Manual

Structural_motifs

Authors: F. Cazals and R. Tetley

Introduction

Structural motifs. Structural alignment methods for polypeptide chains aim at identifying conserved structural motifs between structures.

Global structural alignment methods often fail to exploit local properties, i.e. locally conserved distances. Phrased differently, for similarity measures such as $\lrmsd$ (see Molecular_distances and Molecular_distances_flexible), large values between two structures possibly obliterates regions of smaller size and with a much better structural conservation.

This package provides a method identifying locally conserved structural motifs shared by two polypeptide chains whose structure is known. In a nutshell, a structural motif consists in two sets of a.a., one on each structure, such that the $\lrmsd$ of the motif is significantly smaller than that associated with a global alignment between the two structures:

$ \lRMSDRatio \leq \tau $


Applications involving structural motifs. Motifs may be used to compute combined $\rmsd$ as done in Molecular_distances_flexible , and/or investigate the homology between proteins using profile hidden Markov models as done in FunChaT .

Prerequisites

In the sequel, we recall the main concepts and the four steps used to identify motifs, referring the reader to [30] for the details.

For the sake of conciseness, we use structure and polypeptide chain interchangeably; likewise, structural motif and motif are used interchangeably.

Step 1: Computing the seed alignment and its scores

Consider two polypeptide chains $A$ and $B$ of length $n_A$ and $n_B$. Also consider a seed alignment between them; let $N \leq \min(n_A, n_B)$ be its length, namely the number of a.a. matched.

The alignment consists of pairs of a.a. mapped with one another, and can be computed using various methods. In this package, we use the following two aligners:

  • $\text{\apurva}$ is a structural aligner based on contact maps, favoring long and flexible alignments [9]. See the package Apurva .
  • $\text{\kpax}$ is a structural aligner based on a geometric representation of the backbone, favoring a geometric measure known as the G-score [122]. See the package Iterative_alignment .

Denoting $\distaij$ the distance between the $\Calpha$ carbons of indices i and j on chain $A$, and likewise on chain $B$, an assessment of the matching is provided by the distance difference matrix (DDM), a $N \times N$ symmetric matrix whose entries, also called scores, are given by:

$ \scoreij = \fabs{\distaij - \distbij}, i = 1, \dots, N, j=1, \dots, N $


Note that these distances qualify the conservation of distances between $\Calpha$ between the two structures.

Step 2: Building the filtration and its space filling diagram

In mathematics, a filtration is a (finite) sequence of nested spaces [62] . Our method uses a filtration which may be the conserved distances (CD) filtration or the space filling diagram (SFD) filtration.

Conserved distances (CD) filtration. For each molecule, consider a graph whose vertices represent a.a., and edges are associated with all pairs of a.a (complete graph). Each edge is weighted by the score $\scoreij$. Assume that edges have been sorted in ascending order.

To each edge weight–say $w$, corresponds a graph containing only those edges whose weight is $\leq w$. In processing edges by increasing value, these graphs are nested. We call this sequence of graphs the conserved distances (CD) filtration.

Space filling diagram (SFD) filtration. Molecular models defined by union of balls are called space filling diagrams (SFD). Geometric and topological properties of such diagrams are encoded in the so-called $\alpha$-complex (see Alpha_complex_of_molecular_model , as well as Space_filling_model_surface_volume ).

Upon sorting the scores $\scoreij$ by ascending value, define the absolute $\Calpha$ rank of a given a.a. as the smallest index of a score involving that $\Calpha$. Finally, let the $\Calpha$ rank of a given a.a. be the index of its absolute $\Calpha$ rank (see details in [30] ).

For each molecule, to each index–say $i$, corresponds a SFD containing only those a.a. whose index is $\leq i$. In processing indices by increasing value, these SFD are nested. We call this sequence of SFD the SFD filtration.

Persistence diagram. For either filtration, let us focus on connected components, which appear and disappear when inserting edges (CD filtration) or a.a. (SFD filtration). A union-find data structure can be used to maintain these c.c. [46], and track in particular the birth date and the death date of each component.

The collection of all these points ${(\text{birth date}, \text{death date})}$ defines a 2D diagram called the persistence diagram [62] . Note that all points lie above the diagonal $y=x$; moreover, the distance of a point to the diagonal is the lifetime of the c.c. associated to that point.

To compute persistence diagrams, we use the Morse_theory_based_analyzer package.

Step 3: Computing structural motifs

As previously stated, for each structure we compute a persistence diagram, respectively denoted $\persDiagA$ and $\persDiagB$.

Two points located in the same region of their respective PD have comparable birth and death dates. We identify such pairs of points. Denote $a \in \persDiagA$ and $b\in \persDiagB$ two such points. We say that $a$ and $b$ are comparable if the Euclidean distance between the two points (computed by using their respective birth and death dates as coordinates) is below a given threshold $\tau_{PD}$.

When two points are comparable, we identify motifs as follows:

  • First, we compute the connected components associated to the sublevel set of the graph (CD filtration), or SFD (SFD filtration).
  • Second, we perform a structural alignment between all pairs of connected components.
  • Third, we selected the alignments satisfying the $\text{\lrmsd}$ ratio of Eq. eq-lrmsdratio .
Practically, we impose constraints on motifs:
  • comparable points in PD: we impose a distance between points of at most $\tau_{PD} (= 20)$
  • size constraint: a motif should contain at least $s_0 (= 20)$ a.a.
  • $\text{\lrmsd}$ ratio: should be at most $\tau (= 0.7)$


Note that for a given structure, a structural motif is not necessarily connected, neither on the structure, nor on the sequence. This stems from the fact that a motif is defined upon performing a structural alignment on two connected components (from sublevel sets of the CD or SFD filtration). In practice, though, motifs are generally connected on the structure, but not on the sequence. See details in the companion paper [30] .


Step 4: Filtering structural motifs

As detailed in [30] , motifs are filtered in two ways:

  • Motif inclusion. A Hasse diagram is used to detect the nestedness of motifs. The Hasse diagram is defined from the partial order induced by the inclusion between sets of a.a. defining structural motifs.
  • Statistical significance. The statistical significance of motifs is obtained via a two-sample-test p-value calculation (Wilcoxon Mann-Whitney), comparing the motifs obtained against random motifs.

Bonus steps: combined RMSD and seeding iterative alignment

As detailed in [30], motifs can be used to compute the edge-weighted $\text{\rmsdcomb}$ (see Molecular_distances_flexible). Additionally, structural motifs can be used to seed an iterative aligner (see the package Iterative_alignment).

Using structural motif finders

Executables

As explained above, the detection of motifs hinges on two ingredients:

  • An initial seed alignment. To handle two different proteins, we use two seed aligners ( $\text{\apurva}$, $\text{\kpax}$). To handle two conformations of the same protein, the trivial identity alignment is used.
  • A filtration. We use two options: the filtration of conserved distances (CD), and the filtration based on space filling diagrams (SFD).

Combining these yields four methods:

  • Aligners requiring an alignment: $\text{\alignerkpaxcd}$, $\text{\alignerkpaxsfd}$, $\text{\alignerapurvacd}$, $\text{\alignerapurvasfd}$ .
  • Aligners with the identity alignment: $\text{\aligneridcd}$, $\text{\aligneridsfd}$.

These four methods are implemented within the following executables: each giving access to the CD and SFD filtrations:

  • $\text{\sblsmotifsapurva}$: structural motifs for proteins, using the $\text{\apurva}$ aligner
  • $\text{\sblsmotifskpax}$: structural motifs for proteins, using the $\text{\kpax}$ aligner
  • $\text{\sblsmotifsconf}$: structural motifs for conformations
By default, all the executables search for structural motifs which are then used to compute the $\text{\rmsdcomb}$ and seed an iterative alignment. The results of these two bonus calculations are always output.


Input: Specifications and File Types

We illustrate how to run the program on one executable. The other executables are used in the same manner.

The main inputs of any of the aforementioned programs are as follows:

  • A pair of proteins, in PDB file format.
  • Possibly a specification of domains for these proteins. This enables searching for motifs in nested regions of the protein. The specification file is a simpler version of the one described in MolecularSystemLabelsTraits : one line per domain containing in sequential order the domain name, the number of residue ranges and the residue sequence number ranges (examples below).

A basic calculation is launched as follows:

> sbl-structural-motifs-conformations.exe --pdb-file data/1URZ.pdb --chains A --pdb-file data/1SVB.pdb --chains A --allow-incomplete-chains -p 3 --motif-size 20 --lrmsd-threshold 0.5 -d results/conformations-sf -v -l --load-domains data/1URZ.dom --load-domains data/1SVB.dom
The main options of the program $\text{\sblsmotifsapurva}$ are:
–pdb-file string: PDB file
–chains string: which chain to load
–allow-incomplete-chains: Allow loading incomplete structures
-p: choose the occupancy selection strategy
–motif-size: The motif size threshold $\text{$s_0$}$
–lrmsd-threshold: The $\text{\lrmsd}$ threshold $\text{$\tau$}$
–load-domains: The domains spec file. Domains are defined as residue sequence number ranges.


If no domains are specified using the –load-domains option, the executable searches for a file with the same prefix as the PDB file but with the .dom extension in the PDB file directory.


File Name

Description

PDB file

A PDB file (TBEV post-fusion glycoprotein)

Domains specification file

A .dom domain specification file)

Input files for the run described in section Input: Specifications and File Types .

Output

The output provides statistics for all the structural motifs that were found along with a PyMol script that can be used to visualize them.

Rigid blocks: Two main blocks found by the executable. Images generated using the outputted pymol scripts

File Name

Description

Plain text file

Run log.

Plain text file

The initial alignment (for re-running purposes).

XML archive file

XML Archive of all the found structural motifs.

XML archive file

XML Archive of the so-called motif graph (see Molecular_distances_flexible.

GraphViz file

Hasse diagram of motifs. Each motif is identified with a unique index

Output files for the run described in section Input: Specifications and File Types .

Using the PyMol script

To further investigate each structural motif, the executable outputs PyMol scripts. In order to run the script corresponding to, say motif 8, in the previous run, the user should follow these steps:

  • Open PyMol
  • Load each of the structures. In PyMol:
    load data/1URZ.pdb
    load data/1SVB.pdb
    
  • Run the script corresponding to said motif. In PyMol:
    run results/conformations-sf/sbl-structural-motifs-conformations__block_8_script.py
    

A screenshot of the PyMol window after running one of the outputted pymol scripts
PyMol scripts for viewing the final iterative alignment are also outputted (sbl-structural-motifs-conformations__iterative_alignment.py).


Using the VMD script

To be added.

Implementation and functionalities

The implementation is straight-forward: each step described in the previous section is computed in a module, which are all linked together in the following workflow.

Main classes

The package Structural_motifs uses two novel data structures, the filtration graphs. These were implemented in the classes:

  • SBL::CSB::T_Filtration_graph_SFD< Structure , AlphaComplex >. Given an input structure, it builds a vertex weighted graph. This graph corresponds to the history of contacts between residues upon inserting them in a given order. The order is associated with a filtration function, in our case the $\Calpha$ ranks. The first template parameter is the representation of a proteic structure. It should contain residues which provide access to their $\Calpha$ ranks as well as to their usual information. The second parameter is an alpha complex. Practically, we use the class defined in Alpha_complex_of_molecular_model.
  • SBL::CSB::T_Filtration_graph_CD< Structure >. Given an input structure, it builds an edge weighted graph. The vertices are the set of residues, and the edges correspond to all pairs of a.a. (complete graph). The edge weights correspond to the order in which a residue pair appears in the ordering of $\text{$s_{ij}$}$ scores. The template parameter is the representation of a proteic structure. .

These two classes allow to switch between the different types of filtration.

Modules

The package Structural_motifs is centered around two main new modules:

The package also uses the following modules:

SBL::Modules::T_Iterative_alignment_module< ModuleTraits > was created as a new module in this package because it goes beyond the use cases of the module in Iterative_alignment.


T_Structural_motifs_workflow: