Structural Bioinformatics Library
Template C++ / Python API for developping structural bioinformatics applications.
User Manual

Molecular_distances_flexible

Authors: F. Cazals and R. Tetley

Goals

Goal. The root mean square deviation (RMSD) and the least RMSD which are provided in the companion package Molecular_distances are two widely used similarity measures in structural bioinformatics. Yet, they stem from global comparisons, possibly obliterating locally conserved motifs.

To foster our understanding of molecular flexibility, this package, based upon developments presented in [29], provides so-called combined $\rmsd$, which mixes independent $\lrmsd$ measures, each computed with its own optimal rigid motion.

The structural units which may be domains, subdomain, SSE, etc, are defined using the machinery from the package MolecularSystemLabelsTraits

The combined $\rmsd$ can be used to compare (quaternary) structures based on motifs defined from the sequence (domains, SSE), or to compare structures based on structural motifs yielded by local structural alignment methods. To handle these situations, the following three executables are provided:

  • $\text{\sblrmsdflexconf}$ to compare conformations of a molecule–which can be done without computing a non-trivial alignment,
  • $\text{\sblrmsdflexprot}$ to compare homologous proteins,
  • $\text{\sblrmsdflexmot}$ to compare structural motifs – see also the companion package Structural_motifs
As noticed above, regions are defined using the labels, as indicated in the package MolecularSystemLabelsTraits. In the sequel, it is assumed that regions involve 4 or more residues. If not, a warning is issued, and the $\text{\lrmsd}$ output for the corresponding comparisons is set to -1.


Modes. As in the package Molecular_distances, four modes are provided:

  • Calpha atoms: distances are computed on Calpha.
  • Backbone atoms: only Calpha, C and N atoms.
  • Heavy atoms: distances are computed on all atoms but hydrogen atoms.
  • All atoms: distances are computed on all atoms, including hydrogen atoms.

Combined RMSD for proteins

Pre-requisites

Vertex weighted $\lrmsd$. Consider two point sets $A=\{ a_i \}$ and $B=\{ b_i \}$ of size $N$. Naturally, each point corresponds to an atom or pseudo-atom – which we generically call a particle.

Also consider a set of positive weights $\{ w_i\}_{i=1,\dots, N}$, meant to stress the importance of certain points. The weighted $\rmsd$ reads as

\begin{equation} \rmsdw{A}{B} = \sqrt{\frac{1}{\sum_i w_i} \sum_{i=1,\dots, N} w_i \vvnorm{a_i-b_i}^2} \end{equation}

Let $g$ a rigid motion from the special Euclidean group $SE(3)$. To perform a comparison of $A$ and $B$ oblivious to rigid motions, we use the so-called least RMSD [82] :

The vertex weighted $\lrmsd$ is defined by

\begin{equation} \lrmsdvw{A}{B} = \min_{g\in SE(3)} \rmsdw{A}{g(B)}. \end{equation}

The rigid motion yielding the minimum is denoted $\lrmsdoptrm{}(A,B)$ or $\lrmsdoptrm{}$ for short. The weight of the $\lrmsdvw$ is defined as $\rmsdW{vw}{A}{B} = \sum_i w_i$.


Note that the celebrated $\lrmsd$ is the particular case of the previous with unit weights:

\begin{equation} \lrmsd{A}{B} = \lrmsdvw{A}{B} \text{ with } w_i \equiv 1, \forall i. \end{equation}

We arrive at the main definition, which combines individual RMSD:

Consider two structures $A$ and $B$ for which non-overlapping regions $ \{ \motifacc{i}, \motifbcc{i} \}_{i=1,\dots,m}$ have been identified Assume that a $\lrmsd$ has been computed for each pair $(\motifacc{i}, \motifbcc{i} )$. Let $w_i$ be the weights associated with an individual $\lrmsd$. The combined $\rmsd$ is defined by

\begin{equation} \rmsdcomb{A}{B} = \sqrt{ \sum_{i=1}^m \frac{w_i}{\sum_i w_i} \lrmsds{ \motifacc{i}}{\motifbcc{i}} }. \end{equation}


Alignments. In the SBL, alignments are defined in the package Alignment_engines. For the purpose of this package, we consider structural alignments in two settings:

  • For two chains with identical sequence, the alignment is trivial as common residues are retained and aligned.
  • If this is not the case, a structural alignment is computed with Apurva .

Chain mappings for quaternary structures. Consider the case where one wishes to compare chains from different quaternary structures. In that case, one needs to know the correspondence between the individual chains across these structures. To this end, we define:

A mapping between the m chains of n proteins is given in matrix form as follows:
  • one line provides an ordered list of the $m$ protein chains. Phrased differently, two lines define a matching between the $m$ chains of these two proteins.
  • one column corresponds the chains of the $n$ proteins to be compared.


Functionalities and scenarios

Functionalities. The executable $\text{\sblrmsdflexprot}$ computes a vertex weighted $\rmsd$ for homologous proteins. A pairwise comparison therefore requires an alignment, as discussed below.

Scenarios. In the following, we apply $\rmsd$ calculations to several scenarios, which we classify as a function of the protein type (a single polypeptide chain (PC), or several PCs):

  • For one protein with quaternary structure, we compare the polypeptide chains.

  • For two proteins, we distinguish two cases: if the proteins have a single chain, a combined $\rmsd$ is reported; if they have several chains, a matrix is reported.

  • for $N\geq 2$ proteins, we also distinguish two cases, depending on the number of polypeptide chains,

    • for two chains, a matrix is reported. Nb: entries indexed by names of the individual structures.

    • for several chains, one matrix per chain is reported. Nb: again the matrix entries are indexed by the names of structures (filenames). Note that in this case, a mapping is required – see Def. def-chain-mapping .

In the sequel, we detail run examples for each of the scenarios.

Comparing n chains of one protein

Input

sbl-rmsd-flexible-proteins.exe --pdb-file data/pdb_files/SFV-1RER.pdb --domain-labels data/spec_files/SFV.spec --chains ABC -d results/onepnc -v -l
The main options of the program $\sblrmsdflexconf$ are:
–pdb-file string: PDB file
–domain-labels: Spec. file defining the subdomain labels
–chains: Which chains to load
-d: Output directory


File Name

Description

PDB file

Protein from the PDB: 1RER

Spec. file

Defines the subdomains as residue sequence number ranges

Input files for the run described in section Comparing n chains of one protein .

Output

PreviewFile Name

Description

Log file

Log file containing high level information on the run of $\sblrmsdflexconf$

Distance matrix

Distance matrix containing $\rmsdcomb$ between each chain

Output files for the runs described in section Comparing n chains of one protein .

Comparing two polypeptide chains

Input

  • A PDB file for each chain.
  • One spec file per chain for the sub-domain definitions (as in MolecularSystemLabelsTraits) for each chain.
$SBL_DIR/Applications/Molecular_distances_flexible/src/Molecular_distances_flexible/build/sbl-rmsd-flexible-proteins.exe --pdb-file data/pdb_files/SFV-1RER.pdb --domain-labels data/spec_files/SFV.spec --chains A --pdb-file data/pdb_files/TBEV.pdb --domain-labels data/spec_files/TBEV.spec --chains A -d results/twoponec -v -l --allow-incomplete-chains -p 3
The main options of the program $\sblrmsdflexprot$ are:
–pdb-file string: PDB file
–domain-labels: Spec. file defining the subdomain labels
–chains: Which chains to load
-d: Output directory
–allow-incomplete-chains: Load the polypeptide chains even if they are incomplete
-p: Set oocupancy policy


The example input files are no different from the ones described in Comparing n chains of one protein .

Output

PreviewFile Name

Description

Log file

Log file containing high level information on the run of $\sblrmsdflexprot$

Label distances

$\lrmsd$ between each of the defined subdomain labels

Output files for the runs described in section Comparing two polypeptide chains .

Comparing n chains of two proteins

Input

  • A PDB file for each protein,
  • A spec file for each protein.
$SBL_DIR/Applications/Molecular_distances_flexible/src/Molecular_distances_flexible/build/sbl-rmsd-flexible-proteins.exe --pdb-file data/pdb_files/SFV-1RER.pdb --domain-labels data/spec_files/SFV.spec --chains ABC --pdb-file data/pdb_files/TBEV.pdb --domain-labels data/spec_files/TBEV.spec --chains ABC -d results/twoponec -v -l --allow-incomplete-chains -p 3

Output

PreviewFile Name

Description

Log file

Log file containing high level information on the run of $\sblrmsdflexprot$

Distance matrix

$\rmsdcomb$ between each of N chains

Output files for the runs described in section Comparing n chains of two proteins .

Comparing n polypeptide chains

Input

  • A PDB file for each chain.
  • One spec file per chain for the sub-domain definitions (as in MolecularSystemLabelsTraits) for each chain.
$SBL_DIR/Applications/Molecular_distances_flexible/src/Molecular_distances_flexible/build/sbl-rmsd-flexible-proteins.exe --pdb-file data/pdb_files/SFV-1RER.pdb --domain-labels data/spec_files/SFV.spec --chains A --pdb-file data/pdb_files/TBEV.pdb --domain-labels data/spec_files/TBEV.spec --chains A --pdb-file data/pdb_files/EFF1.pdb --domain-labels data/spec_files/EFF1.spec --chains A -d results/nponec -v -l --allow-incomplete-chains -p 3

Output

We report a unique distance matrix containing the $\text{\rmsdcomb}$ for the $\text{$\dbinom{N}{2}$}$ comparisons.

PreviewFile Name

Description

Log file

Log file containing high level information on the run of $\sblrmsdflexprot$

Distance matrix

$\rmsdcomb$ between each of N proteins (one chain per protein)

Output files for the runs described in section Comparing n polypeptide chains .

Comparing m chains of n proteins

Input

Consider now a set of $n$ proteins, each involving $m$ chains defining the common quaternary structure.

We assume a a mapping between chains is provided – see Def. def-chain-mapping.

$SBL_DIR/Applications/Molecular_distances_flexible/src/Molecular_distances_flexible/build/sbl-rmsd-flexible-proteins.exe --pdb-file data/pdb_files/SFV-1RER.pdb --domain-labels data/spec_files/SFV.spec --chains ABC --pdb-file data/pdb_files/TBEV.pdb --domain-labels data/spec_files/TBEV.spec --chains ABC --pdb-file data/pdb_files/EFF1.pdb --domain-labels data/spec_files/EFF1.spec --chains ABC --chain-mapping data/mapping.txt -d results/npnc -v -l --allow-incomplete-chains -p 3
The main options of the program $\sblrmsdflexprot$ are:
–chain-mapping: The mapping between the chains which should be compared


File Name

Description

Chain mapping file

The format is describe above

Input files for the run described in section Comparing m chains of n proteins .

Output

We report $\binom{n}{2}(m+1)$ numbers, corresponding to $m+1$ distance matrices, stored in $m+1$ files:

  • For each column of the mapping, we report in a file the $\binom{n}{2}$ numbers corresponding to pairwise comparisons between the chains identified by the column.
  • The last file contains weighted lRMSD for all pairs of proteins. For a pair of proteins $P_i$ and $P_j$, we also mix the weighted lRMSD obtained by comparing their chains (pairwise).

PreviewFile Name

Description

Log file

Log file containing high level information on the run of $\sblrmsdflexprot$

Distance matrix

One file per chain mapping. Contains the distance matrix for the N proteins and a given chain.

Output files for the runs described in section Comparing m chains of n proteins .

Combined RMSD for conformations

Pre-requisites

Functionalities. The executable $\text{\sblrmsdflexconf}$ computes the $\text{\rmsdcomb}$ (vertex weighted) for conformations of a given protein. Note that a structural alignment is no longer necessary, instead the trivial identity alignment is computed. It behaves as follows:

  • Compares N conformations of the same molecule, on a pairwise basis. That is, for each pair, the $\lrmsd$ of specified structural units are computed, and combined into a $\rmsdcomb$.
  • If several chains are loaded (per conformation), a chain mapping is required – see Definition def-chain-mapping.

Note that the run scenarios are the same as the previous executable (see Pre-requisites). For example runs, the user is refered to Combined RMSD for proteins .

Combined RMSD for structural motifs

Pre-requisites

Structural motifs. As explained in the companion package Structural_motifs , structural motifs are regions showing a structural conservation higher than that of the structures defining them. For two structures, a motif is defined by two sets of a.a. in one-to-one correspondence – that is one set of a.a. on each structure.

Motif graph for overlapping motifs. When several motifs exist for two structures, an important question is to handle them coherently. Since motifs may overlap, we define:

(Motif graph) The motif graph of a list of motifs $\{ (\motifa{i}, \motifb{i}) \}_{i=1,\dots,p}$ is defined as follows: its node set is the union of the particles $A$ and $B$; its edge set is the union of two types of edges:
  • matching edges: the edges associated with the matchings defined by the motifs. NB: such edges are counted without multiplicity, that is, a matching edge present in several motifs is counted once.
  • motif edges: edges defining a path connecting all amino acids in a motif.
Consider a connected component (c.c.) of the motif graph. Restricting each c.c. to each structure yields two subgraphs. The set of all such subgraphs is denoted $\{ \motifacc{i}, \motifbcc{i} \}_{i=1,\dots,m}$.


Consider now the case where motifs have been defined for the two structures $A$ and $B$. We wish to compare $A$ and $B$ exploiting the information yielded by the connected components of the motif graph.

Consider the i-th c.c. of the motif graph. Let $e_i$ be the number of matching edges of this c.c. As usual, let $g(b_j)$ the position of atom $b_j$ from $\motifbcc{i}$ matched with atom $a_j$ from $\motifacc{i}$, upon applying a rigid motion $g$. We define:

The edge weighted $\lrmsdew$ of the i-th c.c. of the motif graph is defined by

\begin{align} \lrmsdew{\motifacc{i}}{\motifbcc{i}} &= \min_{g\in SE(3)} \sqrt{\frac{1}{e_i} \sum_{j=1}^{e_i} \vvnorm{a_j - g(b_j)}^2 } \end{align}

The rigid motion yielding the minimum is denoted $\lrmsdoptrm{i}$. The weight of the $\lrmsdew$ is defined as $\rmsdW{ew}{\motifacc{i}}{\motifbcc{i}} = e_i$.


This definition recalled, we note that the $\rmsdcomb$ from Eq. def-rmsd-comb generalizes, using edge weighted rather than vertex weighted $\rmsd$.

Input

The input requires two structures (PDB files) and a specification of motifs. The motifs are defined as an identifier followed by a list of aligned residues (example file below).

sbl-rmsd-flexible-motifs.exe --pdb-file data/pdb_files/SFV-1RER.pdb --chains A --pdb-file data/pdb-files/RVFV.pdf --motif-file data/motifs.txt --allow-incomplete-chains -p 3 -d results/motifs -v -l
The main options of the program $\sblrmsdflexmot$ are:
–pdb-file string: PDB file
–motif-file: Spec. file defining the motifs
–chains: Which chains to load


File Name

Description

PDB file

Protein from the PDB: 1RER

Motif file

Defines the motifs as lists of aligned residues

Input files for the run described in section Input .

Output

PreviewFile Name

Description

Log file

Log file containing high level information on the run of $\sblrmsdflexmot$

Motif graph

XML archive containing all the residues in the motif graph components

Output files for the runs described in section Input .

Programmer's Workflow

Pre-requisites

The implementation rationale behind the three executables is straightforward. Each executable has a workflow consisiting of one loader (SBL::IO::T_Protein_representation_loader, see Protein_representation) and one module. There are two different modules used among the three workflows :

  • SBL::Modules::T_Subdomain_comparator_module< ModuleTraits > is used in the $\sblrmsdflexconf$ and $\sblrmsdflexprot$ executables.
    It simply instantiates the pairwise comparisons between all the specified labels. In the former case, it uses the trivial identity alignment (from the ModuleTraits class), in the latter case, it uses Apurva.
  • SBL::Modules::T_RMSD_comb_for_motifs_module< ModuleTraits > is used in $\sblrmsdflexmot$. See Molecular_distances for more details.

Workflow example

The workflows are extremely basic. One out of the three is displayed below as an example.

T_Local_structural_comparison_workflow: