Structural Bioinformatics Library Template C++ / Python API for developping structural bioinformatics applications.
User Manual

# Molecular_distances_flexible

Authors: F. Cazals and R. Tetley

# Goals

Goal. The root mean square deviation (RMSD) and the least RMSD which are provided in the companion package Molecular_distances are two widely used similarity measures in structural bioinformatics. Yet, they stem from global comparisons, possibly obliterating locally conserved motifs.

To foster our understanding of molecular flexibility, this package, based upon developments presented in [29], provides so-called combined , which mixes independent measures, each computed with its own optimal rigid motion.

The structural units which may be domains, subdomain, SSE, etc, are defined using the machinery from the package MolecularSystemLabelsTraits

The combined can be used to compare (quaternary) structures based on motifs defined from the sequence (domains, SSE), or to compare structures based on structural motifs yielded by local structural alignment methods. To handle these situations, the following three executables are provided:

• to compare conformations of a molecule–which can be done without computing a non-trivial alignment,
• to compare homologous proteins,
• to compare structural motifs – see also the companion package Structural_motifs
As noticed above, regions are defined using the labels, as indicated in the package MolecularSystemLabelsTraits. In the sequel, it is assumed that regions involve 4 or more residues. If not, a warning is issued, and the output for the corresponding comparisons is set to -1.

Modes. As in the package Molecular_distances, four modes are provided:

• Calpha atoms: distances are computed on Calpha.
• Backbone atoms: only Calpha, C and N atoms.
• Heavy atoms: distances are computed on all atoms but hydrogen atoms.
• All atoms: distances are computed on all atoms, including hydrogen atoms.

# Combined RMSD for proteins

## Pre-requisites

Vertex weighted . Consider two point sets and of size . Naturally, each point corresponds to an atom or pseudo-atom – which we generically call a particle.

Also consider a set of positive weights , meant to stress the importance of certain points. The weighted reads as

Let a rigid motion from the special Euclidean group . To perform a comparison of and oblivious to rigid motions, we use the so-called least RMSD [82] :

The vertex weighted is defined by

The rigid motion yielding the minimum is denoted or for short. The weight of the is defined as .

Note that the celebrated is the particular case of the previous with unit weights:

We arrive at the main definition, which combines individual RMSD:

Consider two structures and for which non-overlapping regions have been identified Assume that a has been computed for each pair . Let be the weights associated with an individual . The combined is defined by

Alignments. In the SBL, alignments are defined in the package Alignment_engines. For the purpose of this package, we consider structural alignments in two settings:

• For two chains with identical sequence, the alignment is trivial as common residues are retained and aligned.
• If this is not the case, a structural alignment is computed with Apurva .

Chain mappings for quaternary structures. Consider the case where one wishes to compare chains from different quaternary structures. In that case, one needs to know the correspondence between the individual chains across these structures. To this end, we define:

A mapping between the m chains of n proteins is given in matrix form as follows:
• one line provides an ordered list of the protein chains. Phrased differently, two lines define a matching between the chains of these two proteins.
• one column corresponds the chains of the proteins to be compared.

## Functionalities and scenarios

Functionalities. The executable computes a vertex weighted for homologous proteins. A pairwise comparison therefore requires an alignment, as discussed below.

Scenarios. In the following, we apply calculations to several scenarios, which we classify as a function of the protein type (a single polypeptide chain (PC), or several PCs):

• For one protein with quaternary structure, we compare the polypeptide chains.

• For two proteins, we distinguish two cases: if the proteins have a single chain, a combined is reported; if they have several chains, a matrix is reported.

• for proteins, we also distinguish two cases, depending on the number of polypeptide chains,

• for two chains, a matrix is reported. Nb: entries indexed by names of the individual structures.

• for several chains, one matrix per chain is reported. Nb: again the matrix entries are indexed by the names of structures (filenames). Note that in this case, a mapping is required – see Def. def-chain-mapping .

In the sequel, we detail run examples for each of the scenarios.

## Comparing n chains of one protein

### Input

sbl-rmsd-flexible-proteins.exe --pdb-file data/pdb_files/SFV-1RER.pdb --domain-labels data/spec_files/SFV.spec --chains ABC -d results/onepnc -v -l

The main options of the program are:
–pdb-file string: PDB file
–domain-labels: Spec. file defining the subdomain labels
–chains: Which chains to load
-d: Output directory

 File Name Description PDB file Protein from the PDB: 1RER Spec. file Defines the subdomains as residue sequence number ranges
Input files for the run described in section Comparing n chains of one protein .

### Output

 Preview File Name Description Log file Log file containing high level information on the run of Distance matrix Distance matrix containing between each chain
Output files for the runs described in section Comparing n chains of one protein .

## Comparing two polypeptide chains

### Input

• A PDB file for each chain.
• One spec file per chain for the sub-domain definitions (as in MolecularSystemLabelsTraits) for each chain.


### Output

 Preview File Name Description Log file Log file containing high level information on the run of Distance matrix between each of N chains
Output files for the runs described in section Comparing n chains of two proteins .

## Comparing n polypeptide chains

### Input

• A PDB file for each chain.
• One spec file per chain for the sub-domain definitions (as in MolecularSystemLabelsTraits) for each chain.
$SBL_DIR/Applications/Molecular_distances_flexible/src/Molecular_distances_flexible/build/sbl-rmsd-flexible-proteins.exe --pdb-file data/pdb_files/SFV-1RER.pdb --domain-labels data/spec_files/SFV.spec --chains A --pdb-file data/pdb_files/TBEV.pdb --domain-labels data/spec_files/TBEV.spec --chains A --pdb-file data/pdb_files/EFF1.pdb --domain-labels data/spec_files/EFF1.spec --chains A -d results/nponec -v -l --allow-incomplete-chains -p 3  ### Output We report a unique distance matrix containing the for the comparisons.  Preview File Name Description Log file Log file containing high level information on the run of Distance matrix between each of N proteins (one chain per protein) Output files for the runs described in section Comparing n polypeptide chains . ## Comparing m chains of n proteins ### Input Consider now a set of proteins, each involving chains defining the common quaternary structure. We assume a a mapping between chains is provided – see Def. def-chain-mapping. $SBL_DIR/Applications/Molecular_distances_flexible/src/Molecular_distances_flexible/build/sbl-rmsd-flexible-proteins.exe --pdb-file data/pdb_files/SFV-1RER.pdb --domain-labels data/spec_files/SFV.spec --chains ABC --pdb-file data/pdb_files/TBEV.pdb --domain-labels data/spec_files/TBEV.spec --chains ABC --pdb-file data/pdb_files/EFF1.pdb --domain-labels data/spec_files/EFF1.spec --chains ABC --chain-mapping data/mapping.txt -d results/npnc -v -l --allow-incomplete-chains -p 3

The main options of the program are:
–chain-mapping: The mapping between the chains which should be compared

 File Name Description Chain mapping file The format is describe above
Input files for the run described in section Comparing m chains of n proteins .

### Output

We report numbers, corresponding to distance matrices, stored in files:

• For each column of the mapping, we report in a file the numbers corresponding to pairwise comparisons between the chains identified by the column.
• The last file contains weighted lRMSD for all pairs of proteins. For a pair of proteins and , we also mix the weighted lRMSD obtained by comparing their chains (pairwise).

 Preview File Name Description Log file Log file containing high level information on the run of Distance matrix One file per chain mapping. Contains the distance matrix for the N proteins and a given chain.
Output files for the runs described in section Comparing m chains of n proteins .

# Combined RMSD for conformations

## Pre-requisites

Functionalities. The executable computes the (vertex weighted) for conformations of a given protein. Note that a structural alignment is no longer necessary, instead the trivial identity alignment is computed. It behaves as follows:

• Compares N conformations of the same molecule, on a pairwise basis. That is, for each pair, the of specified structural units are computed, and combined into a .
• If several chains are loaded (per conformation), a chain mapping is required – see Definition def-chain-mapping.

Note that the run scenarios are the same as the previous executable (see Pre-requisites). For example runs, the user is refered to Combined RMSD for proteins .

# Combined RMSD for structural motifs

## Pre-requisites

Structural motifs. As explained in the companion package Structural_motifs , structural motifs are regions showing a structural conservation higher than that of the structures defining them. For two structures, a motif is defined by two sets of a.a. in one-to-one correspondence – that is one set of a.a. on each structure.

Motif graph for overlapping motifs. When several motifs exist for two structures, an important question is to handle them coherently. Since motifs may overlap, we define:

(Motif graph) The motif graph of a list of motifs is defined as follows: its node set is the union of the particles and ; its edge set is the union of two types of edges:
• matching edges: the edges associated with the matchings defined by the motifs. NB: such edges are counted without multiplicity, that is, a matching edge present in several motifs is counted once.
• motif edges: edges defining a path connecting all amino acids in a motif.
Consider a connected component (c.c.) of the motif graph. Restricting each c.c. to each structure yields two subgraphs. The set of all such subgraphs is denoted .

Consider now the case where motifs have been defined for the two structures and . We wish to compare and exploiting the information yielded by the connected components of the motif graph.

Consider the i-th c.c. of the motif graph. Let be the number of matching edges of this c.c. As usual, let the position of atom from matched with atom from , upon applying a rigid motion . We define:

The edge weighted of the i-th c.c. of the motif graph is defined by

The rigid motion yielding the minimum is denoted . The weight of the is defined as .

This definition recalled, we note that the from Eq. def-rmsd-comb generalizes, using edge weighted rather than vertex weighted .

## Input

The input requires two structures (PDB files) and a specification of motifs. The motifs are defined as an identifier followed by a list of aligned residues (example file below).

sbl-rmsd-flexible-motifs.exe --pdb-file data/pdb_files/SFV-1RER.pdb --chains A --pdb-file data/pdb-files/RVFV.pdf --motif-file data/motifs.txt --allow-incomplete-chains -p 3 -d results/motifs -v -l

The main options of the program are:
–pdb-file string: PDB file
–motif-file: Spec. file defining the motifs
–chains: Which chains to load

 File Name Description PDB file Protein from the PDB: 1RER Motif file Defines the motifs as lists of aligned residues
Input files for the run described in section Input .

## Output

 Preview File Name Description Log file Log file containing high level information on the run of Motif graph XML archive containing all the residues in the motif graph components
Output files for the runs described in section Input .

# Programmer's Workflow

## Pre-requisites

The implementation rationale behind the three executables is straightforward. Each executable has a workflow consisiting of one loader (SBL::IO::T_Protein_representation_loader, see Protein_representation) and one module. There are two different modules used among the three workflows :

• SBL::Modules::T_Subdomain_comparator_module< ModuleTraits > is used in the and executables.
It simply instantiates the pairwise comparisons between all the specified labels. In the former case, it uses the trivial identity alignment (from the ModuleTraits class), in the latter case, it uses Apurva.
• SBL::Modules::T_RMSD_comb_for_motifs_module< ModuleTraits > is used in . See Molecular_distances for more details.

## Workflow example

The workflows are extremely basic. One out of the three is displayed below as an example.

T_Local_structural_comparison_workflow: