Structural Bioinformatics Library
Template C++ / Python API for developing structural bioinformatics applications.
Authors: F. Cazals and R. Tetley
Goals. The root mean square deviation (RMSD) and the least RMSD (lRMSD), both provided in the companion package Molecular_distances, are two widely used similarity measures in structural bioinformatics. Yet they stem from global comparisons, which may obliterate locally conserved motifs.
To foster our understanding of molecular flexibility, this package, based upon developments presented in [40], provides the so-called combined RMSD, which mixes independent lRMSD measures, each computed with its own optimal rigid motion.
The structural units, which may be domains, subdomains, SSE, etc., are defined using the machinery from the package MolecularSystemLabelsTraits.
The combined RMSD can be used to compare (quaternary) structures based on motifs defined from the sequence (domains, SSE), or to compare structures based on structural motifs yielded by local structural alignment methods. To handle these situations, the following three executables are provided:
Modes. For the previous programs and as for the package Molecular_distances, four modes are provided:
As noted above, regions are defined using the labels, as indicated in the package MolecularSystemLabelsTraits.
Two regions with the same label specification may not contain the same number of amino-acids, which typically happens in two cases:
In any case, for each label, a local alignment is run to ensure a 1-1 correspondence between a.a. In the SBL, alignments are defined in the package Alignment_engines.
We actually consider structural alignments in two settings:
We note that for PDB files containing structures of the same polypeptide chain (same SEQRES section), the numberings of amino acids in the ATOM sections are expected to be aligned. See primary-sequences-and-the-pdb-format for details.
The executable corresponding to this case is .
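When the numberings are aligned as described above, a one-to-one correspondence between a.a. can be pictured as pairing residues that carry the same number in both structures. The following is a minimal sketch, assuming residues are identified by integer sequence numbers and that residues missing from either structure are simply skipped; the function name and the skipping policy are illustrative assumptions, not the package's implementation.

```python
def identity_alignment(resids_a, resids_b):
    """Pair residues sharing the same sequence number in both structures.

    resids_a, resids_b: iterables of integer residue numbers (hypothetical input).
    Returns a sorted list of (residue_in_a, residue_in_b) pairs.
    """
    common = set(resids_a) & set(resids_b)
    return [(r, r) for r in sorted(common)]


# Residue 1 is missing from the second structure, residue 4 from the first one.
pairs = identity_alignment([1, 2, 3, 5], [2, 3, 4, 5])
print(pairs)
```

Only the residues present in both structures contribute to the subsequent distance calculation in this sketch.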
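Vertex weighted lRMSD. Consider two point sets $A = \{a_i\}$ and $B = \{b_i\}$ of size $N$. Naturally, each point corresponds to an atom or pseudo-atom, which we generically call a particle.
Also consider a set of positive weights $W = \{w_i\}_{i=1,\dots,N}$, meant to stress the importance of certain points. The weighted RMSD reads as
$$\text{RMSD}_{vw}(A, B) = \sqrt{\frac{\sum_i w_i \| a_i - b_i \|^2}{\sum_i w_i}}.$$
Let $g$ be a rigid motion from the special Euclidean group $SE(3)$. To perform a comparison of $A$ and $B$ oblivious to rigid motions, we use the so-called least RMSD [103]:
$$\text{lRMSD}_{vw}(A, B) = \min_{g \in SE(3)} \text{RMSD}_{vw}(A, g(B)).$$
The rigid motion yielding the minimum is denoted $g^{OPT}(A, B)$, or $g^{OPT}$ for short. The weight of the $\text{lRMSD}_{vw}$ is defined as $\sum_i w_i$.
Note that the celebrated lRMSD is the particular case of the previous one with unit weights:
$$\text{lRMSD}(A, B) = \min_{g \in SE(3)} \sqrt{\frac{1}{N} \sum_i \| a_i - g(b_i) \|^2}.$$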
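To make the role of the weights concrete, here is a minimal sketch of the weighted RMSD for two point sets in a fixed position. It assumes the usual weighted form (squared distances weighted and normalized by the sum of the weights) and does not perform the minimization over rigid motions that defines the least RMSD; the function name is illustrative.

```python
import math


def weighted_rmsd(A, B, weights):
    """Weighted RMSD between two equally sized lists of 3D points.

    No superposition is performed: the points are compared as given.
    """
    if not (len(A) == len(B) == len(weights)):
        raise ValueError("A, B and weights must have the same length")
    if any(w <= 0 for w in weights):
        raise ValueError("weights must be positive")
    # Weighted sum of squared distances between matched points.
    num = sum(w * sum((x - y) ** 2 for x, y in zip(a, b))
              for a, b, w in zip(A, B, weights))
    return math.sqrt(num / sum(weights))


A = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
B = [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0)]
print(weighted_rmsd(A, B, [1.0, 1.0]))  # unit weights: the plain RMSD
print(weighted_rmsd(A, B, [3.0, 1.0]))  # the coinciding first point is stressed
```

Stressing the pair of coinciding points lowers the value, which illustrates how weights shift the emphasis between particles.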
We arrive at the main definition, which combines individual RMSD:
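As an illustration of how individual values might be combined, the sketch below assumes that the combined RMSD is the weighted quadratic mean of the per-region lRMSD values, each region contributing in proportion to the weight of its lRMSD. This form is an assumption made for illustration and should be checked against Eq. def-rmsd-comb; the function name is hypothetical.

```python
import math


def combined_rmsd(lrmsds, weights):
    """Combine per-region lRMSD values into a single number.

    Assumption: sqrt( sum_i (W_i / sum_j W_j) * lRMSD_i^2 ),
    where W_i is the weight attached to the i-th lRMSD.
    """
    if len(lrmsds) != len(weights) or not lrmsds:
        raise ValueError("need one weight per lRMSD value")
    total = sum(weights)
    return math.sqrt(sum(w / total * d * d for d, w in zip(lrmsds, weights)))


# Two regions: a large rigid one (weight 30) and a small flexible one (weight 10).
value = combined_rmsd([1.0, 3.0], [30.0, 10.0])
print(value)
```

Because each region is superposed with its own rigid motion before combining, a small region with a large deviation does not dominate the result the way it could in a single global superposition.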
Chain mappings for quaternary structures. Consider the case where one wishes to compare chains from different quaternary structures. In that case, one needs to know the correspondence between the individual chains across these structures. To this end, we define:
Functionalities. The executables sbl-rmsd-flexible-proteins-kpax.exe and sbl-rmsd-flexible-proteins-apurva.exe compute a vertex weighted combined RMSD for homologous proteins. A pairwise comparison therefore requires an alignment, as discussed below.
Scenarios. The scenarios discussed below depend on
Options to compare polypeptide chains. The two programs accept the same options, the main ones being:
Options to compare conformations. As noted above, the identity alignment between residues can be used. The options are:
The reader is referred to the Jupyter notebook (section Jupyter demo) for illustrations.
This case is that of a protein with quaternary structure, in which case we compare the polypeptide chains.
Input.
Output. We report the matrix of distances.
Consider two proteins, with a focus on one chain for each.
Input.
Output. The combined RMSD is reported.
$SBL_DIR/Applications/Molecular_distances_flexible/src/Molecular_distances_flexible/build/sbl-rmsd-flexible-proteins.exe --pdb-file data/pdb_files/SFV-1RER.pdb --domain-labels data/spec_files/SFV.spec --chains A --pdb-file data/pdb_files/TBEV.pdb --domain-labels data/spec_files/TBEV.spec --chains A -d results/twoponec -v -l --allow-incomplete-chains -p 3
Input. Consider two proteins, each with n chains, specified as follows:
Output. A matrix of combined RMSD values is reported, with one entry for each pair of chains.
$SBL_DIR/Applications/Molecular_distances_flexible/src/Molecular_distances_flexible/build/sbl-rmsd-flexible-proteins.exe --pdb-file data/pdb_files/SFV-1RER.pdb --domain-labels data/spec_files/SFV.spec --chains ABC --pdb-file data/pdb_files/TBEV.pdb --domain-labels data/spec_files/TBEV.spec --chains ABC -d results/twoponec -v -l --allow-incomplete-chains -p 3
Input.
Output. The matrix of all pairwise distances is reported, with one entry for each pair of chains.
$SBL_DIR/Applications/Molecular_distances_flexible/src/Molecular_distances_flexible/build/sbl-rmsd-flexible-proteins.exe --pdb-file data/pdb_files/SFV-1RER.pdb --domain-labels data/spec_files/SFV.spec --chains A --pdb-file data/pdb_files/TBEV.pdb --domain-labels data/spec_files/TBEV.spec --chains A --pdb-file data/pdb_files/EFF1.pdb --domain-labels data/spec_files/EFF1.spec --chains A -d results/nponec -v -l --allow-incomplete-chains -p 3
Input. Consider now a set of proteins, each involving chains defining the common quaternary structure.
We assume a mapping between chains is provided (see Def. def-chain-mapping). The corresponding options are the following:
Output. We report numbers, corresponding to distance matrices, stored in files:
$SBL_DIR/Applications/Molecular_distances_flexible/src/Molecular_distances_flexible/build/sbl-rmsd-flexible-proteins.exe --pdb-file data/pdb_files/SFV-1RER.pdb --domain-labels data/spec_files/SFV.spec --chains ABC --pdb-file data/pdb_files/TBEV.pdb --domain-labels data/spec_files/TBEV.spec --chains ABC --pdb-file data/pdb_files/EFF1.pdb --domain-labels data/spec_files/EFF1.spec --chains ABC --chain-mapping data/mapping.txt -d results/npnc -v -l --allow-incomplete-chains -p 3
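Since the command lines above differ only in the list of structures and in the optional chain mapping, they can be assembled programmatically. The helper below is a hypothetical convenience sketch: its name and signature are not part of the SBL, and it only uses the flags that appear in the commands above, without asserting anything about their semantics.

```python
import shlex


def build_flexible_proteins_cmd(exe, structures, out_dir, chain_mapping=None):
    """Return the argument list for one run (hypothetical helper).

    structures: list of (pdb_file, spec_file, chains) triples, one per protein.
    """
    args = [exe]
    for pdb_file, spec_file, chains in structures:
        args += ["--pdb-file", pdb_file,
                 "--domain-labels", spec_file,
                 "--chains", chains]
    if chain_mapping is not None:
        args += ["--chain-mapping", chain_mapping]
    # Trailing options copied verbatim from the example commands.
    args += ["-d", out_dir, "-v", "-l", "--allow-incomplete-chains", "-p", "3"]
    return args


structures = [
    ("data/pdb_files/SFV-1RER.pdb", "data/spec_files/SFV.spec", "ABC"),
    ("data/pdb_files/TBEV.pdb", "data/spec_files/TBEV.spec", "ABC"),
    ("data/pdb_files/EFF1.pdb", "data/spec_files/EFF1.spec", "ABC"),
]
cmd = build_flexible_proteins_cmd("sbl-rmsd-flexible-proteins.exe", structures,
                                  "results/npnc", chain_mapping="data/mapping.txt")
print(shlex.join(cmd))
# To launch the run: subprocess.run(cmd, check=True)
```

Passing the arguments as a list to subprocess.run avoids shell quoting issues when file names contain spaces.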
Functionalities. The executable computes the (vertex weighted) combined RMSD for conformations of a given protein. Note that a structural alignment is no longer necessary; instead, the trivial identity alignment is used. It behaves as follows:
Note that the run scenarios are the same as for the previous executables (see Pre-requisites). For example runs, the user is referred to Combined RMSD for proteins.
Structural motifs. As explained in the companion package Structural_motifs, structural motifs are regions showing a structural conservation higher than that of the structures defining them. For two structures, a motif is defined by two sets of a.a. in one-to-one correspondence, that is, one set of a.a. on each structure.
Motif graph for overlapping motifs. When several motifs exist for two structures, an important question is to handle them coherently. Since motifs may overlap, we define:
Consider now the case where motifs have been defined for the two structures $A$ and $B$. We wish to compare $A$ and $B$ exploiting the information yielded by the connected components of the motif graph.
Consider the i-th c.c. of the motif graph. Let $e_i$ be the number of matching edges of this c.c. As usual, let $g(b_j)$ be the position of atom $b_j$ from $B$ matched with atom $a_j$ from $A$, upon applying a rigid motion $g$. We define:
$$\text{lRMSD}_{ew}(C_i) = \min_{g \in SE(3)} \sqrt{\frac{1}{e_i} \sum_{j=1}^{e_i} \| a_j - g(b_j) \|^2}.$$
The rigid motion yielding the minimum is denoted $g_i^{OPT}$. The weight of the $\text{lRMSD}_{ew}$ is defined as $e_i$.
This definition recalled, we note that the combined RMSD from Eq. def-rmsd-comb generalizes, using edge weighted rather than vertex weighted lRMSD.
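The grouping of overlapping motifs into connected components can be sketched with a union-find structure. The sketch assumes that each motif is a list of aligned residue pairs, that the graph nodes are the residues of both structures, and that all residues of a motif end up in the same component; these modeling choices and the function name are illustrative assumptions, not the package's data structures.

```python
from collections import defaultdict


def motif_components(motifs):
    """Group the matching edges of overlapping motifs by connected component.

    motifs: list of motifs, each a list of (res_in_A, res_in_B) aligned pairs.
    Returns a sorted list of sorted lists of matching edges, one per component.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    for motif in motifs:
        if not motif:
            continue
        anchor = ("A", motif[0][0])
        for res_a, res_b in motif:
            union(anchor, ("A", res_a))         # residues of one motif stay together
            union(("A", res_a), ("B", res_b))   # matching edge

    components = defaultdict(set)
    for motif in motifs:
        for res_a, res_b in motif:
            components[find(("A", res_a))].add((res_a, res_b))
    return sorted(sorted(edges) for edges in components.values())


motifs = [[(10, 110), (11, 111)], [(11, 111), (12, 112)], [(50, 150)]]
for edges in motif_components(motifs):
    print(len(edges), edges)
```

In this toy input the first two motifs share the pair (11, 111) and therefore merge into one component with three matching edges, while the third motif forms a component of its own.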
The input requires two structures (PDB files) and a specification of motifs. The motifs are defined as an identifier followed by a list of aligned residues (example file below).
sbl-rmsd-flexible-motifs.exe --pdb-file data/pdb_files/SFV-1RER.pdb --chains A --pdb-file data/pdb_files/RVFV.pdb --motif-file data/motifs.txt --allow-incomplete-chains -p 3 -d results/motifs -v -l
The implementation rationale behind the three executables is straightforward. Each executable has a workflow consisting of one loader (SBL::IO::T_Protein_representation_loader, see Protein_representation) and one module. Two different modules are used among the three workflows:
The workflows are extremely basic. One out of the three is displayed below as an example.
T_Local_structural_comparison_workflow:
We note in passing that the implementation of is done as follows: this package i.e. Molecular_distances_flexible defines in the file Structure_for_kpax.hpp the structure which is used to instantiate from package Iterative_alignment; in turn, uses alignment data structures from Alignment_engines.
See the following jupyter notebook:
We illustrate calculations involving the so-called combined RMSD or RMSD Comb. As a test case, we use class II fusion proteins, decomposing each polypeptide chain into 23 regions (see the preprint by Tetley et al.).
import os
import sys
import pdb
from SBL import SBL_pytools
from SBL_pytools import SBL_pytools as sblpyt
odir = "results-new"
sdirs = ["n-pc-one-protein", "two-pc", "n-pc-two-proteins", "n-pc", "m-pc-n-proteins", "motifs"]
for sdir in sdirs:
    w = "%s/%s" % (odir, sdir)
    os.makedirs(w, exist_ok=True)  # same effect as mkdir -p
def cmp_proteins_with_aligner_n_pc_one_protein(odir, aligner, aligner_tag):
    osdir = "n-pc-one-protein"
    cmd = "%s --pdb-file data/pdb_files/SFV-1RER.pdb --domain-labels data/spec_files/SFV.spec --chains ABC -d %s/%s -v -l" % (aligner,odir,osdir)
    print("Running %s" % cmd)
    os.system(cmd)
    ofn = "%s/%s/sbl-rmsd-flexible-proteins-%s__weighted_lrmsd.txt" % (odir,osdir,aligner_tag)
    odir_osdir = "%s/%s" % (odir,osdir)
    file_suffix = "%s__weighted_lrmsd.txt" % aligner_tag
    sblpyt.show_text_file(file_suffix, odir_osdir)

aligner_kpax = "sbl-rmsd-flexible-proteins-kpax.exe" # kpax
cmp_proteins_with_aligner_n_pc_one_protein(odir, aligner_kpax, "kpax")
aligner_apurva = "sbl-rmsd-flexible-proteins-apurva.exe" # apurva
cmp_proteins_with_aligner_n_pc_one_protein(odir, aligner_apurva, "apurva")
def cmp_proteins_with_aligner_two_pc(odir, aligner, aligner_tag):
    osdir = "two-pc"
    cmd = "%s --pdb-file data/pdb_files/SFV-1RER.pdb --domain-labels data/spec_files/SFV.spec --chains A --pdb-file data/pdb_files/TBEV.pdb --domain-labels data/spec_files/TBEV.spec --chains A -d %s/%s -v -l --allow-incomplete-chains -p 3" % (aligner,odir,osdir)
    print("Running %s" % cmd)
    os.system(cmd)
    odir_osdir = "%s/%s" % (odir,osdir)
    file_suffix = "%s__labels_lrmsd.txt" % aligner_tag
    sblpyt.show_text_file(file_suffix, odir_osdir)

aligner_kpax = "sbl-rmsd-flexible-proteins-kpax.exe" # kpax
cmp_proteins_with_aligner_two_pc(odir, aligner_kpax, "kpax")
aligner_apurva = "sbl-rmsd-flexible-proteins-apurva.exe" # apurva
cmp_proteins_with_aligner_two_pc(odir, aligner_apurva, "apurva")
def cmp_proteins_with_aligner_n_pc_two_proteins(odir, aligner, aligner_tag):
    osdir = "n-pc-two-proteins"
    cmd = "%s --pdb-file data/pdb_files/SFV-1RER.pdb --domain-labels data/spec_files/SFV.spec --chains ABC --pdb-file data/pdb_files/TBEV.pdb --domain-labels data/spec_files/TBEV.spec --chains ABC -d %s/%s -v -l --allow-incomplete-chains -p 3" % (aligner,odir,osdir)
    print("Running %s" % cmd)
    os.system(cmd)
    odir_osdir = "%s/%s" % (odir,osdir)
    file_suffix = "%s__weighted_lrmsd.txt" % aligner_tag
    sblpyt.show_text_file(file_suffix, odir_osdir)

aligner_kpax = "sbl-rmsd-flexible-proteins-kpax.exe" # kpax
cmp_proteins_with_aligner_n_pc_two_proteins(odir, aligner_kpax, "kpax")
aligner_apurva = "sbl-rmsd-flexible-proteins-apurva.exe" # apurva
cmp_proteins_with_aligner_n_pc_two_proteins(odir, aligner_apurva, "apurva")
def cmp_proteins_with_aligner_n_pc(odir,aligner, aligner_tag):
    osdir = "n-pc"
    cmd = "%s --pdb-file data/pdb_files/SFV-1RER.pdb --domain-labels data/spec_files/SFV.spec --chains A --pdb-file data/pdb_files/TBEV.pdb --domain-labels data/spec_files/TBEV.spec --chains A --pdb-file data/pdb_files/EFF1.pdb --domain-labels data/spec_files/EFF1.spec --chains A -d %s/%s -v -l --allow-incomplete-chains -p 3" % (aligner,odir,osdir)
    print("Running %s" % cmd)
    os.system(cmd)
    odir_osdir = "%s/%s" % (odir,osdir)
    file_suffix = "%s__weighted_lrmsd.txt" % aligner_tag
    sblpyt.show_text_file(file_suffix, odir_osdir)

aligner_kpax = "sbl-rmsd-flexible-proteins-kpax.exe" # kpax
cmp_proteins_with_aligner_n_pc(odir,aligner_kpax, "kpax")
aligner_apurva = "sbl-rmsd-flexible-proteins-apurva.exe" # apurva
cmp_proteins_with_aligner_n_pc(odir,aligner_apurva, "apurva")
def cmp_proteins_with_aligner_m_pc_n_chains(odir,aligner, aligner_tag):
    osdir = "m-pc-n-proteins"
    cmd = "%s --pdb-file data/pdb_files/SFV-1RER.pdb --domain-labels data/spec_files/SFV.spec --chains ABC --pdb-file data/pdb_files/TBEV.pdb --domain-labels data/spec_files/TBEV.spec --chains ABC --pdb-file data/pdb_files/EFF1.pdb --domain-labels data/spec_files/EFF1.spec --chains ABC --chain-mapping data/mapping.txt -d %s/%s -v -l --allow-incomplete-chains -p 3" % (aligner,odir,osdir)
    print("Running %s" % cmd)
    os.system(cmd)
    odir_osdir = "%s/%s" % (odir,osdir)
    file_suffix = "%s__weighted_lrmsd_chain_0.txt" % aligner_tag
    sblpyt.show_text_file(file_suffix, odir_osdir)
    file_suffix = "%s__weighted_lrmsd_chain_1.txt" % aligner_tag
    sblpyt.show_text_file(file_suffix, odir_osdir)
    file_suffix = "%s__weighted_lrmsd_chain_2.txt" % aligner_tag
    sblpyt.show_text_file(file_suffix, odir_osdir)

aligner_kpax = "sbl-rmsd-flexible-proteins-kpax.exe" # kpax
cmp_proteins_with_aligner_m_pc_n_chains(odir,aligner_kpax, "kpax")
aligner_apurva = "sbl-rmsd-flexible-proteins-apurva.exe" # apurva
cmp_proteins_with_aligner_m_pc_n_chains(odir,aligner_apurva, "apurva")
# cmp proteins using motifs
##################################################################################
def cmp_proteins_with_motifs():
    ## 4.2 Combined RMSD for structural motifs
    osdir = "motifs"
    aligner = "sbl-rmsd-flexible-motifs.exe"
    cmd = "%s --pdb-file data/pdb_files/SFV-1RER.pdb --chains A --pdb-file data/pdb_files/RVFV.pdb --motif-file data/motifs.txt --allow-incomplete-chains -p 3 -d %s/%s -v -l" % (aligner,odir,osdir)
    print("Running %s" % cmd)
    os.system(cmd)

cmp_proteins_with_motifs()