Structural Bioinformatics Library
Template C++ / Python API for developping structural bioinformatics applications.

Authors: F. Cazals and R. Tetley
Structural motifs. Structural alignment methods for polypeptide chains aim at identifying conserved structural motifs between structures.
Global structural alignment methods often fail to exploit local properties, i.e. locally conserved distances. Phrased differently, for similarity measures such as (see Molecular_distances and Molecular_distances_flexible), large values between two structures possibly obliterates regions of smaller size and with a much better structural conservation.
This package provides a method identifying locally conserved structural motifs shared by two polypeptide chains whose structure is known. In a nutshell, a structural motif consists in two sets of a.a., one on each structure, such that the of the motif is significantly smaller than that associated with a global alignment between the two structures:
Applications involving structural motifs. Motifs may be used to compute combined as done in Molecular_distances_flexible , and/or investigate the homology between proteins using profile hidden Markov models as done in FunChaT .
In the sequel, we recall the main concepts and the four steps used to identify motifs, referring the reader to [32] for the details.
For the sake of conciseness, we use structure and polypeptide chain interchangeably; likewise, structural motif and motif are used interchangeably.
Consider two polypeptide chains and of length and . Also consider a seed alignment between them; let be its length, namely the number of a.a. matched.
The alignment consists of pairs of a.a. mapped with one another, and can be computed using various methods. In this package, we use the following two aligners:
Denoting the distance between the carbons of indices i and j on chain , and likewise on chain , an assessment of the matching is provided by the distance difference matrix (DDM), a symmetric matrix whose entries, also called scores, are given by:
Note that these distances qualify the conservation of distances between between the two structures.
In mathematics, a filtration is a (finite) sequence of nested spaces [68] . Our method uses a filtration which may be the conserved distances (CD) filtration or the space filling diagram (SFD) filtration.
Conserved distances (CD) filtration. For each molecule, consider a graph whose vertices represent a.a., and edges are associated with all pairs of a.a (complete graph). Each edge is weighted by the score . Assume that edges have been sorted in ascending order.
To each edge weight–say , corresponds a graph containing only those edges whose weight is . In processing edges by increasing value, these graphs are nested. We call this sequence of graphs the conserved distances (CD) filtration.
Space filling diagram (SFD) filtration. Molecular models defined by union of balls are called space filling diagrams (SFD). Geometric and topological properties of such diagrams are encoded in the socalled complex (see Alpha_complex_of_molecular_model , as well as Space_filling_model_surface_volume ).
Upon sorting the scores by ascending value, define the absolute rank of a given a.a. as the smallest index of a score involving that . Finally, let the rank of a given a.a. be the index of its absolute rank (see details in [32] ).
For each molecule, to each index–say , corresponds a SFD containing only those a.a. whose index is . In processing indices by increasing value, these SFD are nested. We call this sequence of SFD the SFD filtration.
Persistence diagram. For either filtration, let us focus on connected components, which appear and disappear when inserting edges (CD filtration) or a.a. (SFD filtration). A unionfind data structure can be used to maintain these c.c. [52], and track in particular the birth date and the death date of each component.
The collection of all these points defines a 2D diagram called the persistence diagram [68] . Note that all points lie above the diagonal ; moreover, the distance of a point to the diagonal is the lifetime of the c.c. associated to that point.
To compute persistence diagrams, we use the Morse_theory_based_analyzer package.
As previously stated, for each structure we compute a persistence diagram, respectively denoted and .
Two points located in the same region of their respective PD have comparable birth and death dates. We identify such pairs of points. Denote and two such points. We say that and are comparable if the Euclidean distance between the two points (computed by using their respective birth and death dates as coordinates) is below a given threshold .
When two points are comparable, we identify motifs as follows:
First, we compute the connected components associated to the sublevel set of the graph (CD filtration), or SFD (SFD filtration).
Second, we perform a structural alignment between all pairs of connected components. We distinguish the cases of homologous proteins and that of conformations:
As detailed in [32] , motifs are filtered in two ways:
As detailed in [32], motifs can be used to compute the edgeweighted (see Molecular_distances_flexible). Additionally, structural motifs can be used to seed an iterative aligner (see the package Iterative_alignment).
As explained above, the detection of motifs hinges on two ingredients:
Combining these yields four methods:
These four methods are implemented within the following executables: each giving access to the CD and SFD filtrations:
We illustrate how to run the program on one executable since all behave in the same manner.
Input. In a nutshell:
Main options. The main options are:
Example command line for two conformations–launched from the demos directory:
sblstructuralmotifsconformations.exe distthreshold 10 lrmsdthreshold 0.5 sizediff 5 allowincompletechains pdbfile data/1SVB.pdb chains A pdbfile data/1URZ.pdb chains A directoryoutput expleconformationscd v 2
Example command line for two proteins–launched from the demos directory:
sblstructuralmotifschainskpax.exe motifsize 10 lrmsdthreshold 0.8 allowincompletechains p 4 pdbfile fatcat2003/1FXI.pdb chains A pdbfile fatcat2003/1UBQ.pdb chains A directoryoutput expleproteinscd v 2
Main output. The output provides statistics for all the structural motifs that were found along with a PyMol script that can be used to visualize them:
Rigid blocks: Two main blocks found by the executable Images generated using the outputted pymol scripts. 
To further investigate each structural motif, the executable outputs PyMol scripts. In order to run the script corresponding to, say motif 8, in the previous run, the user should follow these steps:
load data/1URZ.pdb load data/1SVB.pdb
run results/conformationssf/sblstructuralmotifsconformations__block_8_script.py
A screenshot of the PyMol window after running one of the outputted pymol scripts 
To be developed.
The package provides two python scripts:
sblstructuralmotifscomputer.py.
Consider a list of fourtuples, namely {(pdbid1, chain1, pdbid2, chain2)}, listed in a text file. The following call uses and/or (see option a) to perform a motif calculation for each fourtuple:
sblstructuralmotifscomputer.py i fatcat2003 f fatcat2003/fatcat2003small.txt a three
sblstructuralmotifsanalyzer.py.
The scatter plots for motifs indicate motif ids, and the associated Hasse diagram (one for each molecule) also indicates how motifs are nested. The following script processes a directory generated by the previous script, and proposes two analysis:
The following call lists the resids found in the motif whose id is 5:
dulferfcazals> ./sblstructuralmotifsanalyzer.py d results/apurva/1CID.pdbA2RHE.pdbA m 5 Motif file is results/apurva/1CID.pdbA2RHE.pdbA/sblstructuralmotifschainsapurva__persistent_motifs.xml Log file is results/apurva/1CID.pdbA2RHE.pdbA/sblstructuralmotifschainsapurva__log.txt Initial alignment size: 103.0 Initial alignment lRMSD: 3.944686 XML: 1 / 1 files were loaded ('Motif indices:', [1, 2, 3, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17, 18]) ('Motif lrmsds:', [1.715135302837146, 3.027206911843852, 2.794978851169716, 3.03162145728718, 1.4790716539949267, 1.7351701619296112, 1.695979267778361, 2.9667660204140045, 3.0269885774433014, 2.689248812159066, 3.031806753459111, 2.739264842927472, 1.6532866267424344, 2.871028387233174, 3.031927685048948, 3.0744944632903914]) ('Motif sizes:', [34, 77, 47, 63, 12, 17, 19, 40, 68, 52, 75, 50, 32, 44, 66, 61]) List of the 12 resids in molecule 1 for motif index 5: [9, 10, 13, 16, 26, 31, 73, 83, 95, 96, 97, 98] 12 resids in molecule 2 for motif index 5: [14, 15, 18, 21, 32, 37, 73, 83, 97, 98, 99, 100] dulferfcazals>
The following call compares the lists of resids associated with two motif ids – which is especially useful to investigate the relationship between motifs of interest identified on the Hasse diagram of motifs:
dulferfcazals> ./sblstructuralmotifsanalyzer.py d results/apurva/1CID.pdbA2RHE.pdbA m 5 n 7 Motif file is results/apurva/1CID.pdbA2RHE.pdbA/sblstructuralmotifschainsapurva__persistent_motifs.xml Log file is results/apurva/1CID.pdbA2RHE.pdbA/sblstructuralmotifschainsapurva__log.txt Initial alignment size: 103.0 Initial alignment lRMSD: 3.944686 XML: 1 / 1 files were loaded ('Motif indices:', [1, 2, 3, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17, 18]) ('Motif lrmsds:', [1.715135302837146, 3.027206911843852, 2.794978851169716, 3.03162145728718, 1.4790716539949267, 1.7351701619296112, 1.695979267778361, 2.9667660204140045, 3.0269885774433014, 2.689248812159066, 3.031806753459111, 2.739264842927472, 1.6532866267424344, 2.871028387233174, 3.031927685048948, 3.0744944632903914]) ('Motif sizes:', [34, 77, 47, 63, 12, 17, 19, 40, 68, 52, 75, 50, 32, 44, 66, 61]) List of the 12 resids in molecule 1 for motif index 5: [9, 10, 13, 16, 26, 31, 73, 83, 95, 96, 97, 98] 12 resids in molecule 2 for motif index 5: [14, 15, 18, 21, 32, 37, 73, 83, 97, 98, 99, 100] List of the 17 resids in molecule 1 for motif index 7: [9, 10, 13, 16, 25, 26, 28, 31, 60, 67, 73, 83, 84, 95, 96, 97, 98] 17 resids in molecule 2 for motif index 7: [14, 15, 18, 21, 31, 32, 34, 37, 59, 66, 73, 83, 84, 97, 98, 99, 100] [molecule1] Symmetric differences, sizes 0 12 5 Symmetric differences, items: ('A minus B:', []) ('A intersection B:', [9, 10, 13, 16, 26, 31, 73, 83, 95, 96, 97, 98]) ('B minus A', [25, 28, 60, 67, 84]) [molecule2] Symmetric differences, sizes 0 12 5 Symmetric differences, items: ('A minus B:', []) ('A intersection B:', [14, 15, 18, 21, 32, 37, 73, 83, 97, 98, 99, 100]) ('B minus A', [31, 34, 59, 66, 84])
The implementation is straightforward: each step described in the previous section is computed in a module, which are all linked together in the following workflow.
The package Structural_motifs uses two novel data structures, the filtration graphs. These were implemented in the classes:
These two classes allow to switch between the different types of filtration.
The package Structural_motifs is centered around two main new modules:
The package also uses the following modules:
See the following jupyter notebook:
Recall that the SBL provides three executables to mine motifs shared by two structures:
The main difference is that for conformations of the same molecule, we do not need to compute any alignment.
In the sequel, we focus on conformations  using the remaining two executables is similar.
from SBL import Structural_motifs
from Structural_motifs import *
from SBL import SBL_pytools
from SBL_pytools import SBL_pytools as sblpyt
#help(sblpyt)
# Structural motifs general options
smgo = Structural_motifs_general_options()
smgo.update_mode("conformations")
print("General options are: ", str(smgo))
# Prepare the 4tuple to be processed
smgo.set("input_dir", "fusionclassII")
data = ("1SVB.pdb", "A", "1URZ.pdb", "A")
# Create an instance of the algo and run it
sm_for_pairs = Structural_motifs_computer_for_pair(smgo)
res_triple = sm_for_pairs.process_two_structures( data )
res_triple.dump()
Socre plot. Recall that the Calpha ranks are obtained from a structural distance score, namely the distance difference between two Calpha carbons. If CA_i and CA_j are aligned to CA_i' and CA_j' respectively, the distance difference is s_ij = d_ij  d_i'j' where d_ij = CA_i  CA_j. We can plot the s_ij scores against their respective Calpha ranks.
Calpha distance plot. For a given Calpha rank we can recover the corresponding Calpha carbons CA_i and CA_j. We can then plot CA_i  CA_j (or CA_i'  CA_j' for the second structure) against the Calpha rank. This gives us an indication on how the atoms are selected on the structure. Notice the distinctive sawtooth pattern indicating that atoms can be far apart and we do not necessarily process atoms adjacent to each other on the backbone.
Sequence shift plot. For the same CA carbons (CA_i and CA_j), we assume their residue sequence numbers are i and j. We can then plot i  j (i'  j' for the second structure) against the Calpha ranks. This gives us an indication on their proximity in the sequence. Once again we notice a distinctive sawtooth pattern.
# Extract statistics and run plots
(stats_size_cd, stats_lrmsd_cd) = Structural_motifs_renderer.render_pairwise_comparison_one_aligner(res_triple)
Sticking to our two original structures (conformations of the TickBorne Enciphalitis virus envelope protein), we study the incidence of the filtration on the results. To this end, we rerun the previsous calculations passing with the usesfdfiltration option.
from SBL import Structural_motifs
from Structural_motifs import *
from SBL import SBL_pytools
from SBL_pytools import SBL_pytools as sblpyt
#help(sblpyt)
# Structural motifs general options
smgo = Structural_motifs_general_options()
smgo.update_mode("conformations")
smgo.set("use_sfd", True) # Use space filling diagram instead of conserved distances
# Prepare the 4tuple to be processed
smgo.set("input_dir", "fusionclassII")
data = ("1SVB.pdb", "A", "1URZ.pdb", "A")
# Create an instance of the algo and run it
sm_for_pairs = Structural_motifs_computer_for_pair(smgo)
res_triple = sm_for_pairs.process_two_structures( data )
res_triple.dump()
# Stats and plots
(stats_size_sfd, stats_lrmsd_sfd) = Structural_motifs_renderer.render_pairwise_comparison_one_aligner(res_triple)
We can then proceed to compute pareto envelopes and plot the stats for both methods on the same figure.
def compute_pareto_envelope(lrmsd_stats, size_stats):
#create motif points
motif_points = [[lrmsd_stats[i], size_stats[i]] for i in range(0, len(lrmsd_stats))]
#sort motifs by increasing lRMSD
motif_points = sorted(motif_points, key=lambda motifs: motifs[0])
#find dominated points
dominated_points = []
for i in range(0, len(motif_points)):
for j in range(i+1, len(motif_points)):
if(motif_points[i][1] >= motif_points[j][1]):
dominated_points.append(motif_points[j])
#exclude them from the pareto envelope
pareto_envelope = list(zip(*[motif for motif in motif_points if (motif not in dominated_points)]))
motif_points = list(zip(*motif_points))
# just for plotting purposes (so the limits are will defined)
min_lrmsd = min(pareto_envelope[0])
pareto_envelope[0] = [min_lrmsd] + list(pareto_envelope[0])
pareto_envelope[1] = [0] + list(pareto_envelope[1])
return motif_points, pareto_envelope
def pareto(cd_lrmsd_stats, cd_size_stats, sfd_lrmsd_stats, sfd_size_stats):
#just to define the bounds of the pareto envelope (for esthetic purposes)
max_lrmsd = max(max(cd_lrmsd_stats), max(sfd_lrmsd_stats))
max_size = max(max(cd_size_stats), max(sfd_size_stats))
cd_motif_points, cd_pareto_env = compute_pareto_envelope(cd_lrmsd_stats, cd_size_stats)
sfd_motif_points, sfd_pareto_env = compute_pareto_envelope(sfd_lrmsd_stats, sfd_size_stats)
cd_pareto_env[0].append(max_lrmsd+5)
cd_pareto_env[1].append(max(cd_pareto_env[1]))
sfd_pareto_env[0].append(max_lrmsd+5)
sfd_pareto_env[1].append(max(sfd_pareto_env[1]))
plt.plot(cd_pareto_env[0], cd_pareto_env[1])
plt.scatter(cd_motif_points[0], cd_motif_points[1])
plt.plot(sfd_pareto_env[0], sfd_pareto_env[1], color='red')
plt.scatter(sfd_motif_points[0], sfd_motif_points[1], color='red')
plt.xlabel('lRMSD')
plt.ylabel('Size')
plt.show()
pareto(stats_lrmsd_cd, stats_size_cd, stats_lrmsd_sfd, stats_size_sfd)
print("Done")
The executable also outputs a number of PyMol script to view motifs directly. Each motifs has a unique identifier and the scripts follow the same nomenclature: to view motif 31 look for the '_block_31_script.py' file.
To run the script from PyMol simply load the two structures and type 'run name_of_the_script.py' Below is the image of two motifs yielded from the previous comparisons. As you can see our method allows to recover the blocks which are invariant in the conformational changes.
from IPython.display import Image
display(Image(filename ="fig/motif_51.png", width=500, height=500));
display(Image(filename ="fig/motif_54.png", width=500, height=500));