Spectral_domain_explorer

Authors: F. Cazals and J. Herrmann and E. Sarti

Introduction: from Spectrus to Spectraldom

Defining domains within a polypeptide chain or protein amounts to maximizing intra-domain interactions while minimizing inter-domain interactions. This problem is closely related to graph partitioning techniques, and in particular spectral clustering [171].

The $\text{\spectrus}$ algorithm from [145] proposes a solution to this problem using spectral partitioning applied to a coarse-grain model involving $\Calpha$ carbons. Consider pairs of residues whose distance $\dij$ (the distance between the $\Calpha$ carbons) is less than some threshold $\dmax$ . For these residues, assume one is able to compute a pairwise fluctuation

akin to the standard deviation of their mutual distance:

$\begin{equation} f_{i,j} = \sqrt{ \brave{d_{i,j}^2} - \brave{d_{i,j}}^2 }. \end{equation}$

This fluctuation is turned into a similarity $\wij$ , and the corresponding matrix $W=(\wij)_{ij}$ is passed to spectral clustering to infer a predefined number of domains/clusters – a number which is subsequently optimized.

This package implements $\text{\spectraldom}$ [48], which simplifies and extends the original $\text{\spectrus}$ algorithm, offering three modes:

Spectraldom_ENM: the original $\text{\spectrus}$ mode [145], which applies to a single molecule. The fluctuation of Eq. eq-fluct is derived from an elastic network model (ENM) and the associated normal modes [13].

Spectraldom_DM: this new diffusion maps (DM) mode which simplifies the previous mode by plainly defining the similarity matrix from stiffness constants associated to pairwise covalent and non-covalent interactions. That is, the stiffnesses are directly used as weight/similarity matrix for spectral clustering, removing the calculations of normal modes.

Spectraldom_MSA: this new Multiple Sequence Alignment mode, to handle a collection of conformations of homologous proteins. In this setting, the fluctuation of Eq. eq-fluct is computed from the atomic coordinates of $\Calpha$ carbons provided by the structures.

Using Spectraldom

Single structure: modes ENM and DM

Background. We first provide some key elements.

$\text{\spectraldom}$ proposes the aforementioned three modes (ENM, DM, MSA), which can be ascribed to two major modes: single structure (ENM, DM), and multiple structures (MSA).

Modes ENM and DM require a PDB/mmCIF file and the specification of a chain. These two modes use stiffness constants to encode pairwise contacts between residues. The default values are as follows:

Covalent contacts: $\stiffc{ij} =10$ ;
Non-covalent contacts: $\stiffnc{ij}=1$ ;
Hydrogen bonding: $\stiffhbond{ij}=1/2$ .

Note that two amino acids may interact in different ways. For example, two a.a. on the same generatrix of a $\alpha$ -helix interact non-covalently, but also via hydrogen bonding. We therefore define two modes to define the weight $\wij$ of two residues with multiple interaction types, the $\text{\spectraldomWmax}$ (resp. $\text{\spectraldomWsum}$ ) mode in which we take the max (resp. the sum) of individual stiffnesses. In our helix example, one gets $\gamma_{ij} = \max(\stiffnc, \stiffhbond)$ , and $\gamma_{ij} = \stiffnc+\stiffhbond$ respectively.

Mode MSA requires a multiple sequence alignment in FASTA format, as well as a file giving access to the structures processed. A position in the MSA is termed valid provided that no dash is present in that column, and all residues have been resolved in that structure. See below for details.

For a fixed value of , a score $\spectrusscore{k}{n}$ is compute–see details below. This score is an assessment of the corresponding clustering with respect to a null model. The decomposition yielded by $\text{\spectrus}$ is not deterministic since the $\text{\kmeanspp}$ with random initialization is internally used [12]. The stability assessment is obtained via a quality plot , namely the boxplot $\spectrusscore{k}{n}$ obtained for a fixed number of runs/repeats of $\text{\kmeanspp}$ .

The main options of $\text{\spectraldom}$ are:

-f arg string: Filepath to PDB or mmCIF file
–load-chains string: Chain of the PDB or mmCIF file to be processed
–mode arg (=DM) string: Mode: MSA, DM, ENM
–msa-fpath arg string: Filepath for the MSA in fasta format
–structures-fpath arg string: Filepath of the file containing a list of pairs filepath of PDB/mmCIF file + associated chain
-r [ –result-dir ] arg string: Output directory
–dmax-contacts arg (=10) string: Dmax distance threshold for non covalent contacts
–stiffness-cov arg (=10) string: Stiffness for covalent contacts
–stiffness-ncov arg (=1) string: Stiffness for non-covalent contacts
–stiffness-hbond arg (=0.5) string: Stiffness for hydrogen bonds
–sse string: Use SSE info from PDB or mmCIF header to define h-bonds
–stiffness-mode arg (=max) string: Stiffness mode: sum or max
–dzero arg (=4.5) string: Distance d0 used to soften the stiffnesses–Eq eq-gamma-soft
–sigma arg (=0.5) string: Stdev used to soften the stiffnesses–Eq eq-gamma-soft
–kmin arg (=2) string: Min num of clusters
–kmax arg (=10) string: Max num of clusters

NB: run $\text{\sblspectraldom}$ to get all options.

Option structures-fpath takes as arg the filepath of the file containing a list of pairs filepath of PDB/mmCIF file + associated chain. Here is an example of such a file

>a/b/c/1ggg.pdb_A ADKKLVVATDTAFVPFEFKQGDLYVGFDVDLWAAIAKELKLDYELKPMDFSGIIPALQTKNVDLALAGITITDERKKAIDFSDGYYKSGLLVMVKANNNDVKSVKDLDGKVVAVKSGTGSVDYAKANIKTKDLRQFPNIDNAYMELGTNRADAVLHDTPNILYFIKTAGNGQFKAVGDSLEAQQYGIAFPKGSDELRDKVNGALKTLRENGTYNEIYKKWFGTEPK >a/b/c/1wdn.pdb_A ADKKLVVATDTAFVPFEFKQGDLYVGFDVDLWAAIAKELKLDYELKPMDFSGIIPALQTKNVDLALAGITITDERKKAIDFSDGYYKSGLLVMVKANNNDVKSVKDLDGKVVAVKSGTGSVDYAKANIKTKDLRQFPNIDNAYMELGTNRADAVLHDTPNILYFIKTAGNGQFKAVGDSLEAQQYGIAFPKGSDELRDKVNGALKTLRENGTYNEIYKKWFGTEPK

Output. We noted above that $\text{\spectraldom}$ internally resorts to runs of $\text{\kmeanspp}$ . This said, the output is as follows:

Two csv files containing scores: scores_repeats.csv, i.e. the scores obtained for all repeats and all values of ; and scores_best.csv: the best scores for each value of – thus values.

As illustrated in the notebook below, each line contains 5 numbers:

k,distance_stdev,ave_ratio_distances_NN1_NN2,norm_ave_ratio_distances_NN1_NN2,BIC

For each structure and each value of , one text file spectraldom-clustering-pdbid-chaind-k.txt containing a list of pairs $(\text{resid} \in 1..\text{num\_valid\_residues}, \text{cluster\_index} \in 0..k-1)$ . As detailed below, valid residues are defined using - in the MSA and missing residues in structures.

For each structure and each value of , a VMD visualization state grouping all residue of the same domain/cluster in the same representation. We also create a representation, in CPK mode, for those resides which are present in a structure, but are invalidated by the MSA.

Algorithm

We now review the main steps of the algorithm, see [145] and [48]. The user is also referred to Section Implementation for more details on the C++ classes involved.

Building contact maps

We record contacts between pairs of residues, identified by their sequence numbers.

Covalent contacts (CT_COV). These contacts are directly derived from the covalent graph. The stiffness constant associated with a covalent contact is denoted $\stiffc (=10)$ .

Non-covalent contacts (CT_NCOV). We take for granted a method to report neighoring heavy atoms in a 3D structure, via range searching. A distance cutoff applies, see below. The stiffness constant associated with a covalent contact is denoted $\stiffnc (=1)$ .

Hydrogen-bonding contacts (CT_HBOND). These contacts are retrieved from secondary structures provided in the header of the PDB/mmCIF file processed. The corresponding stiffness constant is denoted $\stiffhbond{ij} (=1/2)$ .

Other types of contacts can easily be accommodated: CT_SS: contacts for disulfide bonds, CT_SALTB: contacts for salt bridges, etc.

Building the similarity matrix

We now explain how the weights $\wij$ of the similarity matrix used for spectral clustering are computed.

Mode Spectraldom_ENM. Following [13], an elastic network model (ENM) is built from covalent and non covalent contacts. The corresponding normal modes are computed, from which distance fluctuations for pairs of residues are obtained [13]. Denote $\sigma$ the fluctuation stdev over all pairs. The fluctuation $f_{ij}$ is turned into the similarity $w_{ij}$ as follows:

$\begin{equation} \wij = \expL{ -\frac{f_{ij}^2} { 2 \sigma^2}}. \end{equation}$

Mode Spectraldom_DM. As noted above, each pair of neighboring / interacting residues is assigned one or several stiffness values. In case of duplicate values for two residues, the max value is used – mode $\text{\spectraldomWmax}$ . To alleviate the incidence of the distance threshold $\dmax$ used to identify neighboring residues, each stiffness constant is softened as follows–with $d_0(=4\ANG)$ a reference distance:

$\begin{equation} \wij = \gamma_{ij}*exp( -\frac{\fabs{\dij-d_0}}{\sigma} ). \end{equation}$

We take $d_0 =4.5\ang$ as typical distance between consecutive $\Calpha$ s along the backbone, and $\sigma=1/2$ .

The corresponding matrix is directly used by the spectral clustering algorithm.

Mode Spectraldom_MSA. The multiple sequence alignment (MSA) mode handles the case of molecules with similar structures and disimilar sequences. Given a MSA, the atomic distance fluctuation of Eq. eq-fluct is computed from the distances observed in the structures.

Spectral clustering for a fixed number k of clusters/domains

The similarity matrix is used to perform spectral clustering, see [135] and [171] .

Consider the edge weighted undirected graph whose vertices are the individual residues, and whose edge weights are the $\wij$ . Define the degree of a vertex as the sum of similarities for incident edges, and let be the corresponding diagonal matrix.

Consider the usual unnormalized Laplacian . The clustering method uses the eigenvectors associated with the lowest eigenvalues of the symmetric Laplacian $\lapSym = D^{-1/2} L D^{-1/2}$ .

Then, one clusters the normalized rows of the $n\times k$ matrix of eigenvectors truncated to their first coordinates. This clustering step on the unit sphere $\Sd{k-1}$ yields clusters.

The original $\text{\spectrus}$ uses k-medoid instead of k-means [145] . We use $\text{\kmeanspp}$ instead, implemented in a generic fashion to accommodate points on the unit sphere $\Sd{k-1}$ . Note that in this generic version, the distance between a data point and a cluster center is the length of the shortest great circle arc joining them. Likewise, a center of mass used in k-means in constrained to lie on the unit sphere $\Sd{k-1}$ . The generic implementation of $\text{\kmeanspp}$ summarizes all scores of a clustering within the class SBL::GT::K_means_scores.

Optimal number of clusters

Fixed value of k. For a fixed value of , we assess the quality of the clustering using the fitness score used in [145] .

Assume there are residues. Spectral clustering requires clustering unit vectors on the sphere $\Sd{k-1}$ . Recall that in $\text{\kmeans}$ or k-medoids, one assigns a data point to the nearest center. To assess whether this assignment is stable, one computes the ratio of distances to the first and second nearest centers. (When clusters are well separated, this ratio should be small.) This average ratio for points is denoted $\kmratiokn{k}{n}$ . To compare this value to a null model, let $\kmratiorefkn{k}{n}$ be the same computed for random points $\Sd{k-1}$ – these reference values are called prefactors in the sequel. The $\text{\spectrus}$ fitness score retained for that value of reads as

$\begin{equation} \spectrusscore{k}{n} = \frac{\kmratiokn{k}{n}}{\kmratiorefkn{k}{n}}. \end{equation}$

Repeats. For a fixed value of k, we perform repeats, and retain the clustering with the best fitness score.

Range for k and repeats. Consider now a range of values for $k \in [\kmin, \kmax]$ . The previous step is iterated to optimize the number of domains in this range. Moreover, for each $k \in [\kmin, \kmax]$ , a default number of repeats clusterings is performed and the best is retained for that value of .

Quality plot. The quality plot is the box plot of the score Eq. eq-spectrusscore.

On the quality plots, the values of

such that primary sequence size divided by the number of domains is less than 35 amino acids (default size for a helix) are shaded, as decomposing a structure into SSE is not the goal pursued.

Hyper-parameters: comments

The following comments are in order regarding all parameters used:

Due to Eq. eq-gamma-soft, the segmentation obtained is robust with respect to the values of $\stiffc{ij}, \stiffnc{ij}, \stiffhbond{ij}$ , and also to those of and $\sigma$ .

On the other hand, the value of $\dmax$ used to identify neighboring residues matters, and a default value $\dmax = 10$ is used. Smaller values yield sparse graphs which favor an over-segmentation.

Implementation

The implementation requires a diverse set of pieces, which we now briefly present.

Main classes

Filters for residues in structures. The class SBL::CSB::T_Chains_residues_contiguous_indexer is used to record the residues present in the structures associated with the sequences of the MSA. Contiguous indices for resids are the contiguous integers which reference the resids, irrespective of gaps.

Multiple sequence alignments. Consider a MSA of length , and assume each sequence has length . We recall selected notions from the package Alignment_engines.

In order for atomic fluctuations to be computed using the same number of pairs (i.e. the binomial coefficient $\binom{\text{num. sequences}}{2}$ ), we define:

A MSA position is valid provided there are no dashes in a MSA column. The number of valid positions is denoted $N_v\leq N$ .

The class SBL::CSB::T_Multiple_sequence_alignment_DS is the data structure to handle MSA; it provides in particular the following two mappings:

first, alignment indices to sequence indices, i.e. maps from to
second, alignment indices to valid sequences indices, i.e. maps from to .

The previous class overlooks the presence of residues in the structures associated to the sequences of the MSI.

The class SBL::CSB::T_Multiple_sequence_alignment_DS_filtered has the same interface as SBL::CSB::T_Multiple_sequence_alignment_DS, except that it implements an extra check based on residues associated with a MSA position:

Consider a MSA and the associated structures. A position of the MSA is valid if it is valid if the sense of Def. def-valid-msa-bis with no gap tolerated, and if the residue associated to a MSA position is present in every structure.

In practice, this latter check is provided by the class SBL::CSB::T_Chains_residues_contiguous_indexer. (Recall that this class contains a map from a residue sequence number (minus 1) to contiguous indices.)

Representation of polypeptide chains. The class SBL::CSB::T_Polypeptide_chain_representation from the package Protein_representation is used to access all pieces of information on polypeptide chains (covalent neighbors, atomic coordinates).

Contacts. Given a polypeptide chain, consider the following classes of contacts, defined as interacting residues

CT_COV: covalent contacts
CT_NCOV: non covalent contacts
CT_HBOND: contact corresponding to hydrogen bonds
CT_SB: salt bridges
CT_SS: disulfide bonds

The class SBL::CSB::T_Polypeptide_chain_contacts_finder finds all such contacts, and provides them as pairs of residue sequence numbers –in (min, max) order. The search of non covalent contacts uses the spatial search engine from CGAL, based on Kd trees.

Instantiating this class uses the local class SBL::CSB::T_Point_xyz_with_atom , which contains the point used for spatial search (CGAL), and also knows the atom from the polypeptide chain representation.

Elatic Network Model and pairwise interactions. Given one polypeptide chain, the class SBL::CSB::T_Elastic_network_model implements the ENM model from [13] , providing in particular a calculation of anisotropic fluctuations associated with normal modes.

As usual, the structure processed may have missing residues at the C-ter and/or N-ter, and also within the structure. Also, its residue sequence numbers may start with negative values. On the other hand, indices in Eigen start at 0. Therefore, residue sequence numbers are converted to contiguous indices using the class SBL::CSB::T_Chains_residues_contiguous_indexer.

An ENM model can be seen as a graph whose vertices are residues, connected by edges.

The $\Calpha$ s are collected and inserted into a vector indexed by contiguous indices.

Contacts are sought between residue identified by their sequence numbers. These contacts are also converted to contiguous indices which can be used in Eigen matrices – Eigen indices start at 0.

Linear algebra operations are carried out using the Eigen library.

Clustering and spectral clustering. Spectral clustering is implemented in the class SBL::GT::T_Spectral_clustering, which is parameterized by a traits class provide an edge weighted boost graph. The weight $\wij$ associated with an edge is used to define the Laplacian and weighted Laplacians of this graph [171] .

The class SBL::GT::T_Spectral_clustering internally uses the generic version of $\text{\kmeanspp}$ which handles points on the unit sphere $\Sd{k-1}$ , via the class SBL::GT::T_Cluster_engine_k_means .

As a sanity check, spectral clustering uses the post-condition based on Eigen::Success to ensure that the computation of eigenvalues/vectors succeeded. Naturally, the corresponding assert is discarded for optimized compiled codes.

Prefactors for kmeans. Prefactors are the denominators involved in Eq. eq-spectrusscore . They are simply computed by running $\text{\kmeanspp}$ for random points on the unit sphere $\Sd{k-1}$ , for various values of . In practice, a table is stored in a cpp file at compile time, see below.

Spectraldom. We finally arrive at the class T_Spectraldom which implements the aforementioned three modes. Given that the DM and ENM modes use a single structure but not MSA, we use the following design pattern to handle the three modes coherently and comply with Defs. def-valid-msa-bis and def-valid-msa-struct-bis :

Single structure for modes ENM and DM. When loading the structure and its $\Calpha$ carbons, the corresponding sequence is built – with X letters for residues missed in the structure. A filter instantiating the class SBL::CSB::T_Chains_residues_contiguous_indexer and storing the unique reconstructed sequence is passed to the derived class SBL::CSB::T_Multiple_sequence_alignment_DS_filtered.

Multiple structures for the mode MSA. An instance of type SBL::CSB::T_Chains_residues_contiguous_indexer is also used to record missing residues of all structures.

The Spectraldom traits class requires two types:

Polypeptide_chain_representation : default class used is SBL::CSB::T_Polypeptide_chain_representation,

Multiple_sequence_alignment : default class used is SBL::CSB::T_Multiple_sequence_alignment_DS_filtered.

Back to structures from clusters. Let be the number of valid indices – Def. def-valid-msa-struct-bis. Spectral clustering returns clusters containing indices in the range . One obtains the residue_sequence_numbers using the reverse maps provided by the class SBL::CSB::T_Multiple_sequence_alignment_DS .

Programs and scripts provided

Main executable: $\text{\sblspectraldom}$ : the main program implementing $\text{\spectraldom}$ , as described above.

Main script: sbl-spectraldom-run.py: a script calling $\text{\sblspectraldom}$ , and also providing a number of post-processing steps, including (i) the generation of the quality plot (ii) the generation of visualization files for VMD, and (iii) the generation of spec files to compute $\text{\rmsdcomb}$ using the package Molecular_distances_flexible .

Reducing fragmentation: sbl-spectraldom-defrag.py: a script used to remove fragmentation of the domains identified. See paper [48] for details.

Handling prefactors. To compute and manipulate prefactors involved in the scores $\spectrusscore{k}{n}$ , the package provides

sbl-spectraldom-generate-prefactor.cpp/exe : runs $\text{\kmeanspp}$ for random points on a sphere of various dimension, and generates a csv file containing the prefactors;

sbl-spectraldom-prefactor-header-writer.py : using the previous csv file, generates kmeans-prefactors.hpp, which is used to compile $\text{\sblspectraldom}$ .

Jupyter demo

See the following jupyter demos:

Jupyter notebook file
Spectral_domain_explorer
Jupyter notebook for the SBL package Spectral_domain_explorer¶
This notebook shows how to lauch the commands and comments on the outpout. It shows the quality plot obtained, but does not perform an interactive visualization of the structures -- which should be done with VMD or pymol. To run the notebook, a number of executables / scripts from the SBL must be available from one's PATH:

To find domains:

(package Spectral_domain_explorer) sbl-spectraldom.exe

To perform flexible alignments

(package Kpax) : sbl-kpax.exe

To compare the domains obtained against structural similarities identified by the flexible aligner Kpax:

(package Molecular_distances_flexible) : sbl-rmsd-flexible-proteins-kpax.exe

To reduce the fragmentation of domains using the D-family-matching algorithm

(package D_family_matching) : sbl-cmp-clust-dfam.exe
In [1]:
import os, subprocess from SBL_pytools import SBL_pytools py = subprocess.run("which python3", shell=True, stdout=subprocess.PIPE).stdout.decode('utf-8').strip() spectraldom_run_exe = subprocess.run("which sbl-spectraldom-run.py", shell=True, stdout=subprocess.PIPE).stdout.decode('utf-8').strip()
I. Example for the diffusion map (DM) mode¶
In [2]:
# We use sbl-spectraldom-run.py to run sbl-spectraldom.exe on chain A of 1ggg.pdb # NB: the script dumps the command invoking sbl-spectraldom.exe cmd = "%s %s --pdb-fpath data/DB-molecular-motions/1wdn.pdb --load-chains A --output-dir 1ggg-1wdn" % (py, spectraldom_run_exe) #cmd = "sbl-spectraldom-run.py --pdb-fpath data/DB-molecular-motions/1wdn.pdb --load-chains A --output-dir 1ggg-1wdn" print(cmd) os.system(cmd) ofile = SBL_pytools.find_file_in_output_directory("quality-score-plot.pdf", "1ggg-1wdn") print("Quality plot file:", ofile) SBL_pytools.show_pdf_file(ofile, 200)
/usr/bin/python3 /user/fcazals/home/projects/proj-soft/sbl-install/bin/sbl-spectraldom-run.py --pdb-fpath data/DB-molecular-motions/1wdn.pdb --load-chains A --output-dir 1ggg-1wdn Pdb files found : Collected data/DB-molecular-motions/1wdn.pdb A +++ Running spectraldom in mode Spectraldom_DM Spectraldom mode DM/ENM: valid length and num residues are 223 223 residues ->Polypeptide_chain_contacts_finder... ...covalent_contacts num contacts 445 ...non_covalent_contacts contacts with distance threshold 10.00 is 4148 ->ENM::convert_contact_maps_to_triplets_ijstiff... ...stiffness CT_COV=10.00 : processing 445 contacts ...stiffness CT_NCOV=1.00 : processing 4148 contacts ...ENM check: num Calphas is 223 ...ENM::fill_DM_weights_matrix_from_residue_contacts in stiffness mode Stiffness_mode_max, matrix size 223 x 223 End compute_graph with 223 vertices and 4148 edges Spectral_clustering::compute_least_k_eigenvals ...Spectraldom: running k_means for k = 2 3 4 5 6 7 8 9 10 ...Spectraldom: dumping scores /home/fcazals/projects/proj-soft/sbl/Applications/Spectral_domain_explorer/demos/Spectral_domain_explorer/1ggg-1wdn/scores-best.csv ...Spectraldom: dumping scores /home/fcazals/projects/proj-soft/sbl/Applications/Spectral_domain_explorer/demos/Spectral_domain_explorer/1ggg-1wdn/scores-repeats.csv ---Spectraldom done ---Spectraldom done
/usr/local/lib64/python3.11/site-packages/Bio/PDB/StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain A is discontinuous at line 2187. warnings.warn(
['/user/fcazals/home/projects/proj-soft/sbl-install/bin/sbl-spectraldom.exe', '--kmin', '2', '--kmax', '10', '--dmax-contacts', '10', '--mode', 'DM', '--filename', 'data/DB-molecular-motions/1wdn.pdb', '--load-chains', 'A', '--stiffness-cov', '10', '--stiffness-ncov', '1', '--dzero', '4.5', '--output-dir', '1ggg-1wdn'] /user/fcazals/home/projects/proj-soft/sbl-install/bin/sbl-spectraldom.exe --kmin 2 --kmax 10 --dmax-contacts 10 --mode DM --filename data/DB-molecular-motions/1wdn.pdb --load-chains A --stiffness-cov 10 --stiffness-ncov 1 --dzero 4.5 --output-dir 1ggg-1wdn #i################################################################################ +++ Spectraldom-run.py analysis... #i################################################################################ K values for local maxima: [2, 5, 8] Best clusterings are: ['1ggg-1wdn/spectraldom-clustering-1ggg-A-k2.txt', '1ggg-1wdn/spectraldom-clustering-1ggg-A-k5.txt', '1ggg-1wdn/spectraldom-clustering-1ggg-A-k8.txt'] === Quality plot dumped: okular 1ggg-1wdn/1ggg-1wdn-quality-score-plot.pdf pdb-list here: [('data/DB-molecular-motions/1wdn.pdb', '1wdn', 'A')] data/DB-molecular-motions/1wdn.pdb pdbid:data/DB-molecular-motions/1wdn.pdb model idx:0 chain_id:A aa_in_primary_seq:226 aa_in_struct:223 resid_min:4 resid_max:226 resid_span:223 226 ; ADKKLVVATDTAFVPFEFKQGDLYVGFDVDLWAAIAKELKLDYELKPMDFSGIIPALQTKNVDLALAGITITDERKKAIDFSDGYYKSGLLVMVKANNNDVKSVKDLDGKVVAVKSGTGSVDYAKANIKTKDLRQFPNIDNAYMELGTNRADAVLHDTPNILYFIKTAGNGQFKAVGDSLEAQQYGIAFPKGSDELRDKVNGALKTLRENGTYNEIYKKWFGTEPK (dfam) Dumped 1ggg-1wdn/dfam-1wdn-A-k10.txt (dfam) Dumped 1ggg-1wdn/dfam-1wdn-A-k2.txt Dumped 1ggg-1wdn/domains-1wdn-A-k2.spec (dfam) Dumped 1ggg-1wdn/dfam-1wdn-A-k3.txt (dfam) Dumped 1ggg-1wdn/dfam-1wdn-A-k4.txt (dfam) Dumped 1ggg-1wdn/dfam-1wdn-A-k5.txt Dumped 1ggg-1wdn/domains-1wdn-A-k5.spec (dfam) Dumped 1ggg-1wdn/dfam-1wdn-A-k6.txt (dfam) Dumped 1ggg-1wdn/dfam-1wdn-A-k7.txt (dfam) Dumped 1ggg-1wdn/dfam-1wdn-A-k8.txt Dumped 1ggg-1wdn/domains-1wdn-A-k8.spec (dfam) Dumped 1ggg-1wdn/dfam-1wdn-A-k9.txt === (vmd) Commands for local maxima--in order of decreasing score: vmd -e 1ggg-1wdn/vmd-1wdn-A-k2.vmd vmd -e 1ggg-1wdn/vmd-1wdn-A-k5.vmd vmd -e 1ggg-1wdn/vmd-1wdn-A-k8.vmd --- Spectraldom-run.py analysis done === (rmsdcomb) Command to run the combined RMSD calculation: sbl-rmsd-flexible-proteins-kpax.exe -f 1wdn.pdb --domain-labels 1ggg-1wdn/domains-1wdn-A-k2.spec --load-chains A -f 1wdn.pdb --domain-labels 1ggg-1wdn/domains-1wdn-A-k5.spec --load-chains A -f 1wdn.pdb --domain-labels 1ggg-1wdn/domains-1wdn-A-k8.spec --load-chains A === (dfam) Command to run DFAM on pairs of consecutive local maxima sbl-cmp-clust-dfam.exe -f 1ggg-1wdn/dfam-1wdn-A-k10.txt -f 1ggg-1wdn/dfam-1wdn-A-k2.txt --diameter-constraint 2 --num-iterations 100 sbl-cmp-clust-dfam.exe -f 1ggg-1wdn/dfam-1wdn-A-k2.txt -f 1ggg-1wdn/dfam-1wdn-A-k3.txt --diameter-constraint 2 --num-iterations 100 sbl-cmp-clust-dfam.exe -f 1ggg-1wdn/dfam-1wdn-A-k3.txt -f 1ggg-1wdn/dfam-1wdn-A-k4.txt --diameter-constraint 2 --num-iterations 100 sbl-cmp-clust-dfam.exe -f 1ggg-1wdn/dfam-1wdn-A-k4.txt -f 1ggg-1wdn/dfam-1wdn-A-k5.txt --diameter-constraint 2 --num-iterations 100 sbl-cmp-clust-dfam.exe -f 1ggg-1wdn/dfam-1wdn-A-k5.txt -f 1ggg-1wdn/dfam-1wdn-A-k6.txt --diameter-constraint 2 --num-iterations 100 sbl-cmp-clust-dfam.exe -f 1ggg-1wdn/dfam-1wdn-A-k6.txt -f 1ggg-1wdn/dfam-1wdn-A-k7.txt --diameter-constraint 2 --num-iterations 100 sbl-cmp-clust-dfam.exe -f 1ggg-1wdn/dfam-1wdn-A-k7.txt -f 1ggg-1wdn/dfam-1wdn-A-k8.txt --diameter-constraint 2 --num-iterations 100 sbl-cmp-clust-dfam.exe -f 1ggg-1wdn/dfam-1wdn-A-k8.txt -f 1ggg-1wdn/dfam-1wdn-A-k9.txt --diameter-constraint 2 --num-iterations 100 Quality plot file: 1ggg-1wdn/1ggg-1wdn-quality-score-plot.pdf GPL Ghostscript 10.02.1 (2023-11-01) Copyright (C) 2023 Artifex Software, Inc. All rights reserved. This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY: see the file COPYING for details. Processing pages 1 through 1. Page 1
In [3]:
# Note that the previous python command executes the following -- parameters removed when default options are used # We re-execute the command and display the scores obtained cmd = "sbl-spectraldom.exe --mode DM --filename data/DB-molecular-motions/1wdn.pdb --load-chains A --output-dir 1ggg-1wdn" os.system(cmd) SBL_pytools.show_txt_file("scores-best.csv", "1ggg-1wdn")
Pdb files found : Collected data/DB-molecular-motions/1wdn.pdb A +++ Running spectraldom in mode Spectraldom_DM Spectraldom mode DM/ENM: valid length and num residues are 223 223 residues ->Polypeptide_chain_contacts_finder... ...covalent_contacts num contacts 445 ...non_covalent_contacts contacts with distance threshold 10.00 is 4148 ->ENM::convert_contact_maps_to_triplets_ijstiff... ...stiffness CT_COV=10.00 : processing 445 contacts ...stiffness CT_NCOV=1.00 : processing 4148 contacts ...ENM check: num Calphas is 223
...Total num of contacts irrespective of type is 4593
...ENM::fill_DM_weights_matrix_from_residue_contacts in stiffness mode Stiffness_mode_max, matrix size 223 x 223 End compute_graph with 223 vertices and 4148 edges Spectral_clustering::compute_least_k_eigenvals ...Spectraldom: running k_means for k = 2 3 4 5 6 7 8 9 10 ...Spectraldom: dumping scores /home/fcazals/projects/proj-soft/sbl/Applications/Spectral_domain_explorer/demos/Spectral_domain_explorer/1ggg-1wdn/scores-best.csv ...Spectraldom: dumping scores /home/fcazals/projects/proj-soft/sbl/Applications/Spectral_domain_explorer/demos/Spectral_domain_explorer/1ggg-1wdn/scores-repeats.csv ---Spectraldom done ---Spectraldom done ++Showing file 1ggg-1wdn/scores-best.csv k,distance_stdev,ave_ratio_distances_NN1_NN2,norm_ave_ratio_distances_NN1_NN2,BIC 2,0.193751,11.7061,3.64167,1771.54 3,0.290992,6.14169,3.11373,1767.03 4,0.428361,3.73126,2.26465,1627.85 5,0.404445,4.06242,2.71721,1802.58 6,0.426511,3.50872,2.47955,1884.71 7,0.44494,3.27514,2.39112,1955.24 8,0.44334,3.49367,2.60198,2082.8 9,0.474657,3.38504,2.56302,2073.6 10,0.468722,3.14924,2.41573,2218.93 --Done
II. Example for the Multiple Sequence Alignment mode¶
In [4]:
# 1. Align chains A of and show the resulting FASTA file pdb_align_exe = subprocess.run("which sbl-pdb-align.py", shell=True, stdout=subprocess.PIPE).stdout.decode('utf-8').strip() cmd = "%s %s -f data/DB-molecular-motions/1ggg.pdb data/DB-molecular-motions/1wdn.pdb --chains A A --output-dir 1ggg-1wdn" % (py, pdb_align_exe) print(cmd) os.system(cmd) SBL_pytools.show_txt_file( "1ggg-1wdn.fasta", "1ggg-1wdn") # 2. Run spectraldom cmd = f"{py} {spectraldom_run_exe} --msa-fpath 1ggg-1wdn/1ggg-1wdn.fasta --structures-fpath 1ggg-1wdn/1ggg-1wdn.txt --output-dir 1ggg-1wdn" print(cmd) os.system(cmd) qplot = SBL_pytools.find_file_in_output_directory("quality-score-plot.pdf", "1ggg-1wdn") print("Quality plot file:", qplot) SBL_pytools.show_pdf_file(qplot, 200)
/usr/bin/python3 /user/fcazals/home/projects/proj-soft/sbl/scripts/sbl-pdb-align.py -f data/DB-molecular-motions/1ggg.pdb data/DB-molecular-motions/1wdn.pdb --chains A A --output-dir 1ggg-1wdn pdb_id1ggg.pdb and chain A: Sequence has length: 226 pdb_id1wdn.pdb and chain A: Sequence has length: 226 +++Seq align with Bio.pairwise2 num alignments with scores: 1 [1164.0] Dumped txt file 1ggg-1wdn/1ggg-1wdn.txt Dumped fasta file 1ggg-1wdn/1ggg-1wdn.fasta +++Struct align, num residues chain_1:220 chain_2:222 intersection:220 lRMSD for the id alignment using 220 residues: 5.34 Dumped file data/DB-molecular-motions/1wdn-aligned.pdb PDB files: data/DB-molecular-motions/1ggg.pdb data/DB-molecular-motions/1wdn-aligned.pdb vmd -e 1ggg-1wdn/1ggg-A--1ggg-A.vmd ++Showing file 1ggg-1wdn/1ggg-1wdn.fasta >data/DB-molecular-motions/1ggg.pdb_A ADKKLVVATDTAFVPFEFKQGDLYVGFDVDLWAAIAKELKLDYELKPMDFSGIIPALQTKNVDLALAGITITDERKKAIDFSDGYYKSGLLVMVKANNNDVKSVKDLDGKVVAVKSGTGSVDYAKANIKTKDLRQFPNIDNAYMELGTNRADAVLHDTPNILYFIKTAGNGQFKAVGDSLEAQQYGIAFPKGSDELRDKVNGALKTLRENGTYNEIYKKWFGTEPK >data/DB-molecular-motions/1wdn.pdb_A ADKKLVVATDTAFVPFEFKQGDLYVGFDVDLWAAIAKELKLDYELKPMDFSGIIPALQTKNVDLALAGITITDERKKAIDFSDGYYKSGLLVMVKANNNDVKSVKDLDGKVVAVKSGTGSVDYAKANIKTKDLRQFPNIDNAYMELGTNRADAVLHDTPNILYFIKTAGNGQFKAVGDSLEAQQYGIAFPKGSDELRDKVNGALKTLRENGTYNEIYKKWFGTEPK --Done /usr/bin/python3 /user/fcazals/home/projects/proj-soft/sbl-install/bin/sbl-spectraldom-run.py --msa-fpath 1ggg-1wdn/1ggg-1wdn.fasta --structures-fpath 1ggg-1wdn/1ggg-1wdn.txt --output-dir 1ggg-1wdn Pdb files found : Collected data/DB-molecular-motions/1ggg.pdb A Collected data/DB-molecular-motions/1wdn.pdb A +++ Running spectraldom in mode Spectraldom_MSA Spectraldom mode MSA: valid length and alignment length are 220 226 residues Spectraldom::initialize_sparse_fluctuation_matrix upper triangular used/unused/total/n*n: 1281/22809/24090/48400 Distance threshold used for contacts dmax=10.00 MSA Calpha_dist upper triangular matrix of dim 220 and with with binom(n,2) = 1281 entries MSA fluctuations computed: 1281 pairs out of 220x220=48400 possible ones T_Spectraldom::compute_similarity_matrix_from_fluctuations_vai:: fluctuation mean/median are 0.48/0.25 End compute_graph with 220 vertices and 1281 edges Spectral_clustering::compute_least_k_eigenvals ...Spectraldom: running k_means for k = 2 3 4 5 6 7 8 9 10 ...Spectraldom: dumping scores /home/fcazals/projects/proj-soft/sbl/Applications/Spectral_domain_explorer/demos/Spectral_domain_explorer/1ggg-1wdn/scores-best.csv ...Spectraldom: dumping scores /home/fcazals/projects/proj-soft/sbl/Applications/Spectral_domain_explorer/demos/Spectral_domain_explorer/1ggg-1wdn/scores-repeats.csv ---Spectraldom done ---Spectraldom done
/usr/local/lib64/python3.11/site-packages/Bio/PDB/StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain A is discontinuous at line 3924. warnings.warn( /usr/local/lib64/python3.11/site-packages/Bio/PDB/StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain B is discontinuous at line 4004. warnings.warn( /usr/local/lib64/python3.11/site-packages/Bio/PDB/StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain A is discontinuous at line 2187. warnings.warn(
['/user/fcazals/home/projects/proj-soft/sbl-install/bin/sbl-spectraldom.exe', '--kmin', '2', '--kmax', '10', '--dmax-contacts', '10', '--mode', 'MSA', '--msa-fpath', '1ggg-1wdn/1ggg-1wdn.fasta', '--structures-fpath', '1ggg-1wdn/1ggg-1wdn.txt', '--output-dir', '1ggg-1wdn'] /user/fcazals/home/projects/proj-soft/sbl-install/bin/sbl-spectraldom.exe --kmin 2 --kmax 10 --dmax-contacts 10 --mode MSA --msa-fpath 1ggg-1wdn/1ggg-1wdn.fasta --structures-fpath 1ggg-1wdn/1ggg-1wdn.txt --output-dir 1ggg-1wdn #i################################################################################ +++ Spectraldom-run.py analysis... #i################################################################################ K values for local maxima: [2, 5, 9] Best clusterings are: ['1ggg-1wdn/spectraldom-clustering-1ggg-A-k2.txt', '1ggg-1wdn/spectraldom-clustering-1ggg-A-k5.txt', '1ggg-1wdn/spectraldom-clustering-1ggg-A-k9.txt'] === Quality plot dumped: okular 1ggg-1wdn/1ggg-1wdn-quality-score-plot.pdf pdb-list here: [('data/DB-molecular-motions/1ggg.pdb', '1ggg', 'A'), ('data/DB-molecular-motions/1wdn.pdb', '1wdn', 'A')] data/DB-molecular-motions/1ggg.pdb pdbid:data/DB-molecular-motions/1ggg.pdb model idx:0 chain_id:A aa_in_primary_seq:226 aa_in_struct:220 resid_min:5 resid_max:224 resid_span:220 226 ; ADKKLVVATDTAFVPFEFKQGDLYVGFDVDLWAAIAKELKLDYELKPMDFSGIIPALQTKNVDLALAGITITDERKKAIDFSDGYYKSGLLVMVKANNNDVKSVKDLDGKVVAVKSGTGSVDYAKANIKTKDLRQFPNIDNAYMELGTNRADAVLHDTPNILYFIKTAGNGQFKAVGDSLEAQQYGIAFPKGSDELRDKVNGALKTLRENGTYNEIYKKWFGTEPK pdbid:data/DB-molecular-motions/1ggg.pdb model idx:0 chain_id:B aa_in_primary_seq:226 aa_in_struct:220 resid_min:5 resid_max:224 resid_span:220 226 ; ADKKLVVATDTAFVPFEFKQGDLYVGFDVDLWAAIAKELKLDYELKPMDFSGIIPALQTKNVDLALAGITITDERKKAIDFSDGYYKSGLLVMVKANNNDVKSVKDLDGKVVAVKSGTGSVDYAKANIKTKDLRQFPNIDNAYMELGTNRADAVLHDTPNILYFIKTAGNGQFKAVGDSLEAQQYGIAFPKGSDELRDKVNGALKTLRENGTYNEIYKKWFGTEPK (dfam) Dumped 1ggg-1wdn/dfam-1ggg-A-k10.txt (dfam) Dumped 1ggg-1wdn/dfam-1ggg-A-k2.txt Dumped 1ggg-1wdn/domains-1ggg-A-k2.spec (dfam) Dumped 1ggg-1wdn/dfam-1ggg-A-k3.txt (dfam) Dumped 1ggg-1wdn/dfam-1ggg-A-k4.txt (dfam) Dumped 1ggg-1wdn/dfam-1ggg-A-k5.txt Dumped 1ggg-1wdn/domains-1ggg-A-k5.spec (dfam) Dumped 1ggg-1wdn/dfam-1ggg-A-k6.txt (dfam) Dumped 1ggg-1wdn/dfam-1ggg-A-k7.txt (dfam) Dumped 1ggg-1wdn/dfam-1ggg-A-k8.txt (dfam) Dumped 1ggg-1wdn/dfam-1ggg-A-k9.txt Dumped 1ggg-1wdn/domains-1ggg-A-k9.spec === (vmd) Commands for local maxima--in order of decreasing score: vmd -e 1ggg-1wdn/vmd-1ggg-A-k2.vmd vmd -e 1ggg-1wdn/vmd-1ggg-A-k5.vmd vmd -e 1ggg-1wdn/vmd-1ggg-A-k9.vmd data/DB-molecular-motions/1wdn.pdb pdbid:data/DB-molecular-motions/1wdn.pdb model idx:0 chain_id:A aa_in_primary_seq:226 aa_in_struct:223 resid_min:4 resid_max:226 resid_span:223 226 ; ADKKLVVATDTAFVPFEFKQGDLYVGFDVDLWAAIAKELKLDYELKPMDFSGIIPALQTKNVDLALAGITITDERKKAIDFSDGYYKSGLLVMVKANNNDVKSVKDLDGKVVAVKSGTGSVDYAKANIKTKDLRQFPNIDNAYMELGTNRADAVLHDTPNILYFIKTAGNGQFKAVGDSLEAQQYGIAFPKGSDELRDKVNGALKTLRENGTYNEIYKKWFGTEPK (dfam) Dumped 1ggg-1wdn/dfam-1wdn-A-k10.txt (dfam) Dumped 1ggg-1wdn/dfam-1wdn-A-k2.txt Dumped 1ggg-1wdn/domains-1wdn-A-k2.spec (dfam) Dumped 1ggg-1wdn/dfam-1wdn-A-k3.txt (dfam) Dumped 1ggg-1wdn/dfam-1wdn-A-k4.txt (dfam) Dumped 1ggg-1wdn/dfam-1wdn-A-k5.txt Dumped 1ggg-1wdn/domains-1wdn-A-k5.spec (dfam) Dumped 1ggg-1wdn/dfam-1wdn-A-k6.txt (dfam) Dumped 1ggg-1wdn/dfam-1wdn-A-k7.txt (dfam) Dumped 1ggg-1wdn/dfam-1wdn-A-k8.txt (dfam) Dumped 1ggg-1wdn/dfam-1wdn-A-k9.txt Dumped 1ggg-1wdn/domains-1wdn-A-k9.spec === (vmd) Commands for local maxima--in order of decreasing score: vmd -e 1ggg-1wdn/vmd-1wdn-A-k2.vmd vmd -e 1ggg-1wdn/vmd-1wdn-A-k5.vmd vmd -e 1ggg-1wdn/vmd-1wdn-A-k9.vmd --- Spectraldom-run.py analysis done === (rmsdcomb) Command to run the combined RMSD calculation: sbl-rmsd-flexible-proteins-kpax.exe -f 1ggg.pdb --domain-labels 1ggg-1wdn/domains-1ggg-A-k2.spec --load-chains A -f 1ggg.pdb --domain-labels 1ggg-1wdn/domains-1ggg-A-k5.spec --load-chains A -f 1ggg.pdb --domain-labels 1ggg-1wdn/domains-1ggg-A-k9.spec --load-chains A -f 1wdn.pdb --domain-labels 1ggg-1wdn/domains-1wdn-A-k2.spec --load-chains A -f 1wdn.pdb --domain-labels 1ggg-1wdn/domains-1wdn-A-k5.spec --load-chains A -f 1wdn.pdb --domain-labels 1ggg-1wdn/domains-1wdn-A-k9.spec --load-chains A === (dfam) Command to run DFAM on pairs of consecutive local maxima sbl-cmp-clust-dfam.exe -f 1ggg-1wdn/dfam-1wdn-A-k10.txt -f 1ggg-1wdn/dfam-1wdn-A-k2.txt --diameter-constraint 2 --num-iterations 100 sbl-cmp-clust-dfam.exe -f 1ggg-1wdn/dfam-1wdn-A-k2.txt -f 1ggg-1wdn/dfam-1wdn-A-k3.txt --diameter-constraint 2 --num-iterations 100 sbl-cmp-clust-dfam.exe -f 1ggg-1wdn/dfam-1wdn-A-k3.txt -f 1ggg-1wdn/dfam-1wdn-A-k4.txt --diameter-constraint 2 --num-iterations 100 sbl-cmp-clust-dfam.exe -f 1ggg-1wdn/dfam-1wdn-A-k4.txt -f 1ggg-1wdn/dfam-1wdn-A-k5.txt --diameter-constraint 2 --num-iterations 100 sbl-cmp-clust-dfam.exe -f 1ggg-1wdn/dfam-1wdn-A-k5.txt -f 1ggg-1wdn/dfam-1wdn-A-k6.txt --diameter-constraint 2 --num-iterations 100 sbl-cmp-clust-dfam.exe -f 1ggg-1wdn/dfam-1wdn-A-k6.txt -f 1ggg-1wdn/dfam-1wdn-A-k7.txt --diameter-constraint 2 --num-iterations 100 sbl-cmp-clust-dfam.exe -f 1ggg-1wdn/dfam-1wdn-A-k7.txt -f 1ggg-1wdn/dfam-1wdn-A-k8.txt --diameter-constraint 2 --num-iterations 100 sbl-cmp-clust-dfam.exe -f 1ggg-1wdn/dfam-1wdn-A-k8.txt -f 1ggg-1wdn/dfam-1wdn-A-k9.txt --diameter-constraint 2 --num-iterations 100 Quality plot file: 1ggg-1wdn/1ggg-1wdn-quality-score-plot.pdf GPL Ghostscript 10.02.1 (2023-11-01) Copyright (C) 2023 Artifex Software, Inc. All rights reserved. This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY: see the file COPYING for details. Processing pages 1 through 1. Page 1
Iterative alignment with Kpax: we align the same two chains with the flexible aligner Kpax¶
In [5]:
cmd = "sbl-kpax-iterative-alignment.exe -f data/DB-molecular-motions/1ggg.pdb -f data/DB-molecular-motions/1wdn.pdb --load-chains A --load-chains A -v 2" os.system(cmd)
Running: sbl-kpax-iterative-alignment.exe -f data/DB-molecular-motions/1ggg.pdb -f data/DB-molecular-motions/1wdn.pdb --load-chains A --load-chains A -v 2 Protein representation loader Statistics: Number of loaded PDB files: 2 Details for each file: -- file 1 data/DB-molecular-motions/1ggg.pdb: -- -- File exists: Yes -- -- File is valid: Yes -- -- Number of models: 1 -- -- Number of loaded atoms (of model with lowest index): 1714 -- -- Number of loaded SS bonds: 0 -- -- Number of discarded atoms based on model id (all models): 0 -- -- Number of discarded atoms based on chain id (all models): 1793 -- -- Number of discarded atoms based on b factor beeing too high (all models): 0 -- -- Number of discarded waters (all models): 159 -- -- Number of discarded hetatoms (all models): 159 -- -- Number of discarded hydrogens (all models): 0 -- -- Number of discarded alternate location atoms (all models): 0 -- -- Number of actually discarded atoms (all models): 1873 -- file 2 data/DB-molecular-motions/1wdn.pdb: -- -- File exists: Yes -- -- File is valid: Yes -- -- Number of models: 1 -- -- Number of loaded atoms (of model with lowest index): 1740 -- -- Number of loaded SS bonds: 0 -- -- Number of discarded atoms based on model id (all models): 0 -- -- Number of discarded atoms based on chain id (all models): 9 -- -- Number of discarded atoms based on b factor beeing too high (all models): 0 -- -- Number of discarded waters (all models): 122 -- -- Number of discarded hetatoms (all models): 122 -- -- Number of discarded hydrogens (all models): 0 -- -- Number of discarded alternate location atoms (all models): 0 -- -- Number of actually discarded atoms (all models): 131 Covalent Structure File Loader statistics: -- -- Number of molecular covalent structures: 2 -- Molecular covalent structure 0: -- -- Number of particles: 7150 -- -- Number of embedded particles: 1714 -- -- Number of bonds: 7212 -- -- Number of embedded bonds: 1744 -- -- Number of disulfide bonds: 0 -- Molecular covalent structure 1: -- -- Number of particles: 3593 -- -- Number of embedded particles: 1740 -- -- Number of bonds: 3623 -- -- Number of embedded bonds: 1771 -- -- Number of disulfide bonds: 0 Protein Representation Loader statistics: Number of loaded proteins: 2 Details for each protein : -- structure 1: -- -- Number of chains: 1 -- -- Number of residues: 220 -- -- Number of particles: 3575 -- -- Number of embedded particles: 1714 -- -- Number of bonds: 3606 -- -- Number of embedded bonds: 1744 -- structure 2: -- -- Number of chains: 1 -- -- Number of residues: 223 -- -- Number of particles: 3575 -- -- Number of embedded particles: 1740 -- -- Number of bonds: 3606 -- -- Number of embedded bonds: 1771 Iterative alignment Starting module SBL::Modules::T_Alignment_structures_module...
Seeder_DP_score:alignment size:221
Statistics: Alignment length: 164 Alignment score : 74.435674 Alignment identity percentage: 0.378049 Alignment similarity percentage: 0.445122 Alignment lRMSD : 3.275461 Alignment dRMSD : 2.395408 Report... Ending module SBL::Modules::T_Alignment_structures_module... End Run General Statistics: Times elapsed for computations (in seconds): -- Iterative alignment: 2.172181 Total: 2.172181
Out[5]:

0
Combined lRMSD calculation: we use the domains identified to compute a combined RMSD. In comparing the obtained value against that of a flexible aligner such as Kpax, keep in mind that the number of residues¶
taken into account is in general much larger¶
In [6]:
cmd="sbl-rmsd-flexible-proteins-kpax.exe -f data/DB-molecular-motions/1ggg.pdb --domain-labels 1ggg-1wdn/domains-1ggg-A-k2.spec --load-chains A -f data/DB-molecular-motions/1wdn.pdb --domain-labels 1ggg-1wdn/domains-1wdn-A-k2.spec --load-chains A --directory-output 1ggg-1wdn" os.system(cmd) SBL_pytools.show_txt_file("rmsd-flexible-proteins-kpax__labels_lrmsd.txt", "1ggg-1wdn")
++Showing file 1ggg-1wdn/sbl-rmsd-flexible-proteins-kpax__labels_lrmsd.txt Label comparison between chain A of and chain A of -- one line: label lRMSD #amino acids Dom-0 3.809 124 Dom-1 4.062 96 Weighted lRMSD 3.921 involving a total of 220 amino acids --Done
III. Reducing fragmentation¶

We show how to reduce fragmentation using sbl-spectraldom-defrag.py
In [7]:
defrag_exe = subprocess.run("which sbl-spectraldom-defrag.py", shell=True, stdout=subprocess.PIPE).stdout.decode('utf-8').strip() cmd = f"%s %s -r 1ggg-1wdn -s data/DB-molecular-motions/1wdn.pdb -c A" % (py, defrag_exe) os.system(cmd)
Peaks: [2, 5] sbl-cmp-clust-dfam.exe -f 1ggg-1wdn/dfam-1wdn-A-k2.txt -f 1ggg-1wdn/dfam-1wdn-A-k5.txt --diameter-constraint 2 --num-iterations 100 WARNING: VoI modified !! Now it's normalized by 2log(K) Postprocessing_Metaclusters_VoI -0.0 data/DB-molecular-motions/1wdn.pdb pdbid:data/DB-molecular-motions/1wdn.pdb model idx:0 chain_id:A aa_in_primary_seq:226 aa_in_struct:223 resid_min:4 resid_max:226 resid_span:223 226 ; ADKKLVVATDTAFVPFEFKQGDLYVGFDVDLWAAIAKELKLDYELKPMDFSGIIPALQTKNVDLALAGITITDERKKAIDFSDGYYKSGLLVMVKANNNDVKSVKDLDGKVVAVKSGTGSVDYAKANIKTKDLRQFPNIDNAYMELGTNRADAVLHDTPNILYFIKTAGNGQFKAVGDSLEAQQYGIAFPKGSDELRDKVNGALKTLRENGTYNEIYKKWFGTEPK (dfam) Dumped 1ggg-1wdn/dfam-1wdn-A-k0.txt === (vmd) Commands for local maxima--in order of decreasing score:
/usr/local/lib64/python3.11/site-packages/Bio/PDB/StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain A is discontinuous at line 2187. warnings.warn(
Out[7]:

0

Table of Contents

Spectral_domain_explorer

Introduction: from Spectrus to Spectraldom

Using Spectraldom

Single structure: modes ENM and DM

Algorithm

Building contact maps

Building the similarity matrix

Spectral clustering for a fixed number k of clusters/domains

Optimal number of clusters

Hyper-parameters: comments

Implementation

Main classes

Programs and scripts provided

Jupyter demo

Jupyter notebook for the SBL package Spectral_domain_explorer¶

I. Example for the diffusion map (DM) mode¶

II. Example for the Multiple Sequence Alignment mode¶

Iterative alignment with Kpax: we align the same two chains with the flexible aligner Kpax¶

Combined lRMSD calculation: we use the domains identified to compute a combined RMSD. In comparing the obtained value against that of a flexible aligner such as Kpax, keep in mind that the number of residues¶

taken into account is in general much larger¶

III. Reducing fragmentation¶