Structural Bioinformatics Library
Template C++ / Python API for developping structural bioinformatics applications.
|
Authors: F. Cazals and J. Herrmann and E. Sarti
Defining domains within a polypeptide chain or protein amounts to maximizing intra-domain interactions while minimizing inter-domain interactions. This problem is closely related to graph partitioning techniques, and in particular spectral clustering [171].
The algorithm from [145] proposes a solution to this problem using spectral partitioning applied to a coarse-grain model involving carbons. Consider pairs of residues whose distance (the distance between the carbons) is less than some threshold . For these residues, assume one is able to compute a pairwise fluctuation
akin to the standard deviation of their mutual distance:
This fluctuation is turned into a similarity , and the corresponding matrix is passed to spectral clustering to infer a predefined number of domains/clusters – a number which is subsequently optimized.
This package implements [48], which simplifies and extends the original algorithm, offering three modes:
Background. We first provide some key elements.
proposes the aforementioned three modes (ENM, DM, MSA), which can be ascribed to two major modes: single structure (ENM, DM), and multiple structures (MSA).
Modes ENM and DM require a PDB/mmCIF file and the specification of a chain. These two modes use stiffness constants to encode pairwise contacts between residues. The default values are as follows:
Note that two amino acids may interact in different ways. For example, two a.a. on the same generatrix of a -helix interact non-covalently, but also via hydrogen bonding. We therefore define two modes to define the weight of two residues with multiple interaction types, the (resp. ) mode in which we take the max (resp. the sum) of individual stiffnesses. In our helix example, one gets , and respectively.
Mode MSA requires a multiple sequence alignment in FASTA format, as well as a file giving access to the structures processed. A position in the MSA is termed valid provided that no dash is present in that column, and all residues have been resolved in that structure. See below for details.
For a fixed value of , a score is compute–see details below. This score is an assessment of the corresponding clustering with respect to a null model. The decomposition yielded by is not deterministic since the with random initialization is internally used [12]. The stability assessment is obtained via a quality plot , namely the boxplot obtained for a fixed number of runs/repeats of .
The main options of are:
NB: run to get all options.
Output. We noted above that internally resorts to runs of . This said, the output is as follows:
As illustrated in the notebook below, each line contains 5 numbers:
k,distance_stdev,ave_ratio_distances_NN1_NN2,norm_ave_ratio_distances_NN1_NN2,BIC
We now review the main steps of the algorithm, see [145] and [48]. The user is also referred to Section Implementation for more details on the C++ classes involved.
We record contacts between pairs of residues, identified by their sequence numbers.
We now explain how the weights of the similarity matrix used for spectral clustering are computed.
Mode Spectraldom_ENM. Following [13], an elastic network model (ENM) is built from covalent and non covalent contacts. The corresponding normal modes are computed, from which distance fluctuations for pairs of residues are obtained [13]. Denote the fluctuation stdev over all pairs. The fluctuation is turned into the similarity as follows:
Mode Spectraldom_DM. As noted above, each pair of neighboring / interacting residues is assigned one or several stiffness values. In case of duplicate values for two residues, the max value is used – mode . To alleviate the incidence of the distance threshold used to identify neighboring residues, each stiffness constant is softened as follows–with a reference distance:
We take as typical distance between consecutive s along the backbone, and .
The corresponding matrix is directly used by the spectral clustering algorithm.
Mode Spectraldom_MSA. The multiple sequence alignment (MSA) mode handles the case of molecules with similar structures and disimilar sequences. Given a MSA, the atomic distance fluctuation of Eq. eq-fluct is computed from the distances observed in the structures.
The similarity matrix is used to perform spectral clustering, see [135] and [171] .
Consider the edge weighted undirected graph whose vertices are the individual residues, and whose edge weights are the . Define the degree of a vertex as the sum of similarities for incident edges, and let be the corresponding diagonal matrix.
Consider the usual unnormalized Laplacian . The clustering method uses the eigenvectors associated with the lowest eigenvalues of the symmetric Laplacian .
Then, one clusters the normalized rows of the matrix of eigenvectors truncated to their first coordinates. This clustering step on the unit sphere yields clusters.
Fixed value of k. For a fixed value of , we assess the quality of the clustering using the fitness score used in [145] .
Assume there are residues. Spectral clustering requires clustering unit vectors on the sphere . Recall that in or k-medoids, one assigns a data point to the nearest center. To assess whether this assignment is stable, one computes the ratio of distances to the first and second nearest centers. (When clusters are well separated, this ratio should be small.) This average ratio for points is denoted . To compare this value to a null model, let be the same computed for random points – these reference values are called prefactors in the sequel. The fitness score retained for that value of reads as
Repeats. For a fixed value of k, we perform repeats, and retain the clustering with the best fitness score.
Range for k and repeats. Consider now a range of values for . The previous step is iterated to optimize the number of domains in this range. Moreover, for each , a default number of repeats clusterings is performed and the best is retained for that value of .
Quality plot. The quality plot is the box plot of the score Eq. eq-spectrusscore.
The following comments are in order regarding all parameters used:
The implementation requires a diverse set of pieces, which we now briefly present.
Filters for residues in structures. The class SBL::CSB::T_Chains_residues_contiguous_indexer is used to record the residues present in the structures associated with the sequences of the MSA. Contiguous indices for resids are the contiguous integers which reference the resids, irrespective of gaps.
Multiple sequence alignments. Consider a MSA of length , and assume each sequence has length . We recall selected notions from the package Alignment_engines.
In order for atomic fluctuations to be computed using the same number of pairs (i.e. the binomial coefficient ), we define:
The class SBL::CSB::T_Multiple_sequence_alignment_DS is the data structure to handle MSA; it provides in particular the following two mappings:
The previous class overlooks the presence of residues in the structures associated to the sequences of the MSI.
The class SBL::CSB::T_Multiple_sequence_alignment_DS_filtered has the same interface as SBL::CSB::T_Multiple_sequence_alignment_DS, except that it implements an extra check based on residues associated with a MSA position:
In practice, this latter check is provided by the class SBL::CSB::T_Chains_residues_contiguous_indexer. (Recall that this class contains a map from a residue sequence number (minus 1) to contiguous indices.)
Representation of polypeptide chains. The class SBL::CSB::T_Polypeptide_chain_representation from the package Protein_representation is used to access all pieces of information on polypeptide chains (covalent neighbors, atomic coordinates).
Contacts. Given a polypeptide chain, consider the following classes of contacts, defined as interacting residues
The class SBL::CSB::T_Polypeptide_chain_contacts_finder finds all such contacts, and provides them as pairs of residue sequence numbers –in (min, max) order. The search of non covalent contacts uses the spatial search engine from CGAL, based on Kd trees.
Instantiating this class uses the local class SBL::CSB::T_Point_xyz_with_atom , which contains the point used for spatial search (CGAL), and also knows the atom from the polypeptide chain representation.
Elatic Network Model and pairwise interactions. Given one polypeptide chain, the class SBL::CSB::T_Elastic_network_model implements the ENM model from [13] , providing in particular a calculation of anisotropic fluctuations associated with normal modes.
As usual, the structure processed may have missing residues at the C-ter and/or N-ter, and also within the structure. Also, its residue sequence numbers may start with negative values. On the other hand, indices in Eigen start at 0. Therefore, residue sequence numbers are converted to contiguous indices using the class SBL::CSB::T_Chains_residues_contiguous_indexer.
An ENM model can be seen as a graph whose vertices are residues, connected by edges.
The s are collected and inserted into a vector indexed by contiguous indices.
Contacts are sought between residue identified by their sequence numbers. These contacts are also converted to contiguous indices which can be used in Eigen matrices – Eigen indices start at 0.
Linear algebra operations are carried out using the Eigen library.
Clustering and spectral clustering. Spectral clustering is implemented in the class SBL::GT::T_Spectral_clustering, which is parameterized by a traits class provide an edge weighted boost graph. The weight associated with an edge is used to define the Laplacian and weighted Laplacians of this graph [171] .
The class SBL::GT::T_Spectral_clustering internally uses the generic version of which handles points on the unit sphere , via the class SBL::GT::T_Cluster_engine_k_means .
Prefactors for kmeans. Prefactors are the denominators involved in Eq. eq-spectrusscore . They are simply computed by running for random points on the unit sphere , for various values of . In practice, a table is stored in a cpp file at compile time, see below.
Spectraldom. We finally arrive at the class T_Spectraldom which implements the aforementioned three modes. Given that the DM and ENM modes use a single structure but not MSA, we use the following design pattern to handle the three modes coherently and comply with Defs. def-valid-msa-bis and def-valid-msa-struct-bis :
The Spectraldom traits class requires two types:
Back to structures from clusters. Let be the number of valid indices – Def. def-valid-msa-struct-bis. Spectral clustering returns clusters containing indices in the range . One obtains the residue_sequence_numbers using the reverse maps provided by the class SBL::CSB::T_Multiple_sequence_alignment_DS .
Main executable: : the main program implementing , as described above.
Main script: sbl-spectraldom-run.py: a script calling , and also providing a number of post-processing steps, including (i) the generation of the quality plot (ii) the generation of visualization files for VMD, and (iii) the generation of spec files to compute using the package Molecular_distances_flexible .
Reducing fragmentation: sbl-spectraldom-defrag.py: a script used to remove fragmentation of the domains identified. See paper [48] for details.
Handling prefactors. To compute and manipulate prefactors involved in the scores , the package provides
See the following jupyter demos:
This notebook shows how to lauch the commands and comments on the outpout. It shows the quality plot obtained, but does not perform an interactive visualization of the structures -- which should be done with VMD or pymol. To run the notebook, a number of executables / scripts from the SBL must be available from one's PATH:
To find domains:
To perform flexible alignments
To compare the domains obtained against structural similarities identified by the flexible aligner Kpax:
To reduce the fragmentation of domains using the D-family-matching algorithm
import os, subprocess
from SBL_pytools import SBL_pytools
py = subprocess.run("which python3", shell=True, stdout=subprocess.PIPE).stdout.decode('utf-8').strip()
spectraldom_run_exe = subprocess.run("which sbl-spectraldom-run.py", shell=True, stdout=subprocess.PIPE).stdout.decode('utf-8').strip()
# We use sbl-spectraldom-run.py to run sbl-spectraldom.exe on chain A of 1ggg.pdb
# NB: the script dumps the command invoking sbl-spectraldom.exe
cmd = "%s %s --pdb-fpath data/DB-molecular-motions/1wdn.pdb --load-chains A --output-dir 1ggg-1wdn" % (py, spectraldom_run_exe)
#cmd = "sbl-spectraldom-run.py --pdb-fpath data/DB-molecular-motions/1wdn.pdb --load-chains A --output-dir 1ggg-1wdn"
print(cmd)
os.system(cmd)
ofile = SBL_pytools.find_file_in_output_directory("quality-score-plot.pdf", "1ggg-1wdn")
print("Quality plot file:", ofile)
SBL_pytools.show_pdf_file(ofile, 200)
/usr/bin/python3 /user/fcazals/home/projects/proj-soft/sbl-install/bin/sbl-spectraldom-run.py --pdb-fpath data/DB-molecular-motions/1wdn.pdb --load-chains A --output-dir 1ggg-1wdn Pdb files found : Collected data/DB-molecular-motions/1wdn.pdb A +++ Running spectraldom in mode Spectraldom_DM Spectraldom mode DM/ENM: valid length and num residues are 223 223 residues ->Polypeptide_chain_contacts_finder... ...covalent_contacts num contacts 445 ...non_covalent_contacts contacts with distance threshold 10.00 is 4148 ->ENM::convert_contact_maps_to_triplets_ijstiff... ...stiffness CT_COV=10.00 : processing 445 contacts ...stiffness CT_NCOV=1.00 : processing 4148 contacts ...ENM check: num Calphas is 223 ...ENM::fill_DM_weights_matrix_from_residue_contacts in stiffness mode Stiffness_mode_max, matrix size 223 x 223 End compute_graph with 223 vertices and 4148 edges Spectral_clustering::compute_least_k_eigenvals ...Spectraldom: running k_means for k = 2 3 4 5 6 7 8 9 10 ...Spectraldom: dumping scores /home/fcazals/projects/proj-soft/sbl/Applications/Spectral_domain_explorer/demos/Spectral_domain_explorer/1ggg-1wdn/scores-best.csv ...Spectraldom: dumping scores /home/fcazals/projects/proj-soft/sbl/Applications/Spectral_domain_explorer/demos/Spectral_domain_explorer/1ggg-1wdn/scores-repeats.csv ---Spectraldom done ---Spectraldom done
/usr/local/lib64/python3.11/site-packages/Bio/PDB/StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain A is discontinuous at line 2187. warnings.warn(
['/user/fcazals/home/projects/proj-soft/sbl-install/bin/sbl-spectraldom.exe', '--kmin', '2', '--kmax', '10', '--dmax-contacts', '10', '--mode', 'DM', '--filename', 'data/DB-molecular-motions/1wdn.pdb', '--load-chains', 'A', '--stiffness-cov', '10', '--stiffness-ncov', '1', '--dzero', '4.5', '--output-dir', '1ggg-1wdn'] /user/fcazals/home/projects/proj-soft/sbl-install/bin/sbl-spectraldom.exe --kmin 2 --kmax 10 --dmax-contacts 10 --mode DM --filename data/DB-molecular-motions/1wdn.pdb --load-chains A --stiffness-cov 10 --stiffness-ncov 1 --dzero 4.5 --output-dir 1ggg-1wdn #i################################################################################ +++ Spectraldom-run.py analysis... #i################################################################################ K values for local maxima: [2, 5, 8] Best clusterings are: ['1ggg-1wdn/spectraldom-clustering-1ggg-A-k2.txt', '1ggg-1wdn/spectraldom-clustering-1ggg-A-k5.txt', '1ggg-1wdn/spectraldom-clustering-1ggg-A-k8.txt'] === Quality plot dumped: okular 1ggg-1wdn/1ggg-1wdn-quality-score-plot.pdf pdb-list here: [('data/DB-molecular-motions/1wdn.pdb', '1wdn', 'A')] data/DB-molecular-motions/1wdn.pdb pdbid:data/DB-molecular-motions/1wdn.pdb model idx:0 chain_id:A aa_in_primary_seq:226 aa_in_struct:223 resid_min:4 resid_max:226 resid_span:223 226 ; ADKKLVVATDTAFVPFEFKQGDLYVGFDVDLWAAIAKELKLDYELKPMDFSGIIPALQTKNVDLALAGITITDERKKAIDFSDGYYKSGLLVMVKANNNDVKSVKDLDGKVVAVKSGTGSVDYAKANIKTKDLRQFPNIDNAYMELGTNRADAVLHDTPNILYFIKTAGNGQFKAVGDSLEAQQYGIAFPKGSDELRDKVNGALKTLRENGTYNEIYKKWFGTEPK (dfam) Dumped 1ggg-1wdn/dfam-1wdn-A-k10.txt (dfam) Dumped 1ggg-1wdn/dfam-1wdn-A-k2.txt Dumped 1ggg-1wdn/domains-1wdn-A-k2.spec (dfam) Dumped 1ggg-1wdn/dfam-1wdn-A-k3.txt (dfam) Dumped 1ggg-1wdn/dfam-1wdn-A-k4.txt (dfam) Dumped 1ggg-1wdn/dfam-1wdn-A-k5.txt Dumped 1ggg-1wdn/domains-1wdn-A-k5.spec (dfam) Dumped 1ggg-1wdn/dfam-1wdn-A-k6.txt (dfam) Dumped 1ggg-1wdn/dfam-1wdn-A-k7.txt (dfam) Dumped 1ggg-1wdn/dfam-1wdn-A-k8.txt Dumped 1ggg-1wdn/domains-1wdn-A-k8.spec (dfam) Dumped 1ggg-1wdn/dfam-1wdn-A-k9.txt === (vmd) Commands for local maxima--in order of decreasing score: vmd -e 1ggg-1wdn/vmd-1wdn-A-k2.vmd vmd -e 1ggg-1wdn/vmd-1wdn-A-k5.vmd vmd -e 1ggg-1wdn/vmd-1wdn-A-k8.vmd --- Spectraldom-run.py analysis done === (rmsdcomb) Command to run the combined RMSD calculation: sbl-rmsd-flexible-proteins-kpax.exe -f 1wdn.pdb --domain-labels 1ggg-1wdn/domains-1wdn-A-k2.spec --load-chains A -f 1wdn.pdb --domain-labels 1ggg-1wdn/domains-1wdn-A-k5.spec --load-chains A -f 1wdn.pdb --domain-labels 1ggg-1wdn/domains-1wdn-A-k8.spec --load-chains A === (dfam) Command to run DFAM on pairs of consecutive local maxima sbl-cmp-clust-dfam.exe -f 1ggg-1wdn/dfam-1wdn-A-k10.txt -f 1ggg-1wdn/dfam-1wdn-A-k2.txt --diameter-constraint 2 --num-iterations 100 sbl-cmp-clust-dfam.exe -f 1ggg-1wdn/dfam-1wdn-A-k2.txt -f 1ggg-1wdn/dfam-1wdn-A-k3.txt --diameter-constraint 2 --num-iterations 100 sbl-cmp-clust-dfam.exe -f 1ggg-1wdn/dfam-1wdn-A-k3.txt -f 1ggg-1wdn/dfam-1wdn-A-k4.txt --diameter-constraint 2 --num-iterations 100 sbl-cmp-clust-dfam.exe -f 1ggg-1wdn/dfam-1wdn-A-k4.txt -f 1ggg-1wdn/dfam-1wdn-A-k5.txt --diameter-constraint 2 --num-iterations 100 sbl-cmp-clust-dfam.exe -f 1ggg-1wdn/dfam-1wdn-A-k5.txt -f 1ggg-1wdn/dfam-1wdn-A-k6.txt --diameter-constraint 2 --num-iterations 100 sbl-cmp-clust-dfam.exe -f 1ggg-1wdn/dfam-1wdn-A-k6.txt -f 1ggg-1wdn/dfam-1wdn-A-k7.txt --diameter-constraint 2 --num-iterations 100 sbl-cmp-clust-dfam.exe -f 1ggg-1wdn/dfam-1wdn-A-k7.txt -f 1ggg-1wdn/dfam-1wdn-A-k8.txt --diameter-constraint 2 --num-iterations 100 sbl-cmp-clust-dfam.exe -f 1ggg-1wdn/dfam-1wdn-A-k8.txt -f 1ggg-1wdn/dfam-1wdn-A-k9.txt --diameter-constraint 2 --num-iterations 100 Quality plot file: 1ggg-1wdn/1ggg-1wdn-quality-score-plot.pdf GPL Ghostscript 10.02.1 (2023-11-01) Copyright (C) 2023 Artifex Software, Inc. All rights reserved. This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY: see the file COPYING for details. Processing pages 1 through 1. Page 1
# Note that the previous python command executes the following -- parameters removed when default options are used
# We re-execute the command and display the scores obtained
cmd = "sbl-spectraldom.exe --mode DM --filename data/DB-molecular-motions/1wdn.pdb --load-chains A --output-dir 1ggg-1wdn"
os.system(cmd)
SBL_pytools.show_txt_file("scores-best.csv", "1ggg-1wdn")
Pdb files found : Collected data/DB-molecular-motions/1wdn.pdb A +++ Running spectraldom in mode Spectraldom_DM Spectraldom mode DM/ENM: valid length and num residues are 223 223 residues ->Polypeptide_chain_contacts_finder... ...covalent_contacts num contacts 445 ...non_covalent_contacts contacts with distance threshold 10.00 is 4148 ->ENM::convert_contact_maps_to_triplets_ijstiff... ...stiffness CT_COV=10.00 : processing 445 contacts ...stiffness CT_NCOV=1.00 : processing 4148 contacts ...ENM check: num Calphas is 223
...Total num of contacts irrespective of type is 4593
...ENM::fill_DM_weights_matrix_from_residue_contacts in stiffness mode Stiffness_mode_max, matrix size 223 x 223 End compute_graph with 223 vertices and 4148 edges Spectral_clustering::compute_least_k_eigenvals ...Spectraldom: running k_means for k = 2 3 4 5 6 7 8 9 10 ...Spectraldom: dumping scores /home/fcazals/projects/proj-soft/sbl/Applications/Spectral_domain_explorer/demos/Spectral_domain_explorer/1ggg-1wdn/scores-best.csv ...Spectraldom: dumping scores /home/fcazals/projects/proj-soft/sbl/Applications/Spectral_domain_explorer/demos/Spectral_domain_explorer/1ggg-1wdn/scores-repeats.csv ---Spectraldom done ---Spectraldom done ++Showing file 1ggg-1wdn/scores-best.csv k,distance_stdev,ave_ratio_distances_NN1_NN2,norm_ave_ratio_distances_NN1_NN2,BIC 2,0.193751,11.7061,3.64167,1771.54 3,0.290992,6.14169,3.11373,1767.03 4,0.428361,3.73126,2.26465,1627.85 5,0.404445,4.06242,2.71721,1802.58 6,0.426511,3.50872,2.47955,1884.71 7,0.44494,3.27514,2.39112,1955.24 8,0.44334,3.49367,2.60198,2082.8 9,0.474657,3.38504,2.56302,2073.6 10,0.468722,3.14924,2.41573,2218.93 --Done
# 1. Align chains A of and show the resulting FASTA file
pdb_align_exe = subprocess.run("which sbl-pdb-align.py", shell=True, stdout=subprocess.PIPE).stdout.decode('utf-8').strip()
cmd = "%s %s -f data/DB-molecular-motions/1ggg.pdb data/DB-molecular-motions/1wdn.pdb --chains A A --output-dir 1ggg-1wdn" % (py, pdb_align_exe)
print(cmd)
os.system(cmd)
SBL_pytools.show_txt_file( "1ggg-1wdn.fasta", "1ggg-1wdn")
# 2. Run spectraldom
cmd = f"{py} {spectraldom_run_exe} --msa-fpath 1ggg-1wdn/1ggg-1wdn.fasta --structures-fpath 1ggg-1wdn/1ggg-1wdn.txt --output-dir 1ggg-1wdn"
print(cmd)
os.system(cmd)
qplot = SBL_pytools.find_file_in_output_directory("quality-score-plot.pdf", "1ggg-1wdn")
print("Quality plot file:", qplot)
SBL_pytools.show_pdf_file(qplot, 200)
/usr/bin/python3 /user/fcazals/home/projects/proj-soft/sbl/scripts/sbl-pdb-align.py -f data/DB-molecular-motions/1ggg.pdb data/DB-molecular-motions/1wdn.pdb --chains A A --output-dir 1ggg-1wdn pdb_id1ggg.pdb and chain A: Sequence has length: 226 pdb_id1wdn.pdb and chain A: Sequence has length: 226 +++Seq align with Bio.pairwise2 num alignments with scores: 1 [1164.0] Dumped txt file 1ggg-1wdn/1ggg-1wdn.txt Dumped fasta file 1ggg-1wdn/1ggg-1wdn.fasta +++Struct align, num residues chain_1:220 chain_2:222 intersection:220 lRMSD for the id alignment using 220 residues: 5.34 Dumped file data/DB-molecular-motions/1wdn-aligned.pdb PDB files: data/DB-molecular-motions/1ggg.pdb data/DB-molecular-motions/1wdn-aligned.pdb vmd -e 1ggg-1wdn/1ggg-A--1ggg-A.vmd ++Showing file 1ggg-1wdn/1ggg-1wdn.fasta >data/DB-molecular-motions/1ggg.pdb_A ADKKLVVATDTAFVPFEFKQGDLYVGFDVDLWAAIAKELKLDYELKPMDFSGIIPALQTKNVDLALAGITITDERKKAIDFSDGYYKSGLLVMVKANNNDVKSVKDLDGKVVAVKSGTGSVDYAKANIKTKDLRQFPNIDNAYMELGTNRADAVLHDTPNILYFIKTAGNGQFKAVGDSLEAQQYGIAFPKGSDELRDKVNGALKTLRENGTYNEIYKKWFGTEPK >data/DB-molecular-motions/1wdn.pdb_A ADKKLVVATDTAFVPFEFKQGDLYVGFDVDLWAAIAKELKLDYELKPMDFSGIIPALQTKNVDLALAGITITDERKKAIDFSDGYYKSGLLVMVKANNNDVKSVKDLDGKVVAVKSGTGSVDYAKANIKTKDLRQFPNIDNAYMELGTNRADAVLHDTPNILYFIKTAGNGQFKAVGDSLEAQQYGIAFPKGSDELRDKVNGALKTLRENGTYNEIYKKWFGTEPK --Done /usr/bin/python3 /user/fcazals/home/projects/proj-soft/sbl-install/bin/sbl-spectraldom-run.py --msa-fpath 1ggg-1wdn/1ggg-1wdn.fasta --structures-fpath 1ggg-1wdn/1ggg-1wdn.txt --output-dir 1ggg-1wdn Pdb files found : Collected data/DB-molecular-motions/1ggg.pdb A Collected data/DB-molecular-motions/1wdn.pdb A +++ Running spectraldom in mode Spectraldom_MSA Spectraldom mode MSA: valid length and alignment length are 220 226 residues Spectraldom::initialize_sparse_fluctuation_matrix upper triangular used/unused/total/n*n: 1281/22809/24090/48400 Distance threshold used for contacts dmax=10.00 MSA Calpha_dist upper triangular matrix of dim 220 and with with binom(n,2) = 1281 entries MSA fluctuations computed: 1281 pairs out of 220x220=48400 possible ones T_Spectraldom::compute_similarity_matrix_from_fluctuations_vai:: fluctuation mean/median are 0.48/0.25 End compute_graph with 220 vertices and 1281 edges Spectral_clustering::compute_least_k_eigenvals ...Spectraldom: running k_means for k = 2 3 4 5 6 7 8 9 10 ...Spectraldom: dumping scores /home/fcazals/projects/proj-soft/sbl/Applications/Spectral_domain_explorer/demos/Spectral_domain_explorer/1ggg-1wdn/scores-best.csv ...Spectraldom: dumping scores /home/fcazals/projects/proj-soft/sbl/Applications/Spectral_domain_explorer/demos/Spectral_domain_explorer/1ggg-1wdn/scores-repeats.csv ---Spectraldom done ---Spectraldom done
/usr/local/lib64/python3.11/site-packages/Bio/PDB/StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain A is discontinuous at line 3924. warnings.warn( /usr/local/lib64/python3.11/site-packages/Bio/PDB/StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain B is discontinuous at line 4004. warnings.warn( /usr/local/lib64/python3.11/site-packages/Bio/PDB/StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain A is discontinuous at line 2187. warnings.warn(
['/user/fcazals/home/projects/proj-soft/sbl-install/bin/sbl-spectraldom.exe', '--kmin', '2', '--kmax', '10', '--dmax-contacts', '10', '--mode', 'MSA', '--msa-fpath', '1ggg-1wdn/1ggg-1wdn.fasta', '--structures-fpath', '1ggg-1wdn/1ggg-1wdn.txt', '--output-dir', '1ggg-1wdn'] /user/fcazals/home/projects/proj-soft/sbl-install/bin/sbl-spectraldom.exe --kmin 2 --kmax 10 --dmax-contacts 10 --mode MSA --msa-fpath 1ggg-1wdn/1ggg-1wdn.fasta --structures-fpath 1ggg-1wdn/1ggg-1wdn.txt --output-dir 1ggg-1wdn #i################################################################################ +++ Spectraldom-run.py analysis... #i################################################################################ K values for local maxima: [2, 5, 9] Best clusterings are: ['1ggg-1wdn/spectraldom-clustering-1ggg-A-k2.txt', '1ggg-1wdn/spectraldom-clustering-1ggg-A-k5.txt', '1ggg-1wdn/spectraldom-clustering-1ggg-A-k9.txt'] === Quality plot dumped: okular 1ggg-1wdn/1ggg-1wdn-quality-score-plot.pdf pdb-list here: [('data/DB-molecular-motions/1ggg.pdb', '1ggg', 'A'), ('data/DB-molecular-motions/1wdn.pdb', '1wdn', 'A')] data/DB-molecular-motions/1ggg.pdb pdbid:data/DB-molecular-motions/1ggg.pdb model idx:0 chain_id:A aa_in_primary_seq:226 aa_in_struct:220 resid_min:5 resid_max:224 resid_span:220 226 ; ADKKLVVATDTAFVPFEFKQGDLYVGFDVDLWAAIAKELKLDYELKPMDFSGIIPALQTKNVDLALAGITITDERKKAIDFSDGYYKSGLLVMVKANNNDVKSVKDLDGKVVAVKSGTGSVDYAKANIKTKDLRQFPNIDNAYMELGTNRADAVLHDTPNILYFIKTAGNGQFKAVGDSLEAQQYGIAFPKGSDELRDKVNGALKTLRENGTYNEIYKKWFGTEPK pdbid:data/DB-molecular-motions/1ggg.pdb model idx:0 chain_id:B aa_in_primary_seq:226 aa_in_struct:220 resid_min:5 resid_max:224 resid_span:220 226 ; ADKKLVVATDTAFVPFEFKQGDLYVGFDVDLWAAIAKELKLDYELKPMDFSGIIPALQTKNVDLALAGITITDERKKAIDFSDGYYKSGLLVMVKANNNDVKSVKDLDGKVVAVKSGTGSVDYAKANIKTKDLRQFPNIDNAYMELGTNRADAVLHDTPNILYFIKTAGNGQFKAVGDSLEAQQYGIAFPKGSDELRDKVNGALKTLRENGTYNEIYKKWFGTEPK (dfam) Dumped 1ggg-1wdn/dfam-1ggg-A-k10.txt (dfam) Dumped 1ggg-1wdn/dfam-1ggg-A-k2.txt Dumped 1ggg-1wdn/domains-1ggg-A-k2.spec (dfam) Dumped 1ggg-1wdn/dfam-1ggg-A-k3.txt (dfam) Dumped 1ggg-1wdn/dfam-1ggg-A-k4.txt (dfam) Dumped 1ggg-1wdn/dfam-1ggg-A-k5.txt Dumped 1ggg-1wdn/domains-1ggg-A-k5.spec (dfam) Dumped 1ggg-1wdn/dfam-1ggg-A-k6.txt (dfam) Dumped 1ggg-1wdn/dfam-1ggg-A-k7.txt (dfam) Dumped 1ggg-1wdn/dfam-1ggg-A-k8.txt (dfam) Dumped 1ggg-1wdn/dfam-1ggg-A-k9.txt Dumped 1ggg-1wdn/domains-1ggg-A-k9.spec === (vmd) Commands for local maxima--in order of decreasing score: vmd -e 1ggg-1wdn/vmd-1ggg-A-k2.vmd vmd -e 1ggg-1wdn/vmd-1ggg-A-k5.vmd vmd -e 1ggg-1wdn/vmd-1ggg-A-k9.vmd data/DB-molecular-motions/1wdn.pdb pdbid:data/DB-molecular-motions/1wdn.pdb model idx:0 chain_id:A aa_in_primary_seq:226 aa_in_struct:223 resid_min:4 resid_max:226 resid_span:223 226 ; ADKKLVVATDTAFVPFEFKQGDLYVGFDVDLWAAIAKELKLDYELKPMDFSGIIPALQTKNVDLALAGITITDERKKAIDFSDGYYKSGLLVMVKANNNDVKSVKDLDGKVVAVKSGTGSVDYAKANIKTKDLRQFPNIDNAYMELGTNRADAVLHDTPNILYFIKTAGNGQFKAVGDSLEAQQYGIAFPKGSDELRDKVNGALKTLRENGTYNEIYKKWFGTEPK (dfam) Dumped 1ggg-1wdn/dfam-1wdn-A-k10.txt (dfam) Dumped 1ggg-1wdn/dfam-1wdn-A-k2.txt Dumped 1ggg-1wdn/domains-1wdn-A-k2.spec (dfam) Dumped 1ggg-1wdn/dfam-1wdn-A-k3.txt (dfam) Dumped 1ggg-1wdn/dfam-1wdn-A-k4.txt (dfam) Dumped 1ggg-1wdn/dfam-1wdn-A-k5.txt Dumped 1ggg-1wdn/domains-1wdn-A-k5.spec (dfam) Dumped 1ggg-1wdn/dfam-1wdn-A-k6.txt (dfam) Dumped 1ggg-1wdn/dfam-1wdn-A-k7.txt (dfam) Dumped 1ggg-1wdn/dfam-1wdn-A-k8.txt (dfam) Dumped 1ggg-1wdn/dfam-1wdn-A-k9.txt Dumped 1ggg-1wdn/domains-1wdn-A-k9.spec === (vmd) Commands for local maxima--in order of decreasing score: vmd -e 1ggg-1wdn/vmd-1wdn-A-k2.vmd vmd -e 1ggg-1wdn/vmd-1wdn-A-k5.vmd vmd -e 1ggg-1wdn/vmd-1wdn-A-k9.vmd --- Spectraldom-run.py analysis done === (rmsdcomb) Command to run the combined RMSD calculation: sbl-rmsd-flexible-proteins-kpax.exe -f 1ggg.pdb --domain-labels 1ggg-1wdn/domains-1ggg-A-k2.spec --load-chains A -f 1ggg.pdb --domain-labels 1ggg-1wdn/domains-1ggg-A-k5.spec --load-chains A -f 1ggg.pdb --domain-labels 1ggg-1wdn/domains-1ggg-A-k9.spec --load-chains A -f 1wdn.pdb --domain-labels 1ggg-1wdn/domains-1wdn-A-k2.spec --load-chains A -f 1wdn.pdb --domain-labels 1ggg-1wdn/domains-1wdn-A-k5.spec --load-chains A -f 1wdn.pdb --domain-labels 1ggg-1wdn/domains-1wdn-A-k9.spec --load-chains A === (dfam) Command to run DFAM on pairs of consecutive local maxima sbl-cmp-clust-dfam.exe -f 1ggg-1wdn/dfam-1wdn-A-k10.txt -f 1ggg-1wdn/dfam-1wdn-A-k2.txt --diameter-constraint 2 --num-iterations 100 sbl-cmp-clust-dfam.exe -f 1ggg-1wdn/dfam-1wdn-A-k2.txt -f 1ggg-1wdn/dfam-1wdn-A-k3.txt --diameter-constraint 2 --num-iterations 100 sbl-cmp-clust-dfam.exe -f 1ggg-1wdn/dfam-1wdn-A-k3.txt -f 1ggg-1wdn/dfam-1wdn-A-k4.txt --diameter-constraint 2 --num-iterations 100 sbl-cmp-clust-dfam.exe -f 1ggg-1wdn/dfam-1wdn-A-k4.txt -f 1ggg-1wdn/dfam-1wdn-A-k5.txt --diameter-constraint 2 --num-iterations 100 sbl-cmp-clust-dfam.exe -f 1ggg-1wdn/dfam-1wdn-A-k5.txt -f 1ggg-1wdn/dfam-1wdn-A-k6.txt --diameter-constraint 2 --num-iterations 100 sbl-cmp-clust-dfam.exe -f 1ggg-1wdn/dfam-1wdn-A-k6.txt -f 1ggg-1wdn/dfam-1wdn-A-k7.txt --diameter-constraint 2 --num-iterations 100 sbl-cmp-clust-dfam.exe -f 1ggg-1wdn/dfam-1wdn-A-k7.txt -f 1ggg-1wdn/dfam-1wdn-A-k8.txt --diameter-constraint 2 --num-iterations 100 sbl-cmp-clust-dfam.exe -f 1ggg-1wdn/dfam-1wdn-A-k8.txt -f 1ggg-1wdn/dfam-1wdn-A-k9.txt --diameter-constraint 2 --num-iterations 100 Quality plot file: 1ggg-1wdn/1ggg-1wdn-quality-score-plot.pdf GPL Ghostscript 10.02.1 (2023-11-01) Copyright (C) 2023 Artifex Software, Inc. All rights reserved. This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY: see the file COPYING for details. Processing pages 1 through 1. Page 1
cmd = "sbl-kpax-iterative-alignment.exe -f data/DB-molecular-motions/1ggg.pdb -f data/DB-molecular-motions/1wdn.pdb --load-chains A --load-chains A -v 2"
os.system(cmd)
Running: sbl-kpax-iterative-alignment.exe -f data/DB-molecular-motions/1ggg.pdb -f data/DB-molecular-motions/1wdn.pdb --load-chains A --load-chains A -v 2 Protein representation loader Statistics: Number of loaded PDB files: 2 Details for each file: -- file 1 data/DB-molecular-motions/1ggg.pdb: -- -- File exists: Yes -- -- File is valid: Yes -- -- Number of models: 1 -- -- Number of loaded atoms (of model with lowest index): 1714 -- -- Number of loaded SS bonds: 0 -- -- Number of discarded atoms based on model id (all models): 0 -- -- Number of discarded atoms based on chain id (all models): 1793 -- -- Number of discarded atoms based on b factor beeing too high (all models): 0 -- -- Number of discarded waters (all models): 159 -- -- Number of discarded hetatoms (all models): 159 -- -- Number of discarded hydrogens (all models): 0 -- -- Number of discarded alternate location atoms (all models): 0 -- -- Number of actually discarded atoms (all models): 1873 -- file 2 data/DB-molecular-motions/1wdn.pdb: -- -- File exists: Yes -- -- File is valid: Yes -- -- Number of models: 1 -- -- Number of loaded atoms (of model with lowest index): 1740 -- -- Number of loaded SS bonds: 0 -- -- Number of discarded atoms based on model id (all models): 0 -- -- Number of discarded atoms based on chain id (all models): 9 -- -- Number of discarded atoms based on b factor beeing too high (all models): 0 -- -- Number of discarded waters (all models): 122 -- -- Number of discarded hetatoms (all models): 122 -- -- Number of discarded hydrogens (all models): 0 -- -- Number of discarded alternate location atoms (all models): 0 -- -- Number of actually discarded atoms (all models): 131 Covalent Structure File Loader statistics: -- -- Number of molecular covalent structures: 2 -- Molecular covalent structure 0: -- -- Number of particles: 7150 -- -- Number of embedded particles: 1714 -- -- Number of bonds: 7212 -- -- Number of embedded bonds: 1744 -- -- Number of disulfide bonds: 0 -- Molecular covalent structure 1: -- -- Number of particles: 3593 -- -- Number of embedded particles: 1740 -- -- Number of bonds: 3623 -- -- Number of embedded bonds: 1771 -- -- Number of disulfide bonds: 0 Protein Representation Loader statistics: Number of loaded proteins: 2 Details for each protein : -- structure 1: -- -- Number of chains: 1 -- -- Number of residues: 220 -- -- Number of particles: 3575 -- -- Number of embedded particles: 1714 -- -- Number of bonds: 3606 -- -- Number of embedded bonds: 1744 -- structure 2: -- -- Number of chains: 1 -- -- Number of residues: 223 -- -- Number of particles: 3575 -- -- Number of embedded particles: 1740 -- -- Number of bonds: 3606 -- -- Number of embedded bonds: 1771 Iterative alignment Starting module SBL::Modules::T_Alignment_structures_module...
Seeder_DP_score:alignment size:221
Statistics: Alignment length: 164 Alignment score : 74.435674 Alignment identity percentage: 0.378049 Alignment similarity percentage: 0.445122 Alignment lRMSD : 3.275461 Alignment dRMSD : 2.395408 Report... Ending module SBL::Modules::T_Alignment_structures_module... End Run General Statistics: Times elapsed for computations (in seconds): -- Iterative alignment: 2.172181 Total: 2.172181
0
cmd="sbl-rmsd-flexible-proteins-kpax.exe -f data/DB-molecular-motions/1ggg.pdb --domain-labels 1ggg-1wdn/domains-1ggg-A-k2.spec --load-chains A -f data/DB-molecular-motions/1wdn.pdb --domain-labels 1ggg-1wdn/domains-1wdn-A-k2.spec --load-chains A --directory-output 1ggg-1wdn"
os.system(cmd)
SBL_pytools.show_txt_file("rmsd-flexible-proteins-kpax__labels_lrmsd.txt", "1ggg-1wdn")
++Showing file 1ggg-1wdn/sbl-rmsd-flexible-proteins-kpax__labels_lrmsd.txt Label comparison between chain A of and chain A of -- one line: label lRMSD #amino acids Dom-0 3.809 124 Dom-1 4.062 96 Weighted lRMSD 3.921 involving a total of 220 amino acids --Done
defrag_exe = subprocess.run("which sbl-spectraldom-defrag.py", shell=True, stdout=subprocess.PIPE).stdout.decode('utf-8').strip()
cmd = f"%s %s -r 1ggg-1wdn -s data/DB-molecular-motions/1wdn.pdb -c A" % (py, defrag_exe)
os.system(cmd)
Peaks: [2, 5] sbl-cmp-clust-dfam.exe -f 1ggg-1wdn/dfam-1wdn-A-k2.txt -f 1ggg-1wdn/dfam-1wdn-A-k5.txt --diameter-constraint 2 --num-iterations 100 WARNING: VoI modified !! Now it's normalized by 2log(K) Postprocessing_Metaclusters_VoI -0.0 data/DB-molecular-motions/1wdn.pdb pdbid:data/DB-molecular-motions/1wdn.pdb model idx:0 chain_id:A aa_in_primary_seq:226 aa_in_struct:223 resid_min:4 resid_max:226 resid_span:223 226 ; ADKKLVVATDTAFVPFEFKQGDLYVGFDVDLWAAIAKELKLDYELKPMDFSGIIPALQTKNVDLALAGITITDERKKAIDFSDGYYKSGLLVMVKANNNDVKSVKDLDGKVVAVKSGTGSVDYAKANIKTKDLRQFPNIDNAYMELGTNRADAVLHDTPNILYFIKTAGNGQFKAVGDSLEAQQYGIAFPKGSDELRDKVNGALKTLRENGTYNEIYKKWFGTEPK (dfam) Dumped 1ggg-1wdn/dfam-1wdn-A-k0.txt === (vmd) Commands for local maxima--in order of decreasing score:
/usr/local/lib64/python3.11/site-packages/Bio/PDB/StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain A is discontinuous at line 2187. warnings.warn(
0