Structural Bioinformatics Library
Template C++ / Python API for developping structural bioinformatics applications.
User Manual

Spectral_domain_explorer

Authors: F. Cazals and J. Herrmann and E. Sarti

Introduction: from Spectrus to Spectraldom

Defining domains within a polypeptide chain or protein amounts to maximizing intra-domain interactions while minimizing inter-domain interactions. This problem is closely related to graph partitioning techniques, and in particular spectral clustering [171].

The $\text{\spectrus}$ algorithm from [145] proposes a solution to this problem using spectral partitioning applied to a coarse-grain model involving $\Calpha$ carbons. Consider pairs of residues whose distance $\dij$ (the distance between the $\Calpha$ carbons) is less than some threshold $\dmax$. For these residues, assume one is able to compute a pairwise fluctuation

akin to the standard deviation of their mutual distance:

\begin{equation} f_{i,j} = \sqrt{ \brave{d_{i,j}^2} - \brave{d_{i,j}}^2 }. \end{equation}

This fluctuation is turned into a similarity $\wij$, and the corresponding matrix $W=(\wij)_{ij}$ is passed to spectral clustering to infer a predefined number of domains/clusters $k$ – a number which is subsequently optimized.

This package implements $\text{\spectraldom}$ [48], which simplifies and extends the original $\text{\spectrus}$ algorithm, offering three modes:

  • Spectraldom_ENM: the original $\text{\spectrus}$ mode [145], which applies to a single molecule. The fluctuation of Eq. eq-fluct is derived from an elastic network model (ENM) and the associated normal modes [13].
  • Spectraldom_DM: this new diffusion maps (DM) mode which simplifies the previous mode by plainly defining the similarity matrix from stiffness constants associated to pairwise covalent and non-covalent interactions. That is, the stiffnesses are directly used as weight/similarity matrix for spectral clustering, removing the calculations of normal modes.
  • Spectraldom_MSA: this new Multiple Sequence Alignment mode, to handle a collection of conformations of homologous proteins. In this setting, the fluctuation of Eq. eq-fluct is computed from the atomic coordinates of $\Calpha$ carbons provided by the structures.

Using Spectraldom

Single structure: modes ENM and DM


Background. We first provide some key elements.

$\text{\spectraldom}$ proposes the aforementioned three modes (ENM, DM, MSA), which can be ascribed to two major modes: single structure (ENM, DM), and multiple structures (MSA).

Modes ENM and DM require a PDB/mmCIF file and the specification of a chain. These two modes use stiffness constants to encode pairwise contacts between residues. The default values are as follows:

  • Covalent contacts: $\stiffc{ij} =10$;
  • Non-covalent contacts: $\stiffnc{ij}=1$;
  • Hydrogen bonding: $\stiffhbond{ij}=1/2$.

Note that two amino acids may interact in different ways. For example, two a.a. on the same generatrix of a $\alpha$-helix interact non-covalently, but also via hydrogen bonding. We therefore define two modes to define the weight $\wij$ of two residues with multiple interaction types, the $\text{\spectraldomWmax}$ (resp. $\text{\spectraldomWsum}$) mode in which we take the max (resp. the sum) of individual stiffnesses. In our helix example, one gets $\gamma_{ij} = \max(\stiffnc, \stiffhbond)$, and $\gamma_{ij} = \stiffnc+\stiffhbond$ respectively.

Mode MSA requires a multiple sequence alignment in FASTA format, as well as a file giving access to the structures processed. A position in the MSA is termed valid provided that no dash is present in that column, and all residues have been resolved in that structure. See below for details.

For a fixed value of $k$, a score $\spectrusscore{k}{n}$ is compute–see details below. This score is an assessment of the corresponding clustering with respect to a null model. The decomposition yielded by $\text{\spectrus}$ is not deterministic since the $\text{\kmeanspp}$ with random initialization is internally used [12]. The stability assessment is obtained via a quality plot , namely the boxplot $\spectrusscore{k}{n}$ obtained for a fixed number $N_r(=10)$ of runs/repeats of $\text{\kmeanspp}$.

The main options of $\text{\spectraldom}$ are:

-f arg string: Filepath to PDB or mmCIF file
–load-chains string: Chain of the PDB or mmCIF file to be processed
–mode arg (=DM) string: Mode: MSA, DM, ENM
–msa-fpath arg string: Filepath for the MSA in fasta format
–structures-fpath arg string: Filepath of the file containing a list of pairs filepath of PDB/mmCIF file + associated chain
-r [ –result-dir ] arg string: Output directory
–dmax-contacts arg (=10) string: Dmax distance threshold for non covalent contacts
–stiffness-cov arg (=10) string: Stiffness for covalent contacts
–stiffness-ncov arg (=1) string: Stiffness for non-covalent contacts
–stiffness-hbond arg (=0.5) string: Stiffness for hydrogen bonds
–sse string: Use SSE info from PDB or mmCIF header to define h-bonds
–stiffness-mode arg (=max) string: Stiffness mode: sum or max
–dzero arg (=4.5) string: Distance d0 used to soften the stiffnesses–Eq eq-gamma-soft
–sigma arg (=0.5) string: Stdev used to soften the stiffnesses–Eq eq-gamma-soft
–kmin arg (=2) string: Min num of clusters
–kmax arg (=10) string: Max num of clusters


NB: run $\text{\sblspectraldom}$ to get all options.

Option structures-fpath takes as arg the filepath of the file containing a list of pairs filepath of PDB/mmCIF file + associated chain. Here is an example of such a file
>a/b/c/1ggg.pdb_A ADKKLVVATDTAFVPFEFKQGDLYVGFDVDLWAAIAKELKLDYELKPMDFSGIIPALQTKNVDLALAGITITDERKKAIDFSDGYYKSGLLVMVKANNNDVKSVKDLDGKVVAVKSGTGSVDYAKANIKTKDLRQFPNIDNAYMELGTNRADAVLHDTPNILYFIKTAGNGQFKAVGDSLEAQQYGIAFPKGSDELRDKVNGALKTLRENGTYNEIYKKWFGTEPK >a/b/c/1wdn.pdb_A ADKKLVVATDTAFVPFEFKQGDLYVGFDVDLWAAIAKELKLDYELKPMDFSGIIPALQTKNVDLALAGITITDERKKAIDFSDGYYKSGLLVMVKANNNDVKSVKDLDGKVVAVKSGTGSVDYAKANIKTKDLRQFPNIDNAYMELGTNRADAVLHDTPNILYFIKTAGNGQFKAVGDSLEAQQYGIAFPKGSDELRDKVNGALKTLRENGTYNEIYKKWFGTEPK




Output. We noted above that $\text{\spectraldom}$ internally resorts to $N_r(=10)$ runs of $\text{\kmeanspp}$. This said, the output is as follows:

  • Two csv files containing scores: scores_repeats.csv, i.e. the scores obtained for all repeats and all values of $k$; and scores_best.csv: the best scores for each value of $k$ – thus $k*N_r$ values.

As illustrated in the notebook below, each line contains 5 numbers:

k,distance_stdev,ave_ratio_distances_NN1_NN2,norm_ave_ratio_distances_NN1_NN2,BIC
  • For each structure and each value of $k$, one text file spectraldom-clustering-pdbid-chaind-k.txt containing a list of pairs $(\text{resid} \in 1..\text{num\_valid\_residues}, \text{cluster\_index} \in 0..k-1)$. As detailed below, valid residues are defined using - in the MSA and missing residues in structures.
  • For each structure and each value of $k$, a VMD visualization state grouping all residue of the same domain/cluster in the same representation. We also create a $(k+1)-th$ representation, in CPK mode, for those resides which are present in a structure, but are invalidated by the MSA.

Algorithm

We now review the main steps of the algorithm, see [145] and [48]. The user is also referred to Section Implementation for more details on the C++ classes involved.

Building contact maps

We record contacts between pairs of residues, identified by their sequence numbers.

  • Covalent contacts (CT_COV). These contacts are directly derived from the covalent graph. The stiffness constant associated with a covalent contact is denoted $\stiffc (=10)$.
  • Non-covalent contacts (CT_NCOV). We take for granted a method to report neighoring heavy atoms in a 3D structure, via range searching. A distance cutoff applies, see below. The stiffness constant associated with a covalent contact is denoted $\stiffnc (=1)$.
  • Hydrogen-bonding contacts (CT_HBOND). These contacts are retrieved from secondary structures provided in the header of the PDB/mmCIF file processed. The corresponding stiffness constant is denoted $\stiffhbond{ij} (=1/2)$.
Other types of contacts can easily be accommodated: CT_SS: contacts for disulfide bonds, CT_SALTB: contacts for salt bridges, etc.


Building the similarity matrix

We now explain how the weights $\wij$ of the similarity matrix used for spectral clustering are computed.


Mode Spectraldom_ENM. Following [13], an elastic network model (ENM) is built from covalent and non covalent contacts. The corresponding normal modes are computed, from which distance fluctuations for pairs of residues are obtained [13]. Denote $\sigma$ the fluctuation stdev over all pairs. The fluctuation $f_{ij}$ is turned into the similarity $w_{ij}$ as follows:

\begin{equation} \wij = \expL{ -\frac{f_{ij}^2} { 2 \sigma^2}}. \end{equation}


Mode Spectraldom_DM. As noted above, each pair of neighboring / interacting residues is assigned one or several stiffness values. In case of duplicate values for two residues, the max value is used – mode $\text{\spectraldomWmax}$. To alleviate the incidence of the distance threshold $\dmax$ used to identify neighboring residues, each stiffness constant is softened as follows–with $d_0(=4\ANG)$ a reference distance:

\begin{equation} \wij = \gamma_{ij}*exp( -\frac{\fabs{\dij-d_0}}{\sigma} ). \end{equation}

We take $d_0 =4.5\ang$ as typical distance between consecutive $\Calpha$s along the backbone, and $\sigma=1/2$.

The corresponding matrix $W$ is directly used by the spectral clustering algorithm.


Mode Spectraldom_MSA. The multiple sequence alignment (MSA) mode handles the case of molecules with similar structures and disimilar sequences. Given a MSA, the atomic distance fluctuation of Eq. eq-fluct is computed from the distances observed in the structures.

Spectral clustering for a fixed number k of clusters/domains

The similarity matrix $W$ is used to perform spectral clustering, see [135] and [171] .

Consider the edge weighted undirected graph whose vertices are the individual residues, and whose edge weights are the $\wij$. Define the degree of a vertex as the sum of similarities for incident edges, and let $D$ be the corresponding diagonal matrix.

Consider the usual unnormalized Laplacian $L = D-W$. The clustering method uses the eigenvectors associated with the lowest $k$ eigenvalues of the symmetric Laplacian $\lapSym = D^{-1/2} L D^{-1/2}$ .

Then, one clusters the normalized rows of the $n\times k$ matrix of eigenvectors truncated to their first $k$ coordinates. This clustering step on the unit sphere $\Sd{k-1}$ yields $k$ clusters.

The original $\text{\spectrus}$ uses k-medoid instead of k-means [145] . We use $\text{\kmeanspp}$ instead, implemented in a generic fashion to accommodate points on the unit sphere $\Sd{k-1}$. Note that in this generic version, the distance between a data point and a cluster center is the length of the shortest great circle arc joining them. Likewise, a center of mass used in k-means in constrained to lie on the unit sphere $\Sd{k-1}$. The generic implementation of $\text{\kmeanspp}$ summarizes all scores of a clustering within the class SBL::GT::K_means_scores.


Optimal number of clusters

Fixed value of k. For a fixed value of $k$, we assess the quality of the clustering using the fitness score used in [145] .

Assume there are $n$ residues. Spectral clustering requires clustering $n$ unit vectors on the sphere $\Sd{k-1}$. Recall that in $\text{\kmeans}$ or k-medoids, one assigns a data point to the nearest center. To assess whether this assignment is stable, one computes the ratio of distances to the first and second nearest centers. (When clusters are well separated, this ratio should be small.) This average ratio for $n$ points is denoted $\kmratiokn{k}{n}$. To compare this value to a null model, let $\kmratiorefkn{k}{n}$ be the same computed for $n$ random points $\Sd{k-1}$ – these reference values are called prefactors in the sequel. The $\text{\spectrus}$ fitness score retained for that value of $k$ reads as

\begin{equation} \spectrusscore{k}{n} = \frac{\kmratiokn{k}{n}}{\kmratiorefkn{k}{n}}. \end{equation}

Repeats. For a fixed value of k, we perform $N_r=10$ repeats, and retain the clustering with the best fitness score.

Range for k and repeats. Consider now a range of values for $k \in [\kmin, \kmax]$. The previous step is iterated to optimize the number of domains in this range. Moreover, for each $k \in [\kmin, \kmax]$, a default number of repeats $N_r=10$ clusterings is performed and the best is retained for that value of $k$.

Quality plot. The quality plot is the box plot of the score Eq. eq-spectrusscore.

On the quality plots, the values of $k$ such that primary sequence size divided by the number of domains is less than 35 amino acids (default size for a helix) are shaded, as decomposing a structure into SSE is not the goal pursued.


Hyper-parameters: comments

The following comments are in order regarding all parameters used:

  • Due to Eq. eq-gamma-soft, the segmentation obtained is robust with respect to the values of $\stiffc{ij}, \stiffnc{ij}, \stiffhbond{ij}$, and also to those of $d_0$ and $\sigma$.
  • On the other hand, the value of $\dmax$ used to identify neighboring residues matters, and a default value $\dmax = 10$ is used. Smaller values yield sparse graphs which favor an over-segmentation.

Implementation

The implementation requires a diverse set of pieces, which we now briefly present.

Main classes


Filters for residues in structures. The class SBL::CSB::T_Chains_residues_contiguous_indexer is used to record the residues present in the structures associated with the sequences of the MSA. Contiguous indices for resids are the contiguous integers which reference the resids, irrespective of gaps.


Multiple sequence alignments. Consider a MSA of length $N$, and assume each sequence has length $s_i$. We recall selected notions from the package Alignment_engines.

In order for atomic fluctuations to be computed using the same number of pairs (i.e. the binomial coefficient $\binom{\text{num. sequences}}{2}$), we define:

A MSA position is valid provided there are no dashes in a MSA column. The number of valid positions is denoted $N_v\leq N$.


The class SBL::CSB::T_Multiple_sequence_alignment_DS is the data structure to handle MSA; it provides in particular the following two mappings:

  • first, alignment indices to sequence indices, i.e. maps from $0..N-1$ to $0..s_i-1$
  • second, alignment indices to valid sequences indices, i.e. maps from $0..N-1$ to $0..N_v$.

The previous class overlooks the presence of residues in the structures associated to the sequences of the MSI.

The class SBL::CSB::T_Multiple_sequence_alignment_DS_filtered has the same interface as SBL::CSB::T_Multiple_sequence_alignment_DS, except that it implements an extra check based on residues associated with a MSA position:

Consider a MSA and the associated structures. A position of the MSA is valid if it is valid if the sense of Def. def-valid-msa-bis with no gap tolerated, and if the residue associated to a MSA position is present in every structure.


In practice, this latter check is provided by the class SBL::CSB::T_Chains_residues_contiguous_indexer. (Recall that this class contains a map from a residue sequence number (minus 1) to contiguous indices.)


Representation of polypeptide chains. The class SBL::CSB::T_Polypeptide_chain_representation from the package Protein_representation is used to access all pieces of information on polypeptide chains (covalent neighbors, atomic coordinates).


Contacts. Given a polypeptide chain, consider the following classes of contacts, defined as interacting residues

  • CT_COV: covalent contacts
  • CT_NCOV: non covalent contacts
  • CT_HBOND: contact corresponding to hydrogen bonds
  • CT_SB: salt bridges
  • CT_SS: disulfide bonds

The class SBL::CSB::T_Polypeptide_chain_contacts_finder finds all such contacts, and provides them as pairs of residue sequence numbers –in (min, max) order. The search of non covalent contacts uses the spatial search engine from CGAL, based on Kd trees.

Instantiating this class uses the local class SBL::CSB::T_Point_xyz_with_atom , which contains the point used for spatial search (CGAL), and also knows the atom from the polypeptide chain representation.


Elatic Network Model and pairwise interactions. Given one polypeptide chain, the class SBL::CSB::T_Elastic_network_model implements the ENM model from [13] , providing in particular a calculation of anisotropic fluctuations associated with normal modes.

As usual, the structure processed may have missing residues at the C-ter and/or N-ter, and also within the structure. Also, its residue sequence numbers may start with negative values. On the other hand, indices in Eigen start at 0. Therefore, residue sequence numbers are converted to contiguous indices using the class SBL::CSB::T_Chains_residues_contiguous_indexer.

An ENM model can be seen as a graph whose vertices are residues, connected by edges.

The $\Calpha$s are collected and inserted into a vector indexed by contiguous indices.

Contacts are sought between residue identified by their sequence numbers. These contacts are also converted to contiguous indices which can be used in Eigen matrices – Eigen indices start at 0.

Linear algebra operations are carried out using the Eigen library.


Clustering and spectral clustering. Spectral clustering is implemented in the class SBL::GT::T_Spectral_clustering, which is parameterized by a traits class provide an edge weighted boost graph. The weight $\wij$ associated with an edge is used to define the Laplacian and weighted Laplacians of this graph [171] .

The class SBL::GT::T_Spectral_clustering internally uses the generic version of $\text{\kmeanspp}$ which handles points on the unit sphere $\Sd{k-1}$, via the class SBL::GT::T_Cluster_engine_k_means .

As a sanity check, spectral clustering uses the post-condition based on Eigen::Success to ensure that the computation of eigenvalues/vectors succeeded. Naturally, the corresponding assert is discarded for optimized compiled codes.



Prefactors for kmeans. Prefactors are the denominators involved in Eq. eq-spectrusscore . They are simply computed by running $\text{\kmeanspp}$ for random points on the unit sphere $\Sd{k-1}$, for various values of $k$. In practice, a table is stored in a cpp file at compile time, see below.


Spectraldom. We finally arrive at the class T_Spectraldom which implements the aforementioned three modes. Given that the DM and ENM modes use a single structure but not MSA, we use the following design pattern to handle the three modes coherently and comply with Defs. def-valid-msa-bis and def-valid-msa-struct-bis :

The Spectraldom traits class requires two types:


Back to structures from clusters. Let $N_v$ be the number of valid indices – Def. def-valid-msa-struct-bis. Spectral clustering returns clusters containing indices in the range $0..N_v-1$. One obtains the residue_sequence_numbers using the reverse maps provided by the class SBL::CSB::T_Multiple_sequence_alignment_DS .

Programs and scripts provided


Main executable: $\text{\sblspectraldom}$: the main program implementing $\text{\spectraldom}$, as described above.


Main script: sbl-spectraldom-run.py: a script calling $\text{\sblspectraldom}$, and also providing a number of post-processing steps, including (i) the generation of the quality plot (ii) the generation of visualization files for VMD, and (iii) the generation of spec files to compute $\text{\rmsdcomb}$ using the package Molecular_distances_flexible .


Reducing fragmentation: sbl-spectraldom-defrag.py: a script used to remove fragmentation of the domains identified. See paper [48] for details.


Handling prefactors. To compute and manipulate prefactors involved in the scores $\spectrusscore{k}{n}$, the package provides

  • sbl-spectraldom-generate-prefactor.cpp/exe : runs $\text{\kmeanspp}$ for random points on a sphere of various dimension, and generates a csv file containing the prefactors;
  • sbl-spectraldom-prefactor-header-writer.py : using the previous csv file, generates kmeans-prefactors.hpp, which is used to compile $\text{\sblspectraldom}$.

Jupyter demo

See the following jupyter demos:

  • Jupyter notebook file
  • Spectral_domain_explorer