Molecular cradle: a combined analysis based on conformations, states, and rigid blocks
Main goals. A suitable framework to describe certain biomolecules is that of almost rigid blocks whose relative positions change over time. Stable relative positions are often characteristic of certain states of the mechanism studied. More specifically, the notion of state refers to a meta-stable state, that is, a coherent ensemble of conformations which may be observed experimentally. Understanding the connection between such states and the groups of (sub)domains which characterize them is the goal of this package. In short, in a manner akin to the analysis of Newton's cradle, we wish to study the complex dynamics of a biomolecular system by identifying which regions account for specific states, using static structures only.
This endeavor is in particular tractable when an ensemble of (high-resolution) structures is available, as recently formalized in [162]. This package provides three types of analysis to deal with such systems. To describe them, we take the following for granted:
A list of conformations of sequence identical or homologous polypeptide chains. A conformation is also called a chain instance or instance for short. We may also assume that a number of chain instances have a label corresponding to a particular state in the biomolecular mechanism studied. Instances which do not have a label are termed unlabeled.
A template decomposing the polypeptide chain of interest into subdomains. Typically, subdomains are regions connected by linkers. See e.g. the decompositions used in the package Molecular_distances_flexible.
The three analyses provided in this package. These analyses are:
1. Classifying states of unlabeled monomers. This step is a clustering step aiming at identifying groups of coherent structures, with one state per cluster. The clustering defining these states is called the reference clustering.
2. Identifying subdomains compatible with the states of the reference clustering. The goal is to identify those subdomains characteristic of (selected) states.
3. Characterizing the evolution of Voronoi interfaces between subdomains across states. Voronoi interfaces characterize noncovalent interactions between molecules or domains. The goal is to identify interfaces characteristic of (selected) states. We therefore compare interface sizes between subdomains across states.
These steps are implemented by three scripts, called sbl-cradle-step1.py, sbl-cradle-step2.py and sbl-cradle-step3.py respectively.
Case study: AcrB. As an illustration, we study AcrB, a trimeric membrane protein of the resistance-nodulation-cell-division superfamily, involved in the active transport of a variety of molecules [162]. Numerous crystal structures of AcrB have been obtained, from which a complex mechanism akin to a peristaltic pump has been inferred. Along this mechanism, which uses the proton-motive force, monomers cycle across three states denoted Access (A), Binding (B), and Extrusion (E). The reader is referred to [162] for a comprehensive bibliography and results obtained with all crystal structures known to date.
General pre-requisites
Molecular distances
The methods provided in this package are suitable in two settings, for which different molecular distances are used:
Case 1: all chain instances share the same primary sequence: the molecular distance used is the classical least RMSD (lRMSD), or possibly the combined RMSD [40], which mixes the lRMSD of individual domains or subdomains.
In that case, the executable used compares two conformations of the same polypeptide chain, so that amino acids are numbered identically for all chains. (Nb: if selected a.a. are unresolved in one structure or the other, only those a.a. present in both structures are used.)
Case 2: chain instances are homologous: computing a molecular distance requires performing an alignment. Two options are proposed:
the first one computes the alignment with the aligner from the package Iterative_alignment;
the second one uses the combinatorial alignment method from the package Apurva.
In the Structural Bioinformatics Library, all molecular distance calculations for proteins can be carried out at four levels: Calpha atoms, backbone atoms, heavy atoms, all atoms (see Molecular_distances_flexible). Since the cradle package makes it possible to handle homologous proteins, the Calpha option is retained.
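To make the idea of a distance mixing per-subdomain lRMSD values concrete, here is a minimal sketch computing a size-weighted quadratic mean. The weighting by the number of atoms compared is an assumption made for illustration (see [40] for the exact definition), and the function name combined_rmsd and its input layout are hypothetical, not part of the library.

```python
import math


def combined_rmsd(per_subdomain):
    """Size-weighted quadratic mean of per-subdomain lRMSD values.

    per_subdomain: list of (lrmsd, n_atoms) pairs, one per subdomain.
    Assumption: weights are proportional to the number of atoms compared.
    """
    total = sum(n for _, n in per_subdomain)
    if total == 0:
        raise ValueError("no atoms to compare")
    return math.sqrt(sum(n * d * d for d, n in per_subdomain) / total)
```

With such a weighting, a large rigid subdomain does not hide the displacement of a small flexible one as much as a global superposition would, since each subdomain is superposed separately before mixing.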
Mean displacement and the significance of lRMSDs
Consider a (sub)domain of a chain which is known to adopt several states in a mechanism. Given a set of instances of this (sub)domain, for different states of the containing monomer, several lRMSD comparisons can be undertaken, as introduced in [162]:
Intra state comparison: distribution of the lRMSD observed for all pairs of instances of (sub)domains within a state.
Inter state comparison: distribution of the lRMSD observed for all pairs of (sub)domains in the Cartesian product of the two states.
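On the other hand, in a crystal structure, atomic oscillation amplitudes are related to B-factors by the standard isotropic formula $B = 8\pi^2 \langle u^2 \rangle$, where $\langle u^2 \rangle$ is the mean square displacement of the atom; the mean displacement $u$ used below is derived from this relation (see [162] for the exact convention).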
Comparing mean displacements against the lRMSD of (intra, inter) state comparisons then makes it possible to single out those comparisons which are positive, i.e. those whose lRMSD is larger than the mean displacement.
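As a hedged illustration, the following sketch derives a mean displacement from a list of B-factors. It assumes the standard isotropic relation B = 8 pi^2 <u^2>, and it assumes that the B-factors of the selected atoms are averaged before taking the square root; the convention actually used by the package may differ (see [162]). The function name is hypothetical.

```python
import math


def mean_displacement(b_factors):
    """Mean displacement (Angstrom) from isotropic B-factors (Angstrom^2).

    Assumptions: B = 8 * pi^2 * <u^2>, and the B-factors of the selected
    atoms are averaged before taking the square root.
    """
    if not b_factors:
        raise ValueError("empty selection")
    mean_b = sum(b_factors) / len(b_factors)
    return math.sqrt(mean_b / (8.0 * math.pi ** 2))
```

Under this convention, a B-factor of about 79 square Angstrom corresponds to a displacement of about 1 Angstrom, which gives a sense of scale when reading lRMSD values.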
Based on these quantities, a classification of subdomains as static, dynamic or unstable is proposed in [162].
For example, a subdomain which does not exhibit any positive intra state comparison, but has at least one positive inter state comparison, is termed dynamic. (Nb: a sufficient but not necessary condition, see [162].) Such subdomains are used in the second step.
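The sufficient condition above can be sketched as follows. This is an illustration only: the function names and the data layout (a dictionary of pairwise lRMSD values keyed by unordered pairs of instance ids) are hypothetical, and each pairwise lRMSD is treated here as one comparison, whereas [162] may aggregate values per pair of states.

```python
from itertools import combinations, product


def state_comparisons(lrmsd, states):
    """Split pairwise lRMSD values into intra-state and inter-state lists.

    lrmsd: dict mapping frozenset({id1, id2}) to the lRMSD of that pair.
    states: dict mapping each instance id to its state label.
    """
    by_state = {}
    for inst, st in states.items():
        by_state.setdefault(st, []).append(inst)
    intra = {st: [lrmsd[frozenset(p)] for p in combinations(members, 2)]
             for st, members in by_state.items()}
    inter = {(s, t): [lrmsd[frozenset(p)]
                      for p in product(by_state[s], by_state[t])]
             for s, t in combinations(sorted(by_state), 2)}
    return intra, inter


def is_dynamic(intra, inter, u):
    """No positive intra-state comparison, at least one positive inter-state one.

    A comparison is positive when its lRMSD exceeds the mean displacement u.
    """
    no_intra = all(d <= u for values in intra.values() for d in values)
    some_inter = any(d > u for values in inter.values() for d in values)
    return no_intra and some_inter
```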
Voronoi interfaces
A molecular space filling model (SFM) is a molecular representation with one ball per atom. The SFM can be partitioned by a so-called Voronoi diagram, which assigns one Voronoi cell to each atom. Furthermore, one can compute the restriction of each atomic ball, as the intersection between the ball and its Voronoi region.
Consider now two domains or subdomains in a molecule. An interfacial pair for these two (sub)domains is a pair of atoms, one on each (sub)domain, such that their Voronoi restrictions are in contact, i.e. they share a so-called Voronoi facet. The interface size between these two (sub)domains is then defined as the number of interfacial atoms.
In this package, we focus in particular on interfaces between pairs of subdomains, which gives an indication of their spatial proximity as a function of the state considered.
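Assuming the interfacial pairs have already been computed from the Voronoi diagram, counting the interfacial atoms can be sketched as below. The function name and the pair representation are hypothetical; the sketch takes an interfacial atom to be any atom involved in at least one interfacial pair.

```python
def interface_size(interfacial_pairs):
    """Number of distinct atoms involved in at least one interfacial pair.

    interfacial_pairs: iterable of (atom_in_first, atom_in_second) ids,
    assumed to be precomputed from the Voronoi diagram.
    """
    atoms = set()
    for a, b in interfacial_pairs:
        # Tag each atom with its side so that ids may overlap across sides.
        atoms.add(("first", a))
        atoms.add(("second", b))
    return len(atoms)
```

Note that an atom taking part in several interfacial pairs is counted once, so the interface size is in general smaller than twice the number of pairs.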
Using Molecular_cradle: script sbl-cradle-step1.py
Pre-requisites
Reference clustering and chain-to-state mapping. In step 1, we compute a hierarchical clustering of the chain instances using one of the aforementioned molecular distances. The output is a dendrogram, for which several linkage options can be used (see options below). It is up to the user to decide whether this clustering yields well separated clusters. If so, a chain-to-state id mapping can be defined, providing a state id for each chain instance. See the illustration below.
The state id of a chain is the character string providing the corresponding state for that chain. (Nb: the number of different strings must equal the number of clusters in the reference clustering.)
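For readers unfamiliar with linkage options, the sketch below shows a naive agglomerative clustering on a precomputed distance matrix, with single, complete and average linkage (Ward linkage is not covered). It is a didactic illustration with a hypothetical function name, not the implementation used by the script, and it stops at a requested number of clusters instead of building the full dendrogram.

```python
def agglomerate(dist, n_clusters, linkage="average"):
    """Naive agglomerative clustering on a symmetric distance matrix.

    dist: list of lists, dist[i][j] being the distance between items i and j.
    Returns a list of clusters, each a sorted list of item indices.
    """
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    combine = {
        "single": min,
        "complete": max,
        "average": lambda values: sum(values) / len(values),
    }[linkage]
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        # Find the two clusters at minimum linkage distance.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = combine([dist[i][j]
                             for i in clusters[a] for j in clusters[b]])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters
```

For AcrB, one would typically inspect whether cutting at three clusters separates the chains known to be in states A, B and E.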
Protein decomposition template specification. This template decomposes a chain into subdomains, using the following format:
polypeptide chain and its hierarchical decomposition specified by a triple:
protein name (below: protein-name)
domain name (below: domain). Nb: use e.g. whole if the whole protein is considered
subdomains, each specified by a list of integer intervals. See example below.
The template, one per chain, is defined in a spec file whose name follows the convention:
protein-name_rmsd-type_domain_chain-id.spec
See example below.
Here is the spec file for the whole chain of AcrB:
#the first line starts the template and gives it a name
domains-template-begin AcrB
#the following lines contain: the name of the label, then the ranges
#of residues corresponding to this label (including the bounds)
Whole 1-1044
#the star denotes the complementary, i.e. all residues not
#mentioned before in the template
# Coil is not taken into account here to maximize interest of flexible RMSD
#COIL **
#terminates the template
end
#enumerates the chains and possibly associates template to them
chains-enumeration-begin
A like AcrB
end
#groups hierarchically the chains
chains-hierarchy-begin
M1 A
end
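To clarify how such a file is organized, here is a minimal reader for the template and chain-enumeration sections. It is a sketch with a hypothetical function name, not the parser used by the library. It assumes that several residue ranges for one label are separated by whitespace and that residue numbers are non-negative; the hierarchy section is skipped.

```python
def parse_spec(text):
    """Parse a spec file into (template name, labels, chains).

    labels maps each label to a list of inclusive (first, last) ranges.
    chains maps each chain id to the name of its template.
    """
    name, labels, chains = None, {}, {}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        tokens = line.split()
        if tokens[0] == "domains-template-begin":
            section, name = "template", tokens[1]
        elif tokens[0] == "chains-enumeration-begin":
            section = "chains"
        elif tokens[0] == "chains-hierarchy-begin":
            section = "hierarchy"
        elif tokens[0] == "end":
            section = None
        elif section == "template":
            ranges = []
            for tok in tokens[1:]:
                first, last = tok.split("-")
                ranges.append((int(first), int(last)))
            labels[tokens[0]] = ranges
        elif section == "chains" and len(tokens) == 3 and tokens[1] == "like":
            chains[tokens[0]] = tokens[2]
    return name, labels, chains
```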
The decomposition template is optional for step 1, since the clustering is carried out on whole molecules by default. If a decomposition template with at least two subdomains is provided, the global lRMSD is replaced by the combined RMSD of these subdomains. The file below provides such a specification, for the so-called Coil2 subdomain of AcrB.
The (sub)domain specification in these spec files must use the exact same name, even for two different proteins.
#the first line starts the template and gives it a name
domains-template-begin AcrB
#the following lines contain: the name of the label, then the ranges
#of residues corresponding to this label (including the bounds)
Coil2 132-137
#the star denotes the complementary, i.e. all residues not
#mentioned before in the template
# Coil is not taken into account here to maximize interest of flexible RMSD
#COIL **
#terminates the template
end
#enumerates the chains and possibly associates template to them
chains-enumeration-begin
A like AcrB
end
#groups hierarchically the chains
chains-hierarchy-begin
M1 A
end
Protein database specification. The database lists the chain instances processed. It is provided as a CSV file in which each row is a triple (PDB id, protein name, chain id(s)). Note that the previous decomposition template is applied to each chain.
Example:
pdb;protein;chains
2dr6;AcrB;ABC
3aoa;AcrB;ABC
3w9h;AcrB;ABC
4dx5;AcrB;ABC
4zit;AcrB;ABCDEF
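The following sketch expands such a database into individual chain instances. The function name is hypothetical, and the instance identifiers are assumed to follow the pdb_chain convention used in the chain-to-state mapping (e.g. 4zit_A), with one character per chain id.

```python
import csv
import io


def chain_instances(csv_text):
    """List (chain instance id, protein name) for every chain in the database."""
    reader = csv.reader(io.StringIO(csv_text), delimiter=";")
    # Column names are lowercased so that pdb;Protein;chains is accepted too.
    header = [h.strip().lower() for h in next(reader)]
    instances = []
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        row = dict(zip(header, (f.strip() for f in fields)))
        for chain in row["chains"]:
            instances.append((f"{row['pdb']}_{chain}", row["protein"]))
    return instances
```

Applied to the example above, this yields 18 chain instances, from 2dr6_A to 4zit_F.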
Optional: chain-to-state map file. It may happen that the state of selected (but not all) chains is known. If so, the chain-to-state mapping is specified in a text file (csv format). For the sake of presentation, one color per state must also be specified; it is used to display the leaves of the dendrogram produced by the hierarchical clustering.
The number of entries in this file is at most the number of chain instances in the database. A chain whose state and color are not specified is displayed in black in the clustering.
The following file illustrates this mapping for AcrB, using the three states A, B, and E:
chain-to-state mapping
PDB;state;color
3w9h_A;A;red
4dx5_A;A;red
4zit_A;A;red
2dr6_C;A;red
3aoa_C;A;red
4zit_D;A;red
2dr6_A;B;cyan
3aoa_A;B;cyan
3w9h_B;B;cyan
4dx5_B;B;cyan
4zit_B;B;cyan
4zit_E;B;cyan
3w9h_C;E;green
4dx5_C;E;green
4zit_C;E;green
2dr6_B;E;green
3aoa_B;E;green
4zit_F;E;green
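A minimal sketch of how this mapping can be read and used to color dendrogram leaves is given below. The function names are hypothetical, and the file is assumed to start with the header line PDB;state;color; chains absent from the mapping fall back to black, as stated above.

```python
import csv
import io


def read_state_map(csv_text):
    """Return {chain instance id: (state, color)} from the mapping file."""
    reader = csv.reader(io.StringIO(csv_text), delimiter=";")
    next(reader)  # skip the header line PDB;state;color
    return {row[0]: (row[1], row[2]) for row in reader if len(row) == 3}


def leaf_colors(instances, state_map, default="black"):
    """Color of each dendrogram leaf; unlabeled instances get the default."""
    return {inst: state_map[inst][1] if inst in state_map else default
            for inst in instances}
```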
Input of the script
The main options of the program sbl-cradle-step1.py are:
(-d, --idir) string: directory containing all input files (required)
(-x, --exe_type) string: alignment type used in case of multiple protein comparisons (default: kpax)
(-c, --csv_file_info) string: CSV file containing the database specification (pdb;Protein;chains) (required)
(-sf, --statefile) string: CSV file providing the chain-to-state mapping (pdb_chain;state id) (optional)
(-l, --linkage) string: linkage type for the hierarchical clustering: "average" (default), "single", "complete", "ward"
(-w, --workflow) string: workflow type: lRMSD computation (computation), clustering (analysis), both (both, default)
Output of the script
The main outputs generated are:
(Main output step1-1) Dendrogram generated by the hierarchical clustering. As noted above, the leaves are colored by states when the chain-to-state mapping is provided.
(Main output step1-2) Aggregated matrix with all pairwise distances (lRMSD or combined RMSD).
(Main output step1-3) Aggregated matrix with the sizes of the alignments which accompany the distance calculations. This file is of special interest if homologous chains are compared, to make sure that comparable numbers of a.a. are used in all distance calculations.
The comparison of molecular distances should be accompanied by a check of the number of residues involved, as a small alignment size typically results in a smaller distance. This is particularly critical when comparing homologous proteins.
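Such a check can be sketched as follows, assuming the matrix of alignment sizes has been loaded as a list of lists. The function name and the threshold min_fraction are hypothetical and should be tuned by the user; the format of the file produced by the script is not assumed here.

```python
def small_alignments(sizes, labels, min_fraction=0.8):
    """Pairs whose alignment size falls below a fraction of the largest one.

    sizes: symmetric matrix (list of lists) of alignment sizes.
    labels: chain instance ids, in the order of the matrix rows.
    min_fraction: hypothetical threshold, to be tuned by the user.
    """
    n = len(labels)
    if n < 2:
        return []
    largest = max(sizes[i][j] for i in range(n) for j in range(i + 1, n))
    return [(labels[i], labels[j], sizes[i][j])
            for i in range(n) for j in range(i + 1, n)
            if sizes[i][j] < min_fraction * largest]
```

Distances reported for the flagged pairs should be interpreted with care, since they are computed on fewer residues than the others.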
Using Molecular_cradle: script sbl-cradle-step2.py
Pre-requisites
Recall that step 2 aims to identify subdomains which are coherent with the reference clustering. Two pieces of information are required to do so:
decomposition template for each chain,
chain-to-state mapping for all chains.
For complex molecules, it is advised to carry out this step for conformations of the same chain only. Indeed, homologous proteins may not have the same dynamic subdomains.
Input of the script
The main options of the program sbl-cradle-step2.py are:
(-d, --pdbdir) string: directory containing input pdb files (required)
(-s, --specdir) string: directory containing input spec files (required)
(-x, --exe_type) string: alignment type used in case of multiple protein comparisons (default: kpax)
(-c, --csv_file_info) string: CSV file containing the database specification (pdb;Protein;chains) (required)
(-sf, --statefile) string: CSV file providing the chain-to-state mapping (pdb_chain;state id) (optional)
(-l, --linkage) string: linkage type for the hierarchical clustering: "average" (default), "single", "complete", "ward"
(-w, --workflow) string: workflow type: lRMSD computation (computation), clustering (analysis), both (both, default)
Output of the script
The main outputs generated are:
(Main output step2-1) Mean displacement for each subdomain.
(Main output step2-2) For each subdomain: table comparing the mean displacement against the inter-states and intra-states lRMSD.
(Main output step2-3) Table listing the dynamic subdomains.
(Main output step2-4) Table summarizing the correctness of the clusterings obtained for each subdomain under the lRMSD.
(Main output step2-5) Table summarizing the correctness of the clusterings obtained for subsets of dynamic subdomains under the combined RMSD.
Using Molecular_cradle: script sbl-cradle-step3.py
Pre-requisites
Recall that step 3 aims at characterizing the stability/evolution of interfaces between subdomains when changing states. The pre-requisites are identical to those of step 2.
For complex molecules, it is advised to carry out this step for conformations of the same chain only. Indeed, homologous proteins may not have identical interfaces between subdomains.
Input of the script
The main options of the program sbl-cradle-step3.py are:
(-d, --pdbdir) string: directory containing input pdb files (required)
(-s, --specdir) string: directory containing input spec files (required)
(-c, --csv_file_info) string: CSV file containing the database specification (pdb;Protein;chains) (required)
(-sf, --statefile) string: CSV file providing the chain-to-state mapping (pdb_chain;state id) (optional)
Output of the script
The main outputs generated are:
(Main output step3-1) Matrix of interface sizes between subdomains. Since the matrix is symmetric, only the upper triangular part is filled. For a given pair, three values are reported, one per state; the value for a state is the median of the interface sizes (number of interfacial atoms) observed between the two subdomains of interest, over all chain instances in that state.
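The aggregation behind this matrix can be sketched as below, under the assumption that one interface size has been computed per chain instance and per pair of subdomains. The function name and the record layout are hypothetical; storing each pair with sorted names mirrors the fact that only the upper triangular part of the symmetric matrix is filled.

```python
from statistics import median


def median_interface_sizes(records):
    """Median interface size per (subdomain pair, state).

    records: iterable of (subdomain_1, subdomain_2, state, size),
    one record per chain instance and per pair of subdomains.
    """
    grouped = {}
    for s1, s2, state, size in records:
        # Sort the names so that (X, Y) and (Y, X) land in the same cell.
        key = (tuple(sorted((s1, s2))), state)
        grouped.setdefault(key, []).append(size)
    return {key: median(values) for key, values in grouped.items()}
```

A pair of subdomains whose median interface size differs markedly between states is a candidate interface characteristic of a state.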
Algorithms and Methods
The analyses provided by the previous scripts involve the following three steps:
Step 1: Classifying states of unlabeled monomers. The main classes are:
# We display the values for intra and inter subdomain comparisons (Main output step2-2).
# u: mean displacement; AvsA: lRMSD of A versus A, ...
sblpyt.show_this_text_file("subdomains_lrmsd_u.csv")
# We display the table with all interfaces and their size for each state
# (see paper for notations) (Main output step3-1).
sblpyt.show_this_text_file("median_result_matrix.csv")