Conformational_ensemble_analysis

This package provides methods to assess the conformational diversity of an ensemble of conformations.

Goals

Consider a sampling – aka conformational ensemble, as defined in section Conformational ensembles – aka samplings. This package offers functionalities to:

Assess the structural diversity of the sampling, using various statistics:
- the atomic fluctuations (root mean square fluctuation, RMSF),
- statistics on the edges of a spanning tree connecting the conformations.
Perform a hierarchical clustering of the conformations.
Compute pairwise distances between (selected) conformations, so as to run, e.g., a dimensionality reduction algorithm such as multi-dimensional scaling or Isomap.

These functionalities are provided in the program $\sblCEAL$.

Using sbl-conf-ensemble-analysis-lrmsd.exe

Pre-requisites

Sampling diversity

Currently we use two measures for the sampling diversity, in order to assess the extent to which a molecule deforms within an ensemble $C = \{ c_1,\dots, c_n\}$ . Assume that all conformations have been aligned in the same coordinate system, see Eq. eq-aligned-conformations.

The first is the standard method for estimating the root-mean-square atomic fluctuations ( $\rmsf$ ). Denote by $\average{\atomi{\tilde{c}}{k}}$ the average position of the $k$-th atom in the aligned samples $\tilde{C}$. The $\rmsf$ of the $k$-th atom is defined by

$RMSF_k = \sqrt{\frac{1}{n} \sum_{i=1,\dots,n} \vvnorm{\atomi{\tilde{c_i}}{k}\ -\ \average{\atomi{\tilde{c}}{k}}}^2}.$

The second measure is the bounding box (in Cartesian coordinates) of the ensemble $\tilde{C}$ of aligned structures.
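
As an illustration of the two measures above, here is a minimal Python sketch. It assumes the conformations are already aligned and stored as lists of (x, y, z) tuples; the function names `rmsf` and `bounding_box` are hypothetical and do not belong to the package.

```python
import math

def rmsf(aligned):
    """Per-atom RMSF. aligned: list of n conformations, each a list of (x, y, z)."""
    n = len(aligned)
    num_atoms = len(aligned[0])
    values = []
    for k in range(num_atoms):
        # Average position of atom k over the aligned ensemble.
        mean = [sum(c[k][a] for c in aligned) / n for a in range(3)]
        # Mean squared deviation of atom k from its average position.
        msd = sum(
            sum((c[k][a] - mean[a]) ** 2 for a in range(3)) for c in aligned
        ) / n
        values.append(math.sqrt(msd))
    return values

def bounding_box(aligned):
    """Axis-aligned bounding box (lower corner, upper corner) of all atoms."""
    points = [p for c in aligned for p in c]
    lower = [min(p[a] for p in points) for a in range(3)]
    upper = [max(p[a] for p in points) for a in range(3)]
    return lower, upper

# Toy ensemble (assumed data): two conformations of a two-atom molecule.
ensemble = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)],
]
values = rmsf(ensemble)
box = bounding_box(ensemble)
```

In this toy ensemble the first atom does not move, so its RMSF is zero, while the second atom oscillates by one unit around its average position.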

Sampling sparsity via spanning trees

A conformational ensemble may contain clusters, i.e. groups of conformations such that pairwise distances within a cluster are smaller than distances between conformations across clusters. We investigate such properties using graphs.

Assume that a connected nearest neighbor graph $G$ (NNG, see Def. def-nng) has been built over the samples, and denote $\VSet{G}$ (resp. $\ESet{G}$ ) its vertex set (resp. its edge set). Assume that each edge $e=\{c_i,c_j\}\in \ESet{G}$ carries the quantity $\dCalC{c_i}{c_j}$. We compute a minimum spanning tree $\MST{G}$ of $G$. If the conformational ensemble contains $n$ conformations, this tree involves $n-1$ edges. For an edge $e=\{c_i, c_j\}$ of this graph, the edge length is the distance between the conformations $c_i$ and $c_j$, that is $\dCalC{c_i}{c_j}$. Denoting $E$ the edge set of the MST, and $e=\{c_i, c_j\}$ a particular edge, we report the following statistical summary for $E$:

$\min_{e\in E} \dCalC{c_i}{c_j}, \quad \underset{e\in E}{\median}\ \dCalC{c_i}{c_j}, \quad \max_{e\in E} \dCalC{c_i}{c_j}.$

To intuitively capture the significance of this summary, consider the situation where the ensemble has a cluster structure, with dense regions separated by mostly empty space. In that case, most of the edges are short ones, only those connecting clusters being long ones, which reads plainly from the statistical summary.
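
The following Python sketch illustrates how such a summary exposes a cluster structure. The edge lengths are made-up values and the function name `mst_edge_summary` is hypothetical; the package reports these statistics itself.

```python
import statistics

def mst_edge_summary(edge_lengths):
    """Return (min, median, max) of the MST edge lengths."""
    return (
        min(edge_lengths),
        statistics.median(edge_lengths),
        max(edge_lengths),
    )

# Assumed toy data: two tight clusters joined by a single long edge.
lengths = [0.9, 1.0, 1.1, 1.0, 8.0]
summary = mst_edge_summary(lengths)
```

Here the median stays close to the minimum while the maximum is much larger, which is the signature of a few long edges bridging dense regions.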

Persistence based clustering

When an ensemble features clusters, as seen e.g. from the statistical summary of the edge lengths of an MST, the next step consists of finding these clusters. A three-stage strategy consists of estimating the sample density at each sample of the ensemble, associating one cluster with each local maximum of the estimated density [53], [93], and filtering out spurious (i.e. small) clusters. We now briefly review these steps.

The first task is the density estimation. Assume that a nearest neighbor graph (NNG) has been built, so that each sample is linked to a number of its nearest neighbors, say its $k_n$ nearest neighbors. Intuitively, the density about a sample decreases as the distances to its nearest neighbors increase. Formally, denoting $k_n$ the number of nearest neighbors, $n^j(c)$ the $j$-th nearest neighbor of a sample $c$, and $V_d$ the volume of the unit ball in $\mathbb{R}^d$, the local density at sample $c$ can be estimated as [22]:

$\hat{f_n}(c) = \frac{1}{n\ V_d} \biggl(\frac{\sum_{j=1}^{k_n} j^{2/d}}{\sum_{j=1}^{k_n}(\dCalC{c}{n^j(c)})^2}\biggr)^{d/2}$
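
A minimal Python sketch of this estimator follows, assuming the distances from a sample to its nearest neighbors are already known. The function names `unit_ball_volume` and `knn_density` are hypothetical; the unit ball volume uses the standard formula $\pi^{d/2} / \Gamma(d/2 + 1)$.

```python
import math

def unit_ball_volume(d):
    """Volume V_d of the unit ball in R^d."""
    return math.pi ** (d / 2.0) / math.gamma(d / 2.0 + 1.0)

def knn_density(neighbor_distances, n, d):
    """Density estimate at a sample c.

    neighbor_distances: distances from c to its k_n nearest neighbors.
    n: total number of samples; d: dimension of the ambient space.
    """
    k = len(neighbor_distances)
    numerator = sum(j ** (2.0 / d) for j in range(1, k + 1))
    denominator = sum(r * r for r in neighbor_distances)
    return (numerator / denominator) ** (d / 2.0) / (n * unit_ball_volume(d))

# Assumed toy values: same sample count and dimension, different crowding.
dense = knn_density([0.1, 0.2, 0.3], n=100, d=3)
sparse = knn_density([1.0, 2.0, 3.0], n=100, d=3)
```

As expected, the sample whose neighbors are closer receives the larger estimate.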

To define clusters, consider the lifted NNG obtained by endowing each sample with its estimated density. We define a cluster as the catchment basin (watershed) associated with a local maximum of the estimated density [52]. Along the way, spurious local maxima are filtered out using topological persistence [52].
Prosaically, this process is analogous to that used in topography to define a peak on a mountain range: a peak is persistent if the elevation drop to the saddle leading to a higher peak is at least some user-defined value. If not, the peak may be considered a secondary peak of that higher peak. The user is referred to the package Morse_theory_based_analyzer for further details.
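
To make the peak and saddle analogy concrete, here is a simplified Python sketch of persistence-based clustering on a graph, using a union-find structure. It is an illustration only, not the algorithm implemented in Morse_theory_based_analyzer; the function name `persistence_clusters`, its parameters and the toy data are assumptions.

```python
def persistence_clusters(density, neighbors, tau):
    """Label each sample with the index of the peak of its cluster.

    density[i]: estimated density at sample i.
    neighbors[i]: samples adjacent to i in the NNG.
    tau: persistence threshold below which a peak is merged.
    """
    order = sorted(range(len(density)), key=lambda i: -density[i])
    parent = {}  # union-find forest over the samples processed so far

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in order:  # sweep the samples by decreasing density
        # Peaks of the clusters that already contain a neighbor of i.
        roots = {find(j) for j in neighbors[i] if j in parent}
        if not roots:
            parent[i] = i  # local maximum: i starts a new cluster
            continue
        # Attach i to the adjacent cluster with the highest peak.
        best = max(roots, key=lambda r: density[r])
        parent[i] = best
        # i acts as a saddle: merge adjacent peaks that are not persistent.
        for r in roots:
            if r != best and density[r] - density[i] < tau:
                parent[r] = best
    return [find(i) for i in range(len(density))]

# Assumed toy data: a path graph with three density peaks (samples 1, 3, 5).
density = [1.0, 5.0, 2.0, 4.0, 1.0, 1.2, 0.5]
neighbors = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4, 6], [5]]
labels = persistence_clusters(density, neighbors, tau=1.0)
```

With this threshold, the shallow peak at sample 5 (elevation drop of 0.2 to its saddle) is absorbed by the peak at sample 3, while the peaks at samples 1 and 3 remain separate clusters.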

Due to the high dimensionality of configuration spaces of molecular systems, the estimate of Eq. (eq-estimated-density) is typically small. For such cases, persistence diagrams used to perform the simplification (see again the package Morse_theory_based_analyzer ) are best plotted in log-log scale.

Computing pairwise distances

Pairwise distances between (selected) conformations can be computed so as to run, e.g., a dimensionality reduction algorithm such as multi-dimensional scaling or Isomap. Such low-dimensional embeddings are highly convenient to visualize the relative positions of a set of conformations, e.g. low-lying local minima.

To this end, we proceed in two steps:

First, all conformations are aligned in the same coordinate system, as specified by Eq. eq-aligned-conformations.
Second, pairwise Euclidean distances are computed between these conformations. The resulting distance matrix is then easily converted into a Gram matrix, from which multi-dimensional scaling can be executed [116].
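
The conversion from a distance matrix to a Gram matrix is commonly done by double centering the squared distances, as in classical multi-dimensional scaling. The Python sketch below assumes a symmetric matrix of Euclidean distances; the function name `gram_from_distances` is hypothetical.

```python
def gram_from_distances(dist):
    """Double centering: B = -1/2 * J * D2 * J, with D2 the squared distances
    and J the centering matrix. dist must be a symmetric square matrix."""
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    # Row means equal column means because the matrix is symmetric.
    row_mean = [sum(sq[i]) / n for i in range(n)]
    total_mean = sum(row_mean) / n
    return [
        [-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
         for j in range(n)]
        for i in range(n)
    ]

# Assumed toy data: three points on a line, at abscissae 0, 1 and 3.
dist = [
    [0.0, 1.0, 3.0],
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
]
gram = gram_from_distances(dist)
```

The entries of the resulting matrix are the inner products of the centered points; an eigendecomposition of this matrix then yields the embedding coordinates.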

Input: Specifications and File Types

The input is a list of conformations of the same molecule, which can be provided in several formats:

- PDB files: the input is a file listing PDB files of the same molecule.
- Conformation file: the input is a file listing each conformation as a D-dimensional point, i.e. the dimension of the conformation followed by the (x, y, z) coordinates of all the atoms, separated by blanks:
```
6 x11 y11 z11 x12 y12 z12
6 x21 y21 z21 x22 y22 z22
6 x31 y31 z31 x32 y32 z32
...
```
- Gromacs xtc file: the input is a file generated by the Gromacs software.
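
For illustration, the conformation file format described above can be read with a few lines of Python. This is a sketch under the assumption that each line holds one conformation, the dimension first; the function name `parse_conformations` is hypothetical and is not part of the package.

```python
def parse_conformations(text):
    """Parse lines of the form 'D x1 y1 z1 x2 y2 z2 ...' into coordinate lists."""
    conformations = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        dimension = int(tokens[0])
        coordinates = [float(t) for t in tokens[1:]]
        if len(coordinates) != dimension:
            raise ValueError(
                f"line {line_number}: expected {dimension} coordinates, "
                f"found {len(coordinates)}"
            )
        conformations.append(coordinates)
    return conformations

# Assumed toy content: two conformations of a two-atom molecule (D = 6).
sample = "6 0.0 0.0 0.0 1.5 0.0 0.0\n6 0.0 0.1 0.0 1.4 0.2 0.0\n"
conformations = parse_conformations(sample)
```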

It is also possible to provide an XML archive containing a boost graph representing a precomputed nearest neighbor graph (in order to skip the sometimes time-consuming construction of the NNG). All analyses are optional, so they must be requested explicitly on the command line. For example, the sampling diversity analysis is run using the option --sampling-diversity. Note that all other analyses require the computation of the NNG, meaning that the option --nng-builder should be used. Thus, a calculation is launched as follows:

> sbl-conf-ensemble-analysis-lrmsd.exe --points-file data/bln69_sampling.txt --sampling-diversity --pairwise-distances --nng-builder --num-neighbors 10 --mst --mtb --directory results --verbose --output-prefix --log

The main options of the program $\sblCEAL$ are:
--points-file string: plain text file listing all conformations as D-dimensional points
--sampling-diversity: run Sampling Diversity Analysis
--nng-builder: run the NNG Builder
--num-neighbors int: target number of neighbors for each vertex
--mst: run Minimum Spanning Tree Analysis
--mtb: run Morse Theory Based Analysis

File Name	Description
BLN69 conformations file	Sampling done from sbl-landexp-hybrid-BH-TRRT-BLN.exe

Input files for the run described in section Input: Specifications and File Types .

Output: Specifications and File Types

Preview	File Name	Description
General: log file, sampling diversity and sampling sparsity
	Log file	Log file containing high level information on the run of $\sblCEAL$
	Analysis xml file	Global analysis serialized using Boost into an XML archive
Module pairwise distances
	Distances plain text file	Matrix of pairwise distances
Module NNG builder
	NNG xml file	NNG serialized using Boost into an XML archive
Module MTB analysis: persistence based clustering
	Morse Smale Witten chain complex xml file	Morse Smale Witten chain complex serialized using Boost into an XML archive
	Stable manifold partition xml file	XML archive listing the samples repartition by persistent basin
	Disconnectivity forest image file	Disconnectivity forest drawn in eps file format
	Sorted basins plain text file	List of all basins sorted by persistence
	Persistence diagram plot script	Gnuplot script for the persistence diagram
	Persistence diagram image	Persistence diagram drawn by gnuplot in pdf file format
	Persistences plain text file	List of all finite persistences
	Persistence histogram plot script	R script for the persistence histogram
	Persistence histogram image	Persistence histogram drawn by R in pdf file format

Output files for the runs described in section Input: Specifications and File Types, classified by modules – see Fig. fig-cea-workflow .

Algorithms and Methods

Sampling diversity

Computing these measures requires performing a registration of each conformation into a unique coordinate system, as specified in Eq. (eq-aligned-conformations).

Using the first conformation $c_1$ as reference, one can transform every other conformation $c_i$ by applying to it the rigid transform defining the $\lrmsd$ between $c_1$ and $c_i$.

Sampling sparsity via spanning trees

Extracting an MST from a connected graph is a classical problem in computer science. We use Prim's algorithm, which grows the tree one vertex at a time: at each step, it adds the shortest available edge connecting a vertex already in the tree to a vertex not yet in it, so that no cycle is ever created.
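
A compact Python sketch of Prim's algorithm with a binary heap is given below for illustration. It is not the package's implementation; the function name `prim_mst`, the adjacency-list layout and the toy graph are assumptions.

```python
import heapq

def prim_mst(num_vertices, adjacency):
    """Return the MST edges (u, v, weight) of a connected undirected graph.

    adjacency[u]: list of (weight, v) pairs for the edges incident to u.
    """
    in_tree = [False] * num_vertices
    in_tree[0] = True  # grow the tree from vertex 0
    heap = [(w, 0, v) for w, v in adjacency[0]]
    heapq.heapify(heap)
    mst = []
    while heap and len(mst) < num_vertices - 1:
        w, u, v = heapq.heappop(heap)  # shortest candidate edge
        if in_tree[v]:
            continue  # both endpoints already in the tree: would close a cycle
        in_tree[v] = True
        mst.append((u, v, w))
        for w2, x in adjacency[v]:
            if not in_tree[x]:
                heapq.heappush(heap, (w2, v, x))
    return mst

# Assumed toy graph on four vertices, given as (u, v, weight) edges.
edges = [(0, 1, 1.0), (1, 2, 2.0), (0, 2, 2.5), (2, 3, 1.0), (1, 3, 4.0)]
adjacency = [[] for _ in range(4)]
for u, v, w in edges:
    adjacency[u].append((w, v))
    adjacency[v].append((w, u))
mst = prim_mst(4, adjacency)
```

The edge lengths collected in `mst` are the values from which the min, median and max summary is computed.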

Figure: A minimum spanning tree (MST) connecting conformations. The MST is computed once all conformations have been registered in the same coordinate system. The distribution of edge lengths provides information on the sampling density.

Persistence Based Clustering

Once the density has been estimated (Eq. eq-estimated-density), the persistence based clustering is carried out using the algorithms implemented in the package Morse_theory_based_analyzer. See also the module MTBA in the workflow of section Algorithms and Methods.

Other versions of the program can be derived by:
- changing the representation used for conformations,
- changing the distance between conformations.

Deriving such versions relies on two important ingredients, the workflow class and its traits class, as we shall see now.

The Traits Class

T_Conformational_ensemble_analysis_traits:

The Workflow Class

T_Conformational_ensemble_analysis_workflow:
