Structural Bioinformatics Library
Template C++ / Python API for developping structural bioinformatics applications.
User Manual

Authors: F. Cazals and T. Dreyfus

Space_filling_model_coarse_graining

Goal: Designing Volume Based Coarse Grain Models

In this application, we present various volume based geometric approximation schemes, so as to design (hierarchical) coarse grain (CG) representations of a molecular geometric model. While our focus is on the preservation of molecular volumes, the geometric models generated can further be decorated with bio-physical properties.

Volumetric Approximations

Consider a molecular geometric model $S$ defined as a union of balls, e.g. the solvent accessible model of an atomic resolution system. Also assume that one has a geometric approximation scheme to approximate a collection of $n$ particles (atoms or pseudo-atoms) by a set of $k$ balls, with $k<n$. In section Algorithms and Methods , we shall actually deal with three such schemes:

  • Inner approximation. An approximation by default since the union of the $k$ balls defining the approximation is contained within the union of the $n$ input particles.
  • Outer approximation. An approximation by excess since the union of the $k$ balls defining the approximation contains the union of the $n$ input particles.
  • Interpolated approximation. An approximation such that its volume matches that of the union of the input particles. The shape defined by the $k$ balls is sandwiched between the inner and outer approximations, whence the name interpolated. Such an approximation is termed volume preserving.

As we shall see, volume preserving approximations of atomic models can be obtained using a number of balls $k$ equal to 2-5% of the number of atoms.

From Atomic to Coarse Grain Models

Our algorithms are of special interest to coarse grain a molecular geometric model $S$ represented by its solvent accessible model.
In this context, we actually provide several alternatives:

  • $\text{\ballcovorEU}$ (U for Uniform). The coarse-graining is carried out on the volume defined by the solvent accessible model of the molecular geometric model $S$. In doing so, large balls filling the middle of the shape are used – such balls typically cover several amino acids. This strategy is the one of choice if one wishes to compute a geometric approximation of a model, regardless of the size of the balls used.
  • $\text{\ballcovorEA}$ (A for amino-acids) The coarse-graining is carried out separately on each amino acid of the molecular geometric model $S$. The returned CG model is the union of these local CG models.
One could envision a strategy where one ball would be imposed for each $\Calpha$ carbon. Doing so would allow defining a covalent structure on the coarse-grain model.


  • $\text{\ballcovorEC}$ (C for Constrained) A class of models intermediate between $\text{\ballcovorEU}$ and $\text{\ballcovorEA}$. The coarse-graining is carried out independently on selected a.a., and on their complement. For example, to model cross-linking experiments, one may impose pseudo-atoms to model the exposed LYS separately.

Hierarchical Coarse Grain Models

The aforementioned algorithm can be used iteratively to produce a hierarchy of models. For example, in the context of reconstruction by data integration [5], one may build a hierarchy of coarse-to-fine models. In a first step, one uses the program $\text{\ballcovorEU}$ to compute an interpolated approximation of a molecule with N (=10000) atoms, using n (=1000) balls. Then, one iteratively feeds the balls of the approximation obtained to $\text{\ballcovorET}$ to obtain an ever simpler approximation of the same volume. For example, starting with N=10,000 atoms and selecting at each step 10% of the balls yields models of size 1000, 100 and 10.

Using Ballcovor: Coarse Graining a Family of 3D Balls

This section describes how to use the program $\text{\ballcovorET}$.

Pre-requisites: Volumetric Approximation Schemes

To present the workflow, we specify more carefully the geometric approximation schemes introduced in [44] to approximate a collection of balls by a smaller set of balls (Fig. Geometric approximations).

The following pieces of notation are used in the sequel: if $\calX$ refers to a collection of balls, $\domain{\calX}$ refers to the corresponding domain i.e. $\domain{\calX} = \cup_{B_i\in \calX} B_i$.

Inner approximation Given a 3D model $\domainO$ consisting of the union of $n$ balls $\calO$, find a domain $\domainS \subset \domainO$, defined by the union of $k<n$ balls $\calS$, such that the volume of $\domainO \backslash \domainS$ is minimized.


(Concentric) Outer approximation Given a 3D model $\domainO$ consisting of the union of $n$ balls $\calO$, find a domain $\domainS\supset \domainO$, defined by the union of $k<n$ balls $\calS$, derived from an inner approximation. The approximation is termed concentric if the balls used to define $\domainS$ are concentric with those of an inner approximation.


(Concentric) Interpolated approximation Given a 3D model $\domainO$ consisting of the union of $n$ balls $\calO$, find a domain $\domainS$ sandwiched between an inner approximation and the associated concentric outer approximation. The approximation is called volume preserving provided that $V(\domainS) = V(\domainO)$. It is termed concentric if the balls used to define $\domainS$ are concentric with those of an inner approximation.


Geometric approximations: inner / outer / interpolated.

Illustrations on a protein complex consisting of 9060 atoms (PDB id: 1fin)

In this package, we provide a general (and optimal) solution to problem pb-inner-coverering. However, we only address the design of concentric outer and interpolated approximations. The reason for doing so is twofold. First, defining an outer approximation by growing the balls of an inner approximation defines a so-called Toleranced Model (TOM), namely a one-parameter family of shapes obtained by linearly interpolating the radii of the balls between the inner and outer radii. Such models proved instrumental to model large protein assemblies plagued with uncertainties on the shapes and positions of the sub-units [35] , [75] , [76] . Second, as we shall see, growing tight inner approximations produces tight outer approximations.

Input: Specifications and File Types

The main input of $\text{\ballcovorET}$ is a collection of balls. The file format used is a text file listing the balls. A basic calculation is launched as follows:

> sbl-ballcovor-txt.exe -f data/spheres.txt --n-inner-balls 20 --outer --interpolated --directory results --verbose --output-prefix --log
The main options of the program $\text{\ballcovorET}$ are:
-f string: file listing the input family of 3D balls
–n-inner-balls int: number of balls to select in the inner approximation
–outer: run the outer approximation
–interpolated: run the interpolated approximation


Note that a default radius of $1.4 \AA$ is added to all atoms to define the Solvent Accessible Model of the input molecule.

File Name

Description

3D Spheres text file

A list of 3D spheres (a line contains the coordinates and the radius of one sphere)

Input files for the run described in section Input: Specifications and File Types .

Output: Families of 3D balls of the Approximations

PreviewFile Name

Description

General: log and visualization file

Log file

Log file containing high level information on the run of $\text{\ballcovorET}$

Click it Approximations VMD file

VMD visualization state file of inner, outer and interpolated covers

Module inner approximation

Inner approximation text file

Text file listing the balls in the inner approximation

Module outer approximation

Outer approximation text file

Text file listing the balls in the outer approximation

Module interpolated approximation

Interpolated approximation text file

Text file listing the balls in the interpolated approximation

Output files for the run described in section sec-ballcovor-using-balls-input .

Using Ballcovor: Coarse Graining Molecules

This section describes how to use the programs $\text{\ballcovorEU}$ and $\text{\ballcovorEC}$ in different contexts. The pre-requisites being identical to those of $\text{\ballcovorET}$, the reader is referred to section Pre-requisites: Volumetric Approximation Schemes .

Input: Specifications and File Types

The main input of $\text{\ballcovorEU}$ is a molecular model loaded from a PDB file. A basic calculation is launched as follows:

> sbl-ballcovor-pdb-U.exe -f data/1vfb.pdb --n-inner-balls 20 --outer --interpolated --directory results --verbose --output-prefix --log
The main options of the program $\text{\ballcovorEU}$ are:
-f string: PDB file of the molecule to coarse grain
–n-inner-balls int: number of balls to select in the inner approximation
–outer: run the outer approximation
–interpolated: run the interpolated approximation


File Name

Description

1vfb PDB file

PDB file of an Immunoglobuline-Antibody complex

Input files for the run described in section Input: Specifications and File Types .

Output: Families of 3D balls of the Approximations

PreviewFile Name

Description

General

Log file

Log file containing high level information on the run of $\text{\ballcovorEU}$

Click it Approximations VMD file

VMD visualization state file of inner, outer and interpolated covers

Inner approximation

Inner approximation text file

Text file listing the balls in the inner approximation

Outer approximation

Outer approximation text file

Text file listing the balls in the outer approximation

Interpolated approximation

Interpolated approximation text file

Text file listing the balls in the interpolated approximation

Output files for the run described in section Input: Specifications and File Types, classified by modules – see Fig. fig-ballcovor-workflow .

Example Application: Modeling Cross-linking Experiments

We provide one example application of the algorithm $\text{\ballcovorEC}$. Given a macro-molecular model, cross-link experiments report pairs of LYS which get cross-linked. In order for two LYS to cross-link, two conditions are necessary: these LYS must be exposed, and their distance must be compatible with the size of the cross-linker.

We can therefore use algorithm $\text{\ballcovorEC}$ to design CG models of a molecular model, so as to CG separately the exposed LYS and the atoms of all remaining a.a.

Visualization, Plugins, GUIs

The SBL provides VMD and PyMOL plugins to use the programs of Space_filling_model_coarse_graining . The plugins are accessible in the Extensions menu of VMD or in the Plugin menu of PyMOL . Upon termination of a calculation launched by the plugin, the following visualizations are available:

  • the inner approximation of the input molecule,
  • (optionally) the outer approximation of the input molecule,
  • (optionally) the interpolated approximation of the input molecule.

Algorithms and Methods

In this section, we sketch the algorithms used to compute the three geometric approximations.

Inner Approximation

The main ingredients of the inner approximation are as follows [44] :

  • the input consists of a collection of balls $\calO$ defining a region $\domainO$,
  • from domain $\domainO$, one computes a finite set of balls $\calC$ associated with the medial axis transform of the domain $\domainO$ – see package Union_of_balls_medial_axis_3, and defining a covering of $\domainO$, that is, one has $\domainO = \domainC$. It is important to note that in general, the set $\calC$ is different from the set of input balls $\calO$.
  • The algorithm consists of iteratively selecting from $\calC$ the ball providing the best volume increment — a selection done using a priority queue – see package Greedy_selection. Upon selecting a ball say $B_i$, we recompute the volume increments of the remaining candidate balls intersecting $B_i$.
Computing the volume of a union of balls is a difficult problem, from a combinatorial, but also numerical standpoint—inverse trigonometric functions are involved. We use our algorithm [43] which returns an interval certified to contain the exact volume – see package Union_of_balls_surface_volume_3.


That is the specification of the algorithm requires a collection of balls $\calO$ defining a region $\domainO$, a selection size $k$ or a target ratio $\tau$ between the volume of $\domainS$ and that of $\domainO$ — that is one expects $\volume{\domainS} / \volume{\domainO} \geq \tau$. The output consists of an ordered set of balls $\calS \subset \calC$, together with the increment in volume associated to each ball.

Lazy ball selection. One comment is in order about the incremental selection of the ball yielding the best increment, called the dominating ball in the sequel. We perform this operation using the following lazy strategy.

As a pre-processing step, we compute the intersection graph of the candidate balls, namely the graph with one vertex per ball $B_i\in \calC$, and one edge for every pair of intersecting balls. Incidences in this graph are used to identify the balls whose volume increment needs to be recomputed.

More precisely:

  • Step 0. Upon selecting a ball $B_i$, the remaining candidate balls intersecting $B_i$ are tagged as obsolete.
  • Step 1. We then pop the top ball from the priority queue and recompute its volume if it is obsolete.
  • Step 2. If this ball still yields the best increment, it gets selected. Else, the ball is re-inserted into the priority queue, and we move back to step (1).

Outer Approximation

Denote $\domainOB$ the bounary of the domain $\domainO$. To solve the outer approximation problem, we enlarge the balls from the set $\calS$, so as to cover the portion of $\domainOB$ sticking out from $\domainS$.

To see how, recall that the Apollonius diagram of a collection of balls ${B_i(c_i, r_i)}$ is the Voronoi diagram defined for the following generalized distance [27] :

$ \delta_i(p) = \mid\mid~{p-c_i}~\mid\mid - r_i $

.

Note that the Apollonius distance is merely the Euclidean distance to the sphere $S_i$ bounding $B_i$. For each ball $B_i$ of the selection $\calS$, consider the restriction of the boundary of the input domain not covered yet by the domain $\domainS$, to its Voronoi cell $\vorCellApo{B_i}$ in the Apollonius diagram.

If this restriction is non empty, we define the expansion radius $r_i^+$ of $S_i$ by the maximum distance of a point of that region to $S_i$, that is:

$ r_i^+ = \max_{p\in \partial\domainO\cap \vorCellApo{B_i} \text{ and } p\not\in\domainS} \delta_i(p) $

If this restriction is empty, the expansion radius is set to the original radius, that is $r_i^+ = r_i$. Expanding each ball of the selection $\calS$ by its expansion radius yields an outer cover of the input domain.

The class T_Space_filling_model_coarse_graining_outer_cover implements this computation without computing the partition of the boundary of the input object induced by the Apollonius Voronoi diagram.

Interpolated Approximation

Consider an inner approximation together with the associated concentric outer approximation, and denote $r_i$ and $r_i^+$ the inner and outer radii of the i-th ball.

Given a parameter $t\in [0, 1]$, we define the interpolated radius of the $i$th ball as

$ r_i(t) = (1-t)r_i + tr_i^+ $

An interpolated approximation is the union of these interpolated balls; it is called volume preserving if its volume matches that of the input shape.

Increasing the value of $t$ in Eq. eq-interpolated-radius yields nested balls, whence nested interpolated approximations. Therefore, finding the volume preserving interpolated approximation requires a binary search on $t\in[0,1]$. Practically, the binary search is stopped when the discrepancy between the volumes is less than $\eps_V=10^{-5}$.

The class T_Space_filling_model_coarse_graining_interpolated_cover implements this algorithm.

Problem specification: parameters
Input balls $\calO$ defining the domain $\domainO$ Selection size $k$ Boolean flag to enforce the connectivity of $\domainS$ Meshing precision $\eps_M$ for $\domainOB$ and $\domainSB$
Pre-processing
Compute:
– The Delaunay triangulation $\dtb$ of the input balls $\calO$, and the associated $0$-shape – the vertices of the boundary $\domainOB$ of $\domainO$ – the Delaunay triangulation $\dtv$ of these vertices, and the dual Voronoi diagram $\dtvs$
– the medial axis of $\domainOB$ using $\dtb$ and $\dtvs$, using the algorithm described in [9]
– the balls in $\calC$ defining the covering of $\domainO$, as specified by lemma ???
Is there a refereznce to the covering lemma ?
Inner approximation
– Select the set $\calS$ consisting of $k$ balls amidst $\calC$, using the greedy algorithm from section Inner Approximation
Optional: connectivity enforcement
– Add balls from $\calC \backslash \calS$ to enforce the connectivity of $\domainS$, as explained in section ???
Outer approximation
– Mesh the domains $\domainOB$ and $\domainSB$ with precision $\eps_M$, using the $\codec{Mesher\_3}$ mesher from the CGAL library.
– Compute the expansion radii of the balls in the selection $\calS$, as specified in section Outer Approximation
Interpolated approximation
– Compute the interpolated radii of the balls defining the interpolated approximation, as specified in section Interpolated Approximation
Computing the inner, outer, and interpolated approximations of a domain $\domainO$ defined as a union of balls: overview.
subsection sec-ballcovor-algos-candidate-balls The set $\calC$ of candidate balls
As recalled in section Inner Approximation, the candidate balls defining the set $\calC$ are only centered on the vertices of the medial axis.
subsection sec-ballcovor-algos-volume-selection The volume of the selected balls
More precisely, due to the impossibility to obtain a volume as an exact number type, whatever the kernel used (exact, inexact), the centers and radii of the candidate balls are converted to doubles. These balls are input to our algorithm, which requires two template parameters: the number type of the output (double or interval), and the level of exactness used to compute the constructions involved in the volume computation, namely the coordinates of Voronoi vertices, and boundary points of the union of the selected balls. Following the discussion in [43] , the three options are referred to as (faster, ck_pt_exact and all_exact).
Practically, we use the pair (double, faster) for the inexact kernel, and (interval, all_exact) for the exact kernel.


Programmer's Workflow

The programs of Space_filling_model_coarse_graining described above are based upon generic C++ classes, so that additional versions can easily be developed.

In order to derive such versions, there are two important ingredients, that are the workflow class, and its traits class.

The Traits Class

T_Space_filling_model_coarse_graining_traits:

The Workflow Class

T_Space_filling_model_coarse_graining_workflow: