Structural Bioinformatics Library
Template C++ / Python API for developping structural bioinformatics applications.
User Manual

Molecular_coordinates

Authors: F. Cazals and T. Dreyfus and T. O'Donnell

Introduction

This package discusses the representations used to represent molecular conformations. Such coordinates are cornerstones of the following tasks [157] , [174] , [97] :

  • Energy calculations: computing the potential energy of a (macro-)molecular system, using a so-called force field, see the package Molecular_potential_energy.

  • Structure minimization: minimizing the potential energy of a system by following its negative gradient.

  • Structure generation: generating a novel structure given a known structure – aka a move set.

Pre-requisites: representing molecular conformations

Internal coordinates

We consider a molecular graph, as introduced in the package Molecular_covalent_structure.

Internal coordinates represent the geometry of a molecule in terms of bond lengths, valence angles, and dihedral angles [141] .

Bond lengths.

Bonds are defined by two points connected in the molecular covalent graph~(Fig. fig-ic-bl-va (A)).

Valence angles.

Valence angles are defined by around a particle participating in two such bonds, thereby defining an angle (Fig. fig-ic-bl-va (B)).

Dihedral angles.

Dihedral angles come into two guises: proper and improper.

For proper angles, consider four consecutive atoms on a path: the dihedral angle in the angle between the two planes defined by the first three and the last three atoms (Fig. fig-ic-dh (A)).

For improper angles, consider a central atom $A_j$ connected to three atoms, say $A_i, A_k, A_l$. (Fig. fig-ic-dh (B)). Pick a second atom the define a hinge, e.g. $A_k$ to define the hinge $A_jA_k$. The improper dihedral angle is the angle between the planes $A_jA_kA_i$ and $A_jA_kA_l$ (Fig. fig-ic-dh (B) ). Note that an improper angle can be thought as the off planarity angle of atom $A_l$ with respect to the plane $A_jA_kA_i$.

The types used to represent IC are obtained from the class SBL::CSB::T_Molecular_covalent_structure. In particular, SBL::CSB::T_Molecular_covalent_structure::Particle_rep is used to define the types Covalent_structure::Bond_rep, Covalent_structure::Bond_angle_rep, Covalent_structure::Torsion_angle_rep.
A canonical i.e. unique representation of each internal coordinate representation can be obtained, as explained in the package Molecular_covalent_structure.


Bond length and valence angle
The bond length is the distance between two atoms. The valence angle is the angle at the apex of the triangle formed by two covalent bonds sharing a central atom.

Dihedral angle: proper and improper
A proper dihedral angle is defined by three consecutive atoms. An improper dihedral angle is defined by a central atom connected to three others: the angle is that defined by two planes sharing an edge of the tetrahedron involving the four atoms.
A proper dihedral angle is usually stored with the four tuple of atoms forming the path of length four. To store an improper angle, one must identify the central atom and the hinge. In $\text{\charmm}$ for example [158] , the semantics of the four tuple $(A, B, C, D)$ is as follow: $A$ is the central atom; the angle is that between the two planes $(A, B, C)$ and $B, C, D$. Note that this amounts to take the pair $(B, C)$ as hinge – even though it may not be a chemical bond.


Coordinates and degrees of freedom

Primitive internal coordinates refer to the set of all internal coordinates (bond lengths, valence angles, dihedral angles) defined from a covalent structure.


Internal coordinates are classically represented using a so-called Z-matrix (Z-matrix).

To understand subtleties between the various types of coordinates, the following notations will be useful:

  • $\numat$: number of particles.

  • $\numcc$: number of Cartesian coordinates. One has $\numcc = 3\times \numat$.

  • $\numdof$: number of degrees of freedom. For a non-linear molecules, removing rigid motions (3 degrees of freedom for rotation and translation), one gets $\numdof = 3\numat - 6$. (NB: we omit in the sequel the case of linear molecules.)

  • $\numic$: number of primitive internal coordinates. In general, one has $\numic > \numdof$. This set of coordinates depends on the topology of the molecule studied, see examples exple-ic-cyclobutane and exple-ic-fluoroethylene .

We now discuss several examples illustrating these definitions.

Consider the two molecules from Fig. fig-internal-coords-examples (A,B). Both involve $\numat = 4$, wich results in $\numcc=12$ and $\numic = \numdof = 12-6 = 6$. However, for case (A), these 6 internal coordinates are three bond lengths, two valence angles, and one dihedral angle. But for case (B), one gets three bond lengths and three valence angles. That is, the internal coordinates available depend on the graph topology.


Consider the cyclobutane from Fig. fig-internal-coords-examples (C). With 4 bond lengths, 4 valence angles, 4 dihedral angles, one gets $\numic = 12$. On the other hand, $\numdof = 4\times 3 - 6 = 6$. The following internal coordinates are sufficient to represent uniquely the geometry: four bond lengths, one valence angle, one dihedral angle.


Consider the fluoroethylene from Fig. fig-internal-coords-examples (D). It is easily seen that there are 5 bond lengths, 6 valence angles, 4 dihedral angles. For the latter, observe that once one has fixed the $C=C$ bond, one can form 4 tuples (whence four dihedral angles) by taking the Cartesian products of the groups of atoms bonded to the two carbons. Thus, $\numic = 15$. But since $\numdof = 3\times 6 -6 = 12$, it turns out that the 15 internal coordinates have three redundancies. They can be identified as explained in [15] .


Covalent structure and coordinates: Cartesian coordinates, internal coordinates, and degrees of freedom.
See examples exple-ic-one , exple-ic-cyclobutane and exple-ic-fluoroethylene for details. (A) A molecular graph with 4 covalent bond lengths, 2 valence angles, and 1 dihedral angle. (B) A molecular graph with 3 covalent bond lengths and 3 valence angles (C) Cyclobutane: 4 bond lengths, 4 valence angles, 4 dihedral angles. (D) Fluoroethylene: 5 bond lengths, 6 valence angles, 4 dihedral angles. Inset: indices of the atoms for the z-matrix representation of Fig. fig-z-matrix-fluoroethylene .

Z-matrix representation of fluoroethylene from Fig. fig-internal-coords-examples (D)
(From [15] .)
C2H3F
APtclcactv11091612353D 0   0.00000     0.00000
 
  6  5  0  0  0  0  0  0  0  0999 V2000
   -1.0606    0.1723    0.0001 F   0  0  0  0  0  0  0  0  0  0  0  0
    0.1319   -0.4627   -0.0005 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.2458    0.2325    0.0001 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.1690   -1.5420    0.0030 H   0  0  0  0  0  0  0  0  0  0  0  0
    2.1991   -0.2751   -0.0004 H   0  0  0  0  0  0  0  0  0  0  0  0
    1.2087    1.3119    0.0010 H   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
  2  3  2  0  0  0  0
  2  4  1  0  0  0  0
  3  5  1  0  0  0  0
  3  6  1  0  0  0  0
M  END
$$$$
Fluoroethylene in MOL/SDF format.Generated with Jmol (right mouse click / File / Save.

Internal coordinates: primitive, natural, delocalized

As illustrated by example exple-ic-fluoroethylene, the choice of a coordinate system to represent a molecule may be non trivial. It is even more so in the presence of cycles. The choice of the representation depends on the problem tackled, and is of paramount importance when geometric optimization (minimization of the potential energy) is carried out. The reader is referred to the excellent overview provided in the Q-Chem user manual (Q-Chem and more specifically here).

In short:

  • Primitive internal coordinates are encoded in the graph topology. As illustrated by example exple-ic-fluoroethylene, such coordinates are usually redundant, so that deciding of a non redundant set of coordinates does not admit a unique solution. See also [156] .

  • To reduce the coupling, both harmonic and anharmonic, between internal coordinates, natural internal coordinates were designed [147] , and algorithms to derive them proposed [84] . These algorithms remain complex, though.

  • Following the spirit of natural internals, yet with much simpler generation algorithms, delocalized internal coordinates were finally proposed [15] . In a nutshell, these coordinates, which stem for a linear transformation between redundant internal coordinates and Cartesian coordinates, are characterized as follows:
    • they form a complete and non redundant set of $\numdof$ internal coordinates,
    • each such coordinate is a linear combination of a number of all internal coordinates.

In the sequel, we focus on primitive internal coordinates (PIC) and delocalized internal coordinates (DIC).

Primitive internal coordinates (PIC)

Algorithm

Given a molecular graph, see package Molecular_covalent_structure, all primitive internal coordinates are generated as follows:

  • bond lengths: one iterates over the edges of the graph.

  • valence angles: one iterates over the pairs of consecutive edges of the graph.

  • proper dihedral angles: one iterates over the edges $e=(a,b)$ of the graph; for each edge, one takes the Cartesian products of the particles bonded to a and b, and generate a dihedral angle for each such pair. See for example the four dihedral angles in example exple-ic-fluoroethylene .

Representation

The primitive internal coordinates are computed using the class SBL::CSB::T_Molecular_primitive_internal_coordinates < ConformationType , CovalentStructure >, where ConformationType represents a conformation with Cartesian coordinates as defined in the package Molecular_conformation, and CovalentStructure represents a covalent structure as defined in the package Molecular_covalent_structure.

The class SBL::CSB::T_Molecular_primitive_internal_coordinates provides methods for computing the primitive internal coordinates of :

  • a given bond : SBL::CSB::T_Molecular_primitive_internal_coordinates::get_distance ,
  • a valence angle : SBL::CSB::T_Molecular_primitive_internal_coordinates::get_bond_angle ,
  • a proper dihedral angle : SBL::CSB::T_Molecular_primitive_internal_coordinates::get_dihedral_angle .
Consider the case of dihedral angles involving a double bond. Its cis-trans configuration is determined using the method SBL::CSB::T_Molecular_primitive_internal_coordinates::get_cis_trans_orientation returning an enum value (ISOMERISM_CIS, ISOMERISM_TRANS or ISOMERISM_UNDEF).


The class SBL::CSB::T_Molecular_primitive_internal_coordinates provides also methods to fill arrays (or other compliant data structures) with all primitive internal coordinates for :

  • bonds : SBL::CSB::T_Molecular_primitive_internal_coordinates::get_bond_distances ,
  • valence angles : SBL::CSB::T_Molecular_primitive_internal_coordinates::get_bond_angles ,
  • dihedral angles : SBL::CSB::T_Molecular_primitive_internal_coordinates::get_dihedral_angles .

From Cartesian coordinates to internal coordinates

Conversion

Equations.

Internal coordinates are defined as follows:

Bond length. The bond length between two particles $A_i$ and $A_j$ satisfies:

$ \dij = \sqrt{ (x_i-x_j)^2 + (y_i-y_j)^2 + (z_i-z_j)^2}. $


Valence angle. The valence angle at particle $A_j$ is defined by:

$ \theta = \arccos{ \frac{ \dotp{A_jA_i}{A_jA_k} } { \vvnormt{A_jA_i}\vvnormt{A_jA_k} } }. $


Dihedral angle (proper). Denoting $n_1$ and $n_2$ the normal vectors to the two planes defined by particles $A_1\dots A_3$ and particles $A_2\dots A_4$, the dihedral angle is defined by:

$ \Phi = \arccos \frac{\dotp{n_1}{n_2}} {\vvnormt{n_1} \vvnormt{n_2}}. $


Note that the orientations of $n_1$ and $n_2$ determine the sign of the dot product between $n_1$ and $n_2$, whence the value of $\Phi$.

The two possible orientations of $\Phi$ can be illustrated with cis-trans isomerism: in the trans configuration, the dihedral angle is expected to be $\pi$, while in the cis configuration, it is expected to be $0$.

From internal coordinates to Cartesian coordinates.

There exist several strategies to convert back internal coordinates to Cartesian coordinates, and these differ in several respects, namely the number and the type of floating point operations. As shown in [141] , the most efficient one in the so-called SN-NeFR method.

The Wilson B matrix

Displacements in internal and Cartesian coordinates are related by the so-called B matrix. To define it, consider:

  • $q$: $n\times 1$ vector of internal coordinates.

  • $X$: $\numcc \times 1$ vector of Cartesian coordinates.

Following [157] , [97] and the references therein, we define:

The Wilson $B$ matrix is the $n\times \numcc$ matrix defined as the Jacobian of the transformation converting Cartesian coordinates into internal coordinates, that is, denoting $d$ the differential used in multivariate calculus: $ dq = B\ dX. $


Practically, a row of matrix $B$ is obtained by differentiating the relevant equation ( Eq. (eq-bond-length) or Eq. (eq-valence-angle) or Eq. (eq-torsion-angle)) with respect to the Cartesian coordinates defining the variable associated with the row.

From internal coordinates to Cartesian coordinates

We cover several embedding operations:

  • Embedding a 4-rth atom using a dihedral angle
  • Embedding the $\Cbeta$ carbon using distances and valence angles

Embedding for a dihedral angles

The operation which consists of computing the Cartesian coordinates of one atom is called the embedding step. This operation requires a context, that is 3 atoms already embedded, with respect to which the new atom is positioned.

This operation is then repeated to embed all atoms.

The Nerf and SN-NeRF algorithms

In the following, we present the $\text{\snnerf}$ method, from [141].

Given a set of points $A_i$ with $i\in{1,2,3,4}$ with known embeddings for the first three and the relative position of the fourth ( $d_3,\theta_2,\tau_1$) (Fig. fig-nerf-embedding), the aim is to embed $A_4$.

The first operation plainly consists of using spherical coordinates in a suitable coordinate system centered at $A_3$}. (Nb: this coordinate system is called specialized reference frame in [7] .) This yields the following coordinates for $A_4^*$:

\begin{equation} A_4^*=(d_3cos(\theta_2),d_3cos(\tau_1)sin(\theta_2),d_3sin(\tau_1)sin(\theta_2)) \end{equation}

The second operation consists performing a rotation + translation to transform the previous coordinate system into that of the world/lab. The final position of point $A_4$ reads as

\begin{equation} A_4=RA_4^*+A_3, \end{equation}

with

$\hat{A}_{2-3}=\frac{A_2A_3}{|A_2A_3|}$ and $\hat{n}=\frac{A_{1}A_{2}\times\hat{A}_{2-3}}{|A_{1}A_{2}\times\hat{A}_{2-3}|}$

we obtain $R$:

\begin{equation} R=[\hat{A}_{2-3},\hat{n}\times\hat{A}_{2-3},\hat{n}] \end{equation}

Cartesian embedding of a point given (i) a context defined by three points/atoms, and (ii) a dihedral angle.
The $\text{\snnerf}$ algorithm (Self-Normalizing Natural Extension Reference Frame) [141] simplifies the $\text{\nerf}$ method by avoiding normalization operations on vectors. Parallel versions have also recently been proposed, see [7], [18], and [6].


Iterative embedding

We use the previous operation to iteratively embed a molecule, given all internal coordinates.

Initialization. The initialization consists of embedding three particles connected in a path, using two distances and an angle in an arbitrary Cartesian reference frame. In our case (Fig. fig-from-ic-to-cc):

  • $A_1(0,0,0)$
  • $A_2(0,0,d_1)$
  • $A_3(0,sin(\theta_1)*d_2,d_1-cos(\theta_1)*d_2)$

Iterative embedding of the remaining particles. The embedding of the remaining particles in the coordinate system defined by the first three is computed while performing a traversal of the molecular covalent graph. To describe the algorithm, we use the notion of context, namely three connected atoms which have already been embedded, and with respect to which the new particle is positioned.

The traversal is performed using two stacks: one corresponding to the points to be embedded next; the second one refers to the contexts (one for each atom to be embedded).

Note that the initialization makes it possible to stack all the neighbors of the first three atoms – their context is defined by these three atoms.

Then, the algorithm proceeds iteratively as follows:

  • The particle on the top of the stack is popped, and is embedded using its context. The particles linked to it by an edge in the covalent graph and not already visited are stacked, together with their context.
  • Each time an embedded particle is stacked it is tagged to avoid processing twice.

The process terminates when the stacks are empty.

Conversion from internal to Cartesian coordinates. In this illustration the graph is traversed from left to right and each colored particle is embedded using the previous three as context and the three internal coordinates with the same color.
For graphs with multiple connected components (c.c.), the process is iterated for each c.c.


The difference between NeRF and $\text{\snnerf}$ is that when computing $\hat{n}\times\hat{A}_{2-3}$, no normalization is used as it equals 1 by construction. Whence the Self-Normalizing Natural Extension Reference Frame name. Also the norm $|A_2A_3|$ used when embedding a point is stored as the points of the covalent graph are embedded such that it is not recomputed.


Implementation: main class. The conversion is performed using the operator of class SBL::CSB::T_Molecular_cartesian_coordinates. The template arguments are the Conformation type and the Covalent structure. As parameter for the operator the covalent structure and the internal coordinates are required (bond lengths, bond angles and dihedral angles).

Implementation: utilities. Static functions are defined in SBL::CSB::Molecular_coordinates_utilities. These can be used to compute bond and dihedral angles given vectors and to embed a particle given three others and internal coordinates as done in $\text{\snnerf}$ .

Embedding the Cbeta carbon atom

To graft the $\Cbeta$ carbon of a side chain on a known backbone, one uses the following internal coordinates:

  • The three valence angles $\theta_N = \angle N\Calpha C_\beta, \theta_C = \angle C\Calpha C_\beta, \theta = \angle N\Calpha C$. The half-angle is denoted $\theta' = \theta/2$.
  • The bond length $d=\vvnorm{\Calpha\Cbeta}$. We also denote $d_N = \vvnorm{\Calpha N}, d_C=\vvnorm{\Calpha C}$.

Analytically, define the following –see [odonnell2022modeling] :

\begin{align} t^2 = &+4*\cos\theta'^2*\cos\theta_N*\cos\theta_C - 4*\sin\theta'^4 - 4*\cos\theta'^2 - \\ &\cos\theta_N^2 - 2*\cos\theta_N*\cos\theta_C - \cos\theta_C^2 + 4 \end{align}

One obtains the two possible embeddings for $\Cbeta$:

\begin{equation} \begin{cases} y &= \frac{-d*(\cos\theta_N - \cos\theta_C)}{2*\sin\theta'}\\ z &= \frac{-d*(\cos\theta_N + \cos\theta_C)}{2*\cos\theta'}\\ x_{\pm} = \frac{d * t_{\pm}} { \sin\theta} \end{cases} \end{equation}

To graft a side chain onto a backbone, an option to position the $\Cbeta$ is to use an improper dihedral angle with respect to the plane $N \Calpha C$. However, because in force fields the force constants are much stiffer for valence angles (and bond lengths), the solution which respects two valences angles is preferable.


Placing the $\Cbeta$ carbon. Using three valence angles and one distance yields two solutions, the correct one being selected by a chirality argument.

The correct solutions is selected by chirality, noting that the volume of the tetrahedron defined by $N, C, \Calpha, \Cbeta$ is positive – in known structures.

Delocalized internal coordinates (DIC)

Algorithm

To derive the DIC, we follow [15] :

  • 1. Compute $G = B B^t$
  • 2. Extract the eigenvector/values of $G$. Denote $m$ the num. of $>0$ eigenvalues. Typically, $m=\numdof = 3\numat - 6$.

  • Denote $U$ and $R$, respectively, the eigenvectors associated to strictly positive and null eigenvalues. The eigen equation of $G$ satisfies

    $G (U R)= (U R) \left[ \begin{array}{c|c} \Lambda & 0 \\ \hline 0 & 0 \end{array} \right]$


    One may say that:

    • $U$: spans the non redundant subspace of the original space of primitive coordinates
    • $R$: spans the redundant subspace.

And their derivatives.

For the gradient in internal coordinates, see [157] , [97] and the references therein.

subsection mol-coords-dic-rep Representation
matrices from eigen


Implementation

Delocalized internal coordinates are not yet implemented.


Later: rep is using matrices
DIC.
And their derivatives.


Applications

This package offers also offers sbl-coordinates-converter.exe and sbl-coordinates-file-converter.exe to perform various conversions of coordinates.

Converting IC to CC : back and forth

The application sbl-coordinates-converter.exe provides the following conversions:

  • Converting Cartesian coordinates into internal coordinates, and
  • Converting internal coordinates into Cartesian coordinates,
  • converting XTC files s (if Gromacs is installed) into the Point-d file format used.

We note in passing that our encoding of internal coordinates is based on atom ids, as provided in PDB files.


Converting Cartesian coordinates to internal coordinates. Given a PDB file, this executable generates all IC (bond lengths, valence angles, dihedral angles):

sbl-cartesian-internal-converter.exe --pdb-file data/ala5.pdb

This call simply generates one txt file for each type of coordinate:

==> ala5_bonds.txt <==
#Chain X
#Atom ids bond length(Angstrom)
1 5 1.45994
5 11 1.51006

==> ala5_bond_angles.txt <==
#Chain X
#Atom ids bond angle(radii)
5 1 2 1.90268
5 1 3 1.90217

==> ala5_dihedral_angles.txt <==
#Chain X
#Atom ids dihedral angle(radii)
2 1 5 11 0
2 1 5 7 0.979966


Converting internal coordinates into Cartesian coordinates: all coordinates provided. A simple case is that were all IC are provided. In that case, one needs to provide

  • A pdb file which is used to learn the topology of the molecule. (Nb: the Cartesian coordinates are of no interest.)
  • Three files respectively providing the bond length, valence angles, and dihedral angles:
sbl-cartesian-internal-converter.exe --pdb-file ala5.pdb  --bond-lengths-file  ala5_bonds.txt --bond-angles-file ala5_bond_angles.txt  --dihedral-angles-file ala5_dihedral_angles.txt

This call simply generates a txt file containing the Cartesian coordinates – ala5_cartesian.txt in the previous example.


Converting internal coordinates into Cartesian coordinates: selected coordinates missing. In case selected IC are missing, a force field can be passed so as to use the equilibrium values of the corresponding models. For example, using $\text{\amber}$:

sbl-cartesian-internal-converter.exe --pdb-file ala5.pdb --bond-angles-file ala5_bond_angles.txt  --dihedral-angles-file ala5_dihedral_angles.txt --force-field-file data/amber-ff14sb.xml
As noted above, IC are encoded using atom ids. Developers can also use the vertex ids used in the molecular graph. The reader is referred to the Molecular_coordinates package for more details.


Conversion xtc – point_d format

The executable sbl-coordinates-file-converter.exe convertes xtc files to the Point_d format. This conversion is useful by several executables from the SBL – see e.g. the package Landscape_explorer .