Structural Bioinformatics Library
Template C++ / Python API for developping structural bioinformatics applications.

Authors: S. Marillet and P. Boudinot and F. Cazals
This package provides tools to (i) select binding affinity prediction models from atomic coordinates, typically crystal structures of the partners and/or of the complex, and to (ii) exploit such models to perform affinity predictions.
At the selection stage, given a set of predefined variables coding structural and biochemical properties of the partners and of the complex, the approach consists of building sparse regression models which are subsequently evaluated and ranked using repeated crossvalidation [102] .
At the exploitation stage, the variables selected are used to build a predictive model on a training dataset, and predict the affinity on a test dataset.
More precisely, the approach runs through 3 steps:
Consider two species and forming a complex . The dissociation is defined by , namely the ratio between the concentration of individual partners and bound partners.
The strength of this association is quantified by the dissociation free energy , which, in the standard state, satisfies with Boltzmann's constant and the temperature:
This equation shows that has two components respectively coding the enthalpic ( ), and entropic ( ) changes upon binding. As stated above, we wish to estimate these quantities from atomic coordinates, typically crystal structures of the partners and/or the complex.
Parameters
The parameters used in this package are based on the following key geometric constructions:
Using the previous notions, we define four categories of atoms of a complex:
Biophysical rationale.
Before defining our parameters, we raise several simple observations underlying their design:
Using the quantities defined in section General prerequisites, we define the following variables:
Interface terms.
Packing terms.
Consider the volume variation of , see Eq. (eqatomvolumevariation). As recalled above, packing properties are important in several respects. Adding up volume variations yields the following four sum of volumes differences (VD) parameters:
Solvent interactions and electrostatics.
The interaction between a protein molecule and water molecules is complex. In particular, the exposition to the solvent of nonpolar groups hinders the ability of water molecules to engage into hydrogen bonding, yielding an entropic loss for such water molecules. We define the following terms:
Parameters used to estimate binding affinities.
There are three groups of parameters (see the next equations for their definition):
The acronyms read as follows (see text for details):
For variables describing a change between the bound and unbound forms of the partners e.g. or , it is necessary to match the atoms of the bound partners to those of the unbound partners. For this, the aminoacid sequences of the partners are extracted from the PDB files and aligned. Then, the atoms of every pair of corresponding residues are matched using their name.
Since missing residues and missing atoms are common in crystal structures, the proportion of matched atoms is computed as to investigate potential wrong matchings, and discard entries for which the alignment is not good enough.
In this section we explain how to compute the variables presented in section Associated variables .
This step consists of two substeps:
The input of the preprocessing step consists in:
An XML file complying with the following format:
Step 1.
The first step of the workflow is performed by the script which is called as follows:
sblbapstep1runapplications.py i data/example_data.xml s data/eisenberg_solvation_parameters.txt p data/structures n 1 d results
File Name  Description 
Plain text annotations file  Eisenberg Solvation Parameters 
XML file  PDB ids and chains of complexes and unbound partners 
PDB file  Example PDB file in the input directory of option p 
Step 2.
The second step of the workflow is performed by the script which is called as follows:
sblbapstep2compilemoleculardata.py f data/example_data.xml d results o results/affinity_dataset.xml
Step 1.
Practically, this first step consists in running (through SBL::PDB_complex_to_unbound_partners_matchings::PDB_complex_to_unbound_partners_matcher ), and on various entries of the dataset. For each entry, a directory named after the following scheme is created in the current working directory: <PDB_ID>_<chains_A>_<chains_B>
Each directory is structured as follows:
1A2K_AB_C  1OUN_A_A_atom_matchings.txt  1OUN_A_A_residue_matchings.txt  1OUN_AB   sblvorlume_eisenberg_solvation_parameters__log.txt   sblvorlume_eisenberg_solvation_parameters__surface_volumes.xml  1OUN_B_A_atom_matchings.txt  1OUN_B_A_residue_matchings.txt  1QG4_A   sblvorlume_eisenberg_solvation_parameters__log.txt   sblvorlume_eisenberg_solvation_parameters__surface_volumes.xml  1QG4_C_A_atom_matchings.txt  1QG4_C_A_residue_matchings.txt  sblvorlume_eisenberg_solvation_parameters__log.txt  sblvorlume_eisenberg_solvation_parameters__surface_volumes.xml  sblvorshellbpABWatomic__AB_ball_shelling_forest.dot  sblvorshellbpABWatomic__AB_ball_shelling_forest.xml  sblvorshellbpABWatomic__AB_packing.xml  sblvorshellbpABWatomic__BA_ball_shelling_forest.dot  sblvorshellbpABWatomic__BA_ball_shelling_forest.xml  sblvorshellbpABWatomic__BA_packing.xml  sblvorshellbpABWatomic__interface_AB.pdb  sblvorshellbpABWatomic__interface_BA.pdb  sblvorshellbpABWatomic__log.txt  sblvorshellbpABWatomic__surface_volumes.xml
This example is for entry 1A2K with partners AB and C. All *_matchings.txt
files contain the matchings between atoms or residues of the bound and unbound structures. Directories 1OUN_AB
and 1QG4_A
are for the unbound partners. All sbl*
files contain the results of the SBL executables. If verbose output is toggled (v
flag), various information will be output on the standard and error output.
Step 2.
Preview  File Name  Description 
XML archive  Compiled results of runs of step 1 
In this section, we explain how to select a binding affinity model maximizing a performance criterion, using the script . Two points worth noticing are:
In the sequel, we explain how to predict of complexes from a dataset , using the methodology summarized on Fig figbapworkflow . Estimation is performed on a per complex basis, from which performances at the whole dataset level are derived.
Our predictions rely on three related concepts defined precisely hereafter (see also Fig figbapworkflow):
Templates.
Denote the pool of variables. Let a template be a set of variables, i.e. a subset of . To define parsimonious templates from the set , we generate subsets of involving up to at most variables. This defines a pool of templates .
Crossvalidation.
In the following a model is associated to both a template and a dataset . More precisely, a model refers to a regression model, i.e. the variables of the template plus the associated coefficients.
Practically, models are defined during fold crossvalidation, and a number of of repetitions (see Fig figbapworkflow). Consider one repetition, which thus consists of splitting at random into subsets called folds. For one fold, a regression model associated with is trained on (k1)/k of the dataset , and predictions are run on the remaining 1/k of complexes. Processing the folds yields one repetition of the cross validation procedure, resulting in one prediction for the of each complex. The set of all predictions in one repeat, say the th one, is denoted
Note again that these predictions stem from regression models associated with , namely one per fold.
Statistics per template.
Considering one crossvalidation repetition , we define the correlation as the correlation between the experimental values and the predictions . An overall assessment of the template using the repetitions is obtained by the following median of correlations
For a complex , we define the binding affinity prediction as the median across repetitions i.e.
The median prediction error is defined by
and the median absolute prediction error by:
Using this latter value, we define the prediction ratio as the percentage of cases such that the dissociation free energy is off by a specified amount :
In particular, setting to 1.4, 2.8 and 4.2 kcal/mol in the previous equation yields cases whose is approximated within one, two and three orders of magnitude respectively.
Finally, a permutation test yields a pvalue for each predictive model [119] .
In a nutshell, the rationale consists of generating randomized datasets by shuffling their values. Then, one computes the performance criterion for each such dataset, from which the pvalue is inferred (see Algorithm~algpval).
Model selection.
Define the best predictive model as the one maximizing performance criterion. We wish to single out the best predictive models, i.e. those that cannot be statistically distinguished from the best predictive model. More precisely, consider a pool of predictive models , out of which we wish to identify a subset of predictive models whose distribution cannot be distinguished from that of the best predictive model. To this end, let H0 be the null hypothesis stating that two predictive models yield identical performance distributions. We decompose the predictive models as such that :
The predictive models in are called the specific models for the dataset . The corresponding procedure is based on the KruskalWallis test, see [102] .
The pvalue threshold for this test is set to .
Running binding affinity predictions for a dataset i.e. a subset of the structure affinity benchmark: graphical outline of the statistical methodology. (Crossvalidation) Each template undergoes a number of repetitions of crossvalidation, yielding one binding affinity prediction per complex for each repetition. (Statistics) Various statistics are computed to assess the performances yielded by the predictive model associated to each template. (Model selection) Predictive models are compared, and the best ones selected. 
Model generation and selection is performed by the script which is executed as follows :
# Quick example: maximum 2variables models, 10 repetitions and pvalue permutations, # flexible complexes only (irmsd > 1.5 A), run on a single process sblbapstep3modelsanalysis.py f results/affinity_dataset.xml m 2 n 1 r 10 p 10 l 1.5 o flex_quick
# Heavier example: maximum 4variables models, 100 repetitions and # 1000 pvalue permutations, all complexes, run on three # processes. Performs knearest neighbors regression with 10 neighbors sblbapstep3modelsanalysis.py f results/affinity_dataset.xml m 4 n 3 r 100 p 1000 k 10 o all_heavier
The XML file specified by option f should comply with the following format:
The txt file specified by option s should comply with the following format:
File Name  Description 
XML archive file  Compiled data generated from 
This step outputs a minimum of seven files:
Files 1 and 2 are text files with the following structure:
In the first one, the index and the ranks are computed on the crossvalidated median absolute error of the model. In the second they are computed on the median crossvalidation correlation coefficient.
Files 3 and 4 below contain a matrix of prediction per crossvalidation repetition for the best model. Here "best" refers to the model selected by the model selection procedure described above. If multiple models are selected, each one has an associated file. These files are structured as follows:
File 5 and 6: PDF files containing scatter plots of the prediction made by the best predictive models (from files 1 and 2 respectively) for each complex against the experimental affinity value.
Preview  File Name  Description 
Plain text file  File 1  
Plain text file  File 2  
Plain text file  File 3  
Plain text file  File 4  
PDF file  Scatter plot of the prediction versus experimental affinity for the best model(s)  
PDF file  Scatter plot of the prediction versus experimental affinity for the best model(s)  
PDF file  Scatter plot of the prediction error versus interface RMSD 
Once a set of variable has been selected during the previous step, it can be used to estimate a model. Recall that a model is a function returning an estimate for binding affinity of a complex. For this, one needs a training set and a regressor, i.e. an object able to use the training set to predict the affinity of a new dataset (for instance, linear leastsquares fitting, nearest neighbors, regression trees, ...). An example of the procedure to train such a regressor on the whole Structure Affinity Benchmark and predict its own entries (that is to perform predictions without crossvalidation) is given in a demo file (sblbindingaffinitymodelexploitationexample.py).
We now describe two algorithms used by this package. The first simply computes an approximate (upper bound on the) permutation pvalue for a given statistic [119] . The pseudocode is given on Algorithm algpval .
The second algorithm is used to divide a pool of models into two groups, namely , as discussed in the section Using the programs to select affinity prediction models . The corresponding procedure is based on the KruskalWallis test, and can be found in [102] .
We describe successively the 2 steps of section Using the programs to preprocess structural data .
Step 1.
The script runs applications from the SBL on the bound and unbound partners.
The matching between atoms (section Atoms' matching) is done using the python module SBL::PDB_complex_to_unbound_partners_matchings which is a wrapper around the executable . This executable performs the matching using the class SBL::CSB::T_Alignment_engine_sequences_seqan, from the package Alignment_engines, which itself uses the Seqan library to perform the sequence alignment, and subsequently match residue and atoms based on the alignment found.
In order to compute shelling orders and packing properties, the executables and are also run on the complex, and is run on the unbound partners.
Step 2.
The script compiles the previous results and computes the variables listed in section Associated variables for each entry of the dataset.
Data structures to store the properties of entities such as atoms, residues, chains and files are grouped in the python module SBL::Protein_complex_analysis_molecular_data_structures . To fill these data structures it also provides a class used to traverse the directory structure and parse the files resulting from step 1.
The machinery used to build, evaluate and rank various models using the previously computed variables can be found in the python module SBL::Combinations_of_variables_model_evaluation::Combinations_of_variables_model_evaluation .
Once a set of variables has been selected, model fitting on a training set and prediction of a new dataset can be done using any software or language. An example script is given using python and scikitlearn in file
The seqan library [56] (BSD, Seqan) is necessary when computing the variables which encode the variations between bound and unbound partners.