Structural Bioinformatics Library
Template C++ / Python API for developping structural bioinformatics applications.
User Manual

Authors: D. Agarwal and C. Caillouet and F. Cazals and D. Coudert

Connectivity_inference

Introduction

Motivations

Connectivity inference (CI). Consider a macro-molecular assembly consisting of subunits (typically proteins or nucleic acids). Assume that these subunits are known, but that the pairwise contacts between them are not. Connectivity Inference (CI) aims at inferring these unknown contacts (Fig fig-lego-example).

To address CI, let an oligomer be a small assembly dissociated from the intact assembly. An oligomer formula is the list of subunits involved in an oligomer. That is, an oligomer formula is the description of the composition of the oligomer, giving the number of instances of each molecule. We define a connectivity inference specification (specification for short) as a list of oligomers.

Oligomers typically come from native mass spectrometry experiments on a solution containing the assembly of interest, under various experimental conditions [167]. In particular, stringent conditions (e.g. low pH) result in complete dissociation of the assembly, so that the individual molecules are identified, while less stringent conditions result in the disruption of the assembly into multiple overlapping complexes.

Additionally, one may use an interaction likelihood qualifying the plausibility of the interaction of two subunits. On the experimental side, various assays have been developed to check whether two proteins interact, including yeast-two-hybrid, mammalian protein-protein interaction trap, luminescence-based mammalian interactome, yellow fluorescent protein complementation assay, etc. [30]. On the in silico side, various interactions attributes can be used, such as gene expressions patterns (proteins with identical patterns are more likely to interact), domain interaction data (a known interaction between two domains hints at an interaction between proteins containing these domains), common neighbors in protein-protein interaction networks, or bibliographical data (number of publications providing evidence for a particular interaction).

We view a CI problem as a problem on an unknown graph $G$. The nodes of this graph are the subunits of the assembly. The edges of the graph correspond to the pairwise contacts between these subunits. (NB: edges and contacts are used interchangeably in the sequel.) An oligomer is the vertex set of a connected subgraph of $G$. Thus, in solving a CI problem, we know the vertices of the graph $G$, and wish to discover its edges. Note that this is a non trivial task, since in particular, the number of interfaces between the subunits in the assembly–or equivalently the number of edges, is unknown.

This application provides algorithms solving CI problems, in the following sense:

Consider a set of contacts $E$ produced by an algorithm solving CI. This set of contacts (i.e. edges) is called valid provided that for each oligomer and also for the whole complex, restricting the edges to the vertices of an oligomer formula yields a connected graph. A valid edge set is also called a solution.


If an oligomer is a dimer, then, one edge in the solution set must connect its two subunits.


An edge in a solution must connect to subnits found in at least one oligomer.


Connectivity Inference: Unweighted and Weighted Versions

We actually provide two ways to solve CI problems:

  • In the unweighted version, the input only consists of a list of oligomers, and possibly labels to enforce/forbid contacts.
  • In the weighted version, one is also given weights to favor/unfavor specific contacts.

Unweighted case and MCI problems. In this setting, the input consists of the list of subunits, of a list of oligomers, and possibly labels to enforce/forbid contacts. The output consists of a list of solutions. Again, note that in general, the solution is not unique.

We wish to find solutions without speculating on the unknown number of contacts in the assembly. We therefore define the Minimum Connectivity Inference problem (MCI) as a variant of CI such that valid edge sets of minimum cardinality are sought. This cardinality is denoted $k$ in the sequel. The following comments are in order:

  • At the oligomer level, the minimum number of edges sought is the number of subunits in the oligomer minus one – the edges define a tree connecting the individual molecules. In solving a MCI problem, we actually wish to combine solutions for all oligomers, so as to obtain the smallest size $k$ compatible with all the oligomers–a difficult task.
  • Solving a MCI problem may sound artificial since in practice, the assembly studied may contain more edges than the aforementioned minimum size $k$. In fact, MCI is merely a well posed mathematical problem used to find solutions. At the end of the day, the user is indeed returned plausible edges obtained by combining all optimal solutions of size $k$, as we shall see below.

Weighted case and MWCI problems. Assume, as discussed above, that one also has an interaction likelihood, namely a number in the range $[0,1]$, qualifying the plausibility of a contact between two subunits. In that case, one wishes to find solutions taking into account the number of edges, but also their likelihood. We do so using a mixed criterion striking a balanced between these criteria. We call the resulting optimization problem Minimum Weight Connectivity Inference (MWCI), see below. The balance between the relative importance of the number of contacts and that of the weights is governed by a user defined parameter $\alpha\in [0,1]$. As we shall see later, in using $\alpha=1$, weights play no role.

As seen from the previous discussion, MWCI problems generalize MCI problems. For this reason, we use the following terms intercheangeable: MCI problems, or MWCI problems with $\alpha=1$.


MCI and MWCI Toy Examples
Top row: the unweighted case. One wishes to find contacts between subunits given the composition, in terms of subunits, of three oligomers. This task is carried out in two steps: first, optimal solutions i.e. solutions using as few edges as possible are sought; second, the edges used in all these solutions are collected and analyzed.

Note that in addition to the oligomers, the input may also involve labels to enforce/forbid contacts.

Bottom row: the weighted case In addition to oligomers, one may specify weights to favor/unfavor specific contacts. A value $\alpha \in [0,1]$ is then needed to balance the relative importance of the number of edges and their weight.

Getting Started: the Lego Example

We illustrate the input and output of a CI problem on our simple lego problem (Fig. fig-lego-example). The reader is referred to the subsequent section for further details.

The CI problem is specified by the list of subunits and oligomers:

  • lego-example.txt : specification of the $\mci$ problem.
    # A toy example Tetris blocks
    #Number of block types 5
    #Number of instances per block type - 1
    BEGIN-TYPES
    orange purple green cyan gray
    END-TYPES
    # List of oligomers
    BEGIN-OLIGOMERS
    orange purple green
    gray purple green
    orange purple cyan
    END-OLIGOMERS
    # Labels on edges: {F: forbidden, E: enforced, P: possible (default)}
    BEGIN-CONTACTS-LABELS
    purple cyan F
    END-CONTACTS-LABELS
    # Weights on edges
    BEGIN-CONTACTS-WEIGHTS
    orange green 0.9
    gray purple 0.1
    END-CONTACTS-WEIGHTS

Upon running the following command:

> sbl-mwci.py -f ../demos/data/lego-example.txt -c

One gets the list of two optimal solutions:

  • sbl-mwci__f_lego-example__alpha_1.0__all_solutions.txt : Output returned by the program $\text{\mwciE}$:
    Total number of solutions: 2
    Optimal solution 1 : (cyan,orange) (gray,green) (green,purple) (orange,purple)
    Contacts per optimal solution : 4
    Cost of optimal solution : 4
    Running time in seconds: 0.027874
    Optimal solution 2 : (cyan,orange) (gray,purple) (green,purple) (orange,purple)
    Contacts per optimal solution : 4
    Cost of optimal solution : 4
    Running time in seconds: 0.002767

Solving MCI Problems: example of the Yeast Exosome

In this section, we formally define the MCI problem, and show how to process a real example, the yeast exosome (Fig. fig-yeast-exosome).

Pre-requisites

Candidate edges and labeled edges

Consider a set of oligomers $\calO$. Solving the MCI problem requires a set of candidate edges to choose from, called the pool, and denoted $\poolEAll$. We provide two options to define it.

To specify the first option (which is the default one), consider a given oligomer $o$ consisting of say $m$ subunits. This oligomer defines a set of $\binom{m}{2}$ edges corresponding to all possible contacts between subunits. Thus, given a collection of oligomers, the default pool of edges is the set of all edges defined by all oligomers.

The second option consists of annotating the edges from the default option: selected edges are forbidden, selected edges are enforced, the remaining ones are possible. This specification is achieved by means of one label per edge:

  1. F: forbidden.

  2. E: enforced (NB: this is equivalent to specifying a dimer containing these two subunits).

  3. P: possible (default)

While enforcing edges is natural when two subunits are known to interact, the possibility to rule out edges is also frequently used. One common situation is the connectivity inference for an assembly for which a cryo-EM map is known, and for which two subunits can be located in the map without ambiguity. If these subunits are located far apart, then, the corresponding edge should be forbidden.

The specification of labels also makes it possible to specify the possible neighbors of a given subunit by ruling out selected edges involving that subunit.

If too many edges are tagged as forbidden, the connectivity problem may not have a solution.


Optimal Solutions, Consensus Solutions, and Consensus Edges

About MCI. As should be clear from the previous discussion, MCI or MWCI are not the problems solved per se. These are two well posed problems, for which all optimal solutions are enumerated. The relevant information stems from edges (contacts) collected across all such solutions.


As noticed above, a MCI problem does not admit, in general, a unique solution. Therefore, our strategy consists of enumerating all optimal solutions, from which various pieces of information are returned to the user. We define:

Consider a solution set $\solSet$, i.e. a set of optimal solutions for a MCI problem. Given such a set, we define:
– the score of a contact is the number of solutions from $\solSet$ containing it,
– the top scoring contacts are those achieving the maximum score,
– the score of a solution is the sum of scores of its contacts,
– the consensus solutions are the highest scoring solutions in $\solSet$,
– the consensus edges are the edges used in consensus solutions of $\solSet$.


Solutions and their Assessment

Sensitivity, specificity, coverage. Connectivity inference problems are generally solved when contacts are sought. On the other hand, reference contacts may have been obtained from biophysical and biochemical experiments such as x-ray crystallography, tandem mass spectrometry or cross linking experiments. We denote $\contactsRef$ such a set of contacts. In the sequel, we report methods to compare the contacts reported by our algorithms against these reference contacts.

The reference set $\contactsRef$ together with the pool $\poolEAll$ define positive ( $P$), negative ( $N$), and missed contacts ( $M$). From these groups, one further classifies the edges of a predicted solution in set $\solSet$ into four categories, namely true positive ( $TP$), false positive ( $FP$), true negative ( $TN$), and false negative ( $FN$).

Positives ( $P$) and negatives ( $N$) decompose as $P = TP+FN$, and $N=TN+FP$, from which one defines the sensitivity and the specificity as follows:


$Sensitivity = \frac{\size{TP}}{\size{P}}$, $Specificity = \frac{\size{TN}}{\size{N}}$

Note that specificity requires the set $N$ to be non empty, which may not be the case if $\poolEAll$ $\subset$ $\contactsRef$.

We also combine the previous values to define the following coverage score, which favors true positives, penalizes false positives and false negatives, and scales the results with respect to the total number of reference contacts (since $P$ might be included into $\contactsRef$ if the pool size is too small):

$Cov(\solSet) = \frac{\size{TP} - (\size{FP}+\size{FN})}{\size{\contactsRef}}$

Note that the maximum value is one, and that the coverage score may be negative.

Signed Contact score. Given a set of reference contacts, the signed contact score is the contact score multiplied by $\pm 1$ depending on whether this contact is a true or false positive w.r.t $E_{Ref}$.

The Yeast Exosome
The yeast exosome, an assembly consisting of 10 subunits. The Connectivity Inference problem consists of inferring contacts between the subunits from the composition of oligomers, i.e. connected blocks of the assembly.

(Left) Crystal structure (Right) The solid edges reported by the program $\text{\mwciE}$, while the dashed edges are not present in the solution.

Input: Specifications and File Types

The options offered by the program $\text{\mwciE}$ can be obtained by running $\text{\mwciE\ --help}$. The most important options are the following ones:

The main options of the program $\text{\mwciE}$ are:
-f string: File containing the MCI or MWCI specification
-c: Compute solutions
-a: Analyze or Assess (if reference contacts are known) solutions
–alpha float: Alpha value in [0,1] for the MWCI criterion (NB: default is alpha=1, with focus on the number of edges; alpha=0 yields a focus on the weights of the selected edges)
-u: Ignore the weights (solve MCI). Note that this is also achieved in using alpha=1



To solve MCI and analyze the resulting solutions, one runs:

> sbl-mwci.py -f demos/data/yeast-exosome-without-Csl4-mwci-spec.txt -c -a 

The specification of the yeast exosome, with selected labeled contacts, is accessible from the specification file in Table table-mwci-without-ref-input-files .

File Name

Description

MWCI Specification File

Does not contain the reference contacts

Input files for the first run described in section Input: Specifications and File Types .

If in addition one wishes to compare the solutions returned against a set of reference contacts, in order to compute signed contact scores instead of scores, the specification file must contains these reference contacts – see Table table-mwci-with-ref-input-files :

> sbl-mwci.py -f demos/data/yeast-exosome-without-Csl4-mwci-spec-with-ref.txt -c -a

Note that computation of solutions were already done, so that they are skipped by automatically loading them from an output serialized archive (json file format).

File Name

Description

MWCI Specification File

Contains the reference contacts

Input files for the second run described in section Input: Specifications and File Types .

Output: Specifications and File Types

Contacts and the Associated Statistics

For the first command line above, one gets files containing:

PreviewFile Name

Description

Solutions File (txt)

File summaring the solutions

Solutions File (json)

Serialized archive of solutions

Consensus Solutions File

File summaring the consensus solutions

Contact Scores Plot

Distribution of contacts' scores, see Fig. fig-contact-scores-yeast-exosome-without-Csl4

Consensus Contact Scores Plot Distribution of consensus contacts' scores, see Fig. fig-contact-scores-yeast-exosome-without-Csl4
Output files for the first run described in section Input: Specifications and File Types.

Contacts and Their Assessment

With respect to the previous call, the following new files are generated:

PreviewFile Name Description
Contacts Assessments

Assessment of contacts (sensitivity, specificity and coverage score)

Consensus Contacts Assessments

Assessment of consensus contacts (sensitivity, specificity and coverage score)

Contact Signed Scores Plot

Distribution of contacts' signed scores, see Fig. fig-signed-contact-scores-yeast-exosome-without-Csl4

Consensus Contact Signed Scores Plot Distribution of consensus contacts' signed scores, see Fig. fig-signed-contact-scores-yeast-exosome-without-Csl4
Output files for the second run described in section Input: Specifications and File Types.

Contact scores for Yeast Exosome (w/o Csl4)
A. Contact scores in the set of solutions B. Scores for consensus contacts

Signed scores of contacts for Yeast Exosome (without Csl4) with reference set of contacts
A. Signed contact scores in the set of solutions B. Signed scores for consensus contacts

Solving MWCI Problems: example of the Yeast Exosome

If weights are available, tuning $\alpha$ in Eq. eq-criterion allows one to strike a balance between the number of edges used versus the sum of weights of these edges.


Pre-requisites

Assume that each candidate edge $e$ has a confidence $w(e)\in [0,1]$, encoding its likelihood. To balance the relative importance of edge counts and edge weights, we use a real number $\alpha \in [0,1]$. Using this number, the contribution of edge $e$ in a solution becomes (see details in section Algorithms and Methods):

$ \text{Contribution of edge } e: \frac{\alpha+1}{2} - (1-\alpha) w(e) $

Thus, in solving an MWCI problem, we wish to find solutions minimizing the sum of such weights for all edges of the solution. All edges from the pool may carry a weight. Depending on which weights are specified, we proceed as follows:

  1. Edges with a weight specified: that weight is used

  2. Edge with no weight specification: a default value of 1/2 is used.

The following remarks are in order regarding labels versus weights:

  • To forbid an edge, one must use a label. As can be seen from Eq. eq-contrib-edge, this differs from using an arbitrarily small weights $w(e)$. From the same equation, it should also be noticed that setting $\alpha=0$ shifts the focus on weights–as opposed to contact counts.
  • To disfavor an edge, one uses a small weight. Note however, that using tiny weights for all edges will result in solution using these edges — even though they are considered as unlikely.
  • To favor an edge, one uses a large weight, the maximum being one. However, such an edge may not belong to any solution–which typically occurs if alternative edges are also favored.

Input: Specifications and File Types

Below, we show how to process the yeast exosome, with weights placed on selected edges – see Table table-mwci-weighted-input-files .

> sbl-mwci.py -f demos/data/yeast-exosome-without-Csl4-mwci-spec-with-ref-with-weights.txt -a -c --alpha 0.25

File Name

Description

MWCI Specification File

With weights on selected contacts and reference contacts

Input files for the first run described in section Input: Specifications and File Types .

Output: Specifications and File Types

The output files of the run described in section Input: Specifications and File Types are listed in Table table-mwci-weighted-with-ref-output-files .

PreviewFile Name

Description

Solutions File (txt)

File summaring the solutions

Solutions File (json)

Serialized archive of solutions

Consensus Solutions File

File summaring the consensus solutions

Contacts Assessments

Assessment of contacts (sensitivity, specificity and coverage score)

Consensus Contacts Assessments

Assessment of consensus contacts (sensitivity, specificity and coverage score)

Contact Scores Plot

Distribution of signed contacts' scores

Consensus Contact Scores Plot Distribution of signed consensus contacts' scores
Output files for the run described in section Input: Specifications and File Types.

Solving MWCI in the Weighted Case with the Bootstrap Method

Pre-requisites

The set of consensus contacts computed on solving the MWCI is a backbone of the protein assembly. It has ben observed [2] that the consensus contacts are small in number but are characterized by high specificity which is of great interest from strucutral point of view. We therefore seek to extend the set of initial set of consensus contacts using a bootstrapping procedure which iteratively forbids one consensus contact, so a to trigger a rewiring of the solutions:

  • Solve MWCI with a given value of $\alpha$ to get initial set of consensus contacts.
  • Barring the dimers in the consensus contacts, each consensus contact is precluded (or tagged as "Forbidden") one at a time and MWCI is solved for the same list of oligomers.
  • The consensus contacts associated with the consensus solutions of this new MWCI problem are computed and added to the initial set of contacts.
  • After precluding all consensus contacts from the initial set of consensus contacts and computing the union of consensus contacts at each stage, the final set of consensus contacts is reported.
  • Note that only initial consensus contacts are precluded in the bootstrapping procedure.

Input: Specifications and File Types

The command line to activate the bootstrapping procedure includes "-b" option as shown below:

> sbl-mwci.py -f demos/data/yeast-exosome-without-Csl4-mwci-spec-with-ref-with-weights.txt -s demos/Connectivity_inference/results/res-yeast-exosome-without-Csl4-mwci-spec-with-ref-with-weights-alpha-0.25.json -a -b --alpha 0.25
More than one consensus contact can be precluded at a time. This might increase in sensitivity but with a risk of lowering of specificity.


Note that by providing the json file containing the solutions, solutions are not recomputed, so that the focus is on the analysis of solutions and on the bootstrap procedure.

File Name

Description

MWCI Specification File

With weights on selected contacts and reference contacts

Solutions File (json)

Computed in the example of section Output: Specifications and File Types

Input files for the run described in section Input: Specifications and File Types .

Output: Specifications and File Types

There are three additional output file for this run, listed in Table table-mwci-bootstrap-output-files .

PreviewFile Name

Description

Set of Contacts from Bootstrap Set of contacts obtained from the bootstrap procedure
Bootstrap Assessment File File summaring the bootstrap statistics
Bootstrap Contacts Distribution File Histogram of distribution of contacts from the bootstrap procedure
Additional output files for the run described in section Input: Specifications and File Types .

Algorithms and Methods

Connectivity Inference: Weighted and Unweighted Cases

As shown in [2] , the weighted and unweighted cases can be handled at once by solving an optimization problem involving the following criterion $C$, with $\alpha$ a real number in $[0,1]$ and $w(e)$ the weight of edge $e$:

$ C = \alpha \sum_{e\in S} 1 + (1-\alpha) \sum_{e\in S} (1/2-w(e)) $

which rewrites as

$ C = \sum_{e\in S} \big( \frac{\alpha+1}{2} - (1-\alpha) w(e) \bigr) $

Enumerating the solutions of the optimization problem with criterion $C$, yields the consensus solutions of Def. def-scores-solutions.

Note that the previous equation reads as follows:

  • The first term weights the number of edges, i.e. the connectivity,
  • The second term weights the confidence on edges, edges with a probability higher than 0.5 being favored,
  • Adjusting the level of $\alpha$ strikes a balance between both terms.

Algorithms

To enumerate the solutions of the previous combinatorial problem, a Mixed Integer Linear Programming (MILP) exact formulation is resorted to. See [1] and [2] for further details on these algorithms.

The reader is also referred to [2] for a detailed presentation of the bootstrap method.

External Dependencies

This code is to be used with the SAGE open source mathematics software system (version 5.10 or higher). See also the book [164].

Solver

There are several options for the solver used:

  • GLPK : The default solver used, which is especially convenient since it is found in the SAGE distribution.
  • IBM Cplex : The CPLEX solver developed by IBM, free for academia. Possibly the fastest solution, yet, the installation requires some work. Installation details for SAGE can be found here.

Create symbolic link from appropriate directories in SAGE installed directory to CPLEX files

from ${SAGE_ROOT}/local/lib/ to the file "libcplex.a" in CPLEX lib directory,

> ln -s /path/to/lib/libcplex.a .

from ${SAGE_ROOT}/local/include/ to the file "cplex.h" in CPLEX include directory, type:

> ln -s /path/to/include/cplex.h .

from ${SAGE_ROOT}/local/include/ to the file "cpxconst.h" in CPLEX include directory (if it exists), type:

ln -s /path/to/include/cpxconst.h .