Structural Bioinformatics Library
Template C++ / Python API for developping structural bioinformatics applications.

Authors: F. Cazals and T. Dreyfus and D. Mazauric
Optimal transportation problems have a long standing history in mathematics and computer science, originating with the works of Monge on earth moving ( << la théorie des déblais et des remblais >> ) [118] . Such problems were later rephrased in terms of Riemannian geometry and measure theory [148] , one key concept being the distance between two distributions, namely the minimal amount of work that must be performed to transform one distribution into the other by moving distribution mass around.
Optimal transportation based metrics are now widely used, with applications in image processing [135] , statistics [104] , geometry processing [141] , biophysics [36] ...
In this context, this package provides two algorithms:
The classical i.e. linear program based Earth Mover Distance (EMD),
We consider two sets of nodes, namely the source nodes , and the demand nodes respectively. We assume that these nodes are embedded in some metric space , and denote a metric in this space. We assume that each source node has a mass of supply (aka weights) denoted , and equivalently, that each demand node has a mass of demand . The sums of masses for all source and demand nodes are denoted and , that is and . We also assume that transporting a unit mass of supply from to demand node has unit cost .
A transport plan is a mathematical program aiming at satisfying the demand of the demand nodes given the supply of the supply nodes, while minimizing the overall transportation cost. More formally, a transport plan from to is defined by triples , with the socalled flow quantities. Note that there are at most such triples.
Finding the optimal i.e. least cost transport plan amounts to solving the following linear program (LP):
The first equation is the linear functional to be minimized. Constraints (1,2) respectively state that a demand node does not receive more than its demand , and that a source node does not export more than it supply. Constraint (4) states that the total flow circulating is the minimum mass of supply or demand.
Based on this linear program, we introduce the total number of edges, the total flow, the total cost, and their ratio, known as the earth mover distance [135] :
The basic problem.
Consider now the case where the source and demand nodes are the vertices of two graphs and . The solution provided by the EMD is oblivious to the connectivity of these two graphs. In selected applications, one may wish to preserve connectivity based properties. Typically, one would like a vertex or an edge of to export flow to a connected component of . In the sequel, we specify a new problem such that such constraints are respected–which requires using more stringent transport plans.
More formally, to see whether a transport plan is valid, pick any connected subgraph from . Let be the subgraph of induced by the vertices from such that for each such vertex , there is at least one edge emanating from a vertex of the subgraph with . The transport plan is called valid iff the subgraph is also connected.
We summarize this discussion with the following definition:
The following remarks are in order:
A transport plan respecting connectivity constraints may not fully satisfy the demand.
The problem solved: .
An alternative to the illposed problem of minimizing the transport cost while conserving connectivity constraints is the socalled maximumflow under cost and connectivity constraints problem ( ). In this problem one aims at computing the largest amount of flow that can be supported respecting the connectivity constraints and such that the total cost is less than a given bound . We define the following upper bound for the maximum total cost of any admissible flow as
Consider an upper bound , and an upper bound on the number of edges which carry non null flow. Denote and the set of all connected subgraphs of and , respectively. Problem is defined as follows:
Constraints (1,2,3) are the usual constraints found in EMD. Constraint (4) takes into account the upper bound on the transport cost, while constraint (5) bounds the number of edges used. Finally, constraint (6) encodes the connectivity constraints.
Algorithms.
Once the weights of all vertices, solving the linear program of Eq. (emdeqemdlp) has polynomial complexity [92] . Practically, various solvers can be used, e.g. the one from the Computational Geometry Algorithms Library [40] , lp_solve, the CPLEX solver from IBM, etc. The algorithm used in this package is lp_solve, which take as input the MPS file format.
In the following, the algorithm solving the linear program of Eq. (emdeqemdlp) is called .
Algorithms.
Finding transport plans respecting connectivity constraints are hard combinatorial problems. The following results are proved in [30] :
the decision version of is NPcomplete,
Practically, we solve with two algorithms denoted and . These algorithms are implemented within the class SBL::CADS::T_Earth_mover_distance_with_connectivity_constraints , and can be called by specifying a recursion mode:
Recursion mode norec: algorithm . A greedy strategy returning an admissible solution, given a maximum cost provided by the user.
Recursion mode refined: algorithm . Given a range of costs, this interval is iteratively halved, and the upper bound of each interval is used to compute a solution with this upper bound as constraint.
For a given solution, in a manner analogous to Eq. (emdeqemdlpdist), we define the total number of edges, the total flow, the total cost, and their ratio:
Vertices information.
The generic paradigm followed in the SBL implies that any kind of data structure could be used as input of the EMD algorithms. In particular, it could be two ensembles of geometric points, or two graphs where vertices are embedded in a metric space. For this reason, the algorithms are templated by a concept VerticesAccessor defining how to access the data used in the algorithms. (NB: an accessor allows iterating over the vertices and their neighbors, if any, and also allows access to the mass of vertices.) Two models of VerticesAccessor are predefined:
EMD algorithms.
Each algorithm is implemented in a separate class as a functor taking as argument the source and the demand, and returning the output transportation plan:
Connectivity checker.
In order to compare both algorithms described in sections Earth Mover Distance and Earth Mover Distance with connectivity constraints , the class SBL::CADS::T_Earth_mover_distance_connectivity_constraints_checker< EarthMoverDistance > computes the number of edges in a transportation plan that do not respect the connectivity of the demand. The template parameter EarthMoverDistance is one of the EMD algorithms class. Note that for a transportation plan computed from the class SBL::CADS::T_Earth_mover_distance_with_connectivity_constraints, the connectivity must be respected.
Module.
This package provides a module, see Modules, to be integrated into a workflow : SBL::CADS::T_Earth_mover_distance_module< ModuleTraits > . The template parameter ModuleTraits should define two types:
This module allows adding options to select the algorithm used, to run symmetric instances of the problems (source and demand are reversed), etc.
In addition to the log file containing statistics on the cost and the flow, two output files are reported :
The transportation plan, as a Graphviz file, suffixed by __transportation_plan.dot : this file can be processed by Graphviz to produce a visual representation of the transportation plan as a bipartite graph: the vertices shown are the source and demand nodes; for each edge, the flow and cost are represented.
A serialized XML archive suffixed by __emd_engine.xml : the archive contains in particular the serialization of the transportation plan, to be used with the package PALSE for further analysis. In particular, for each edge of the transportation plan, the flow and associated cost are provided.
The following example computes the transportation plan of two sets of three weighted 2D points with no connectivity, using the algorithm described in section Earth Mover Distance , and then dumps statistics about the transportation plan.
The following example computes the transportation plan of two sets of three weighted 2D points with no connectivity, using the algorithm described in section Earth Mover Distance with connectivity constraints , and then dumps statistics about the transportation plan.
The following example does the same calculations as in section Earth Mover Distance, but also checks that the connectivity constraints are respected. If not, the number of edges that violate the constraints is dumped.
The following example shows a simple instantiation of the EMD module within a workflow.
Executables in this package are named sblemdOBJCOST.exe, with
Objects. The rationale for handling these three types of objects is as follows:
Costs. A function assigning a positive value to two points/graph vertices. Recall that is this function is a metric (satisfies the triangle inequality), then EMD is also a metric.
Examples.
Example 1: Consider ddimensional points in the Euclidean space, for which one uses the Euclidean distance. We provide the three programs, detailed below:
Example 2: Assume now that the ddimensional points represent molecular conformations, for which one uses the least Root Mean Square Deviation as a distance. This package provides
Example 3: Assume now that the objects compared are sequences of nucleotides (nucleic acids) or aminoacids (polypeptide chains).
Main options – comparing weighted points.
points1: Text file listing the first Ddimensional points weights1: Text file listing the first weights
The main options are:
Main options – comparing graphs or transition graphs. The following options, each called twice i.e. once for each graph, are used to passe the input:
Main options for EMD with connectivity constraints. The main options are:
See the following jupyter notebook:
The main options of the compareSamples method in the next cell are:
import re #regular expressions
import sys #misc system
import os
import pdb
import shutil # python 3 only
def compareSamples(points1, weights1, points2, weights2):
odir = "tmpresultssamples"
if os.path.exists(odir):
os.system("rm rf %s" % odir)
os.system( ("mkdir %s" % odir) )
# check executable exists and is visible
exe = shutil.which("sblemdwpeuclid.exe")
if exe:
print(("Using executable %s\n" % exe))
cmd = "sblemdwpeuclid.exe pointsfile %s pointsfile %s v l d %s" %\
(points1, points2, odir)
if weights1 != None and weights2 != None:
cmd += " weightsfile %s weightsfile %s" % (weights1, weights2)
os.system(cmd)
cmd = "ls %s" % odir
ofnames = os.popen(cmd).readlines()
print("All output files:",ofnames)
#find the log file and display log file
cmd = "find %s name *log.txt" % odir
lines = os.popen(cmd).readlines()
if len(lines) > 0:
lfname = lines[0].rstrip()
print("Log file is:", lfname)
log = open(lfname).readlines()
for line in log: print(line.rstrip())
else:
print("Executable not found")
print("Marker : Calculation Started")
compareSamples("data/bln69_minima_1.txt","data/bln69_energies_1.txt",\
"data/bln69_minima_2.txt","data/bln69_energies_2.txt")
print("Marker : Calculation Ended")
The main options of the compareGraphs method in the next cell are:
import re #regular expressions
import sys #misc system
import os
import pdb
import shutil # python 3 only
def compareGraphs(points1, weights1, edges1, points2, weights2, edges2,\
constraints = False, algorithm = "reg", edgeSelection = "mincost",\
recursion = "refined"):
odir = "tmpresultsgraph"
if os.path.exists(odir):
os.system("rm rf %s" % odir)
os.system( ("mkdir %s" % odir) )
# check executable exists and is visible
exe = shutil.which("sblemdgrapheuclid.exe")
if exe:
print(("Using executable %s\n" % exe))
cmd = "sblemdgrapheuclid.exe vertexpointsfile %s vertexpointsfile %s v l d %s" %\
(points1, points2, odir)
if weights1 != None and weights2 != None:
cmd += " vertexweightsfile %s vertexweightsfile %s" % (weights1, weights2)
if edges1 != None and edges2 != None:
cmd += " edgesfile %s edgesfile %s" % (edges1, edges2)
if constraints:
cmd += " algorithm \"%s\" edgeselection \"%s\" recursionmode \"%s\"" %\
(algorithm, edgeSelection, recursion)
os.system(cmd)
cmd = "ls %s" % odir
ofnames = os.popen(cmd).readlines()
print("All output files:",ofnames)
#find the log file and display log file
cmd = "find %s name *log.txt" % odir
lines = os.popen(cmd).readlines()
if len(lines) > 0:
lfname = lines[0].rstrip()
print("Log file is:", lfname)
log = open(lfname).readlines()
for line in log: print(line.rstrip())
else:
print("Executable not found")
print("Marker : Calculation Started")
compareGraphs("data/bln69_minima_1.txt","data/bln69_energies_1.txt", "data/bln69_edges_1.txt",\
"data/bln69_minima_2.txt","data/bln69_energies_2.txt", "data/bln69_edges_2.txt", True)
print("Marker : Calculation Ended")