Structural Bioinformatics Library Template C++ / Python API for developing structural bioinformatics applications.
User Manual

Authors: F. Cazals and T. Dreyfus

# Goals: Managing Computer Experiments

## Overall Goals

A typical computer experiment consists of rounds, including:

• Data collection. This step consists of collecting and organizing existing data, or of generating data.
• Program execution. This step consists of running a computer program, in general using various parameter sets.

The current package provides a solution to handle these two steps, using the notion of batch:

A batch specifies a coherent set of executions of a program, as specified in a so-called run specification file. These executions may or may not require input data files. In the former case, the batch stems from the association between a dataset and the run specification file.

These notions are defined below.

Statistical analysis of the generated data is covered in the PALSE package.

# Using the Batch Manager

The whole package is written using Python.

Section Pre-requisites discusses the different objects manipulated, and in particular so-called batches. Section Input: Dataset and Run Specifications specifies the way batches are created. Finally, section Output: Runs of Batches explains how to run batches.

## Pre-requisites

We define sequentially the notions required to use the batch manager:

• The run specification: specifies a particular run via so-called input file options (IFO) and non input file options (NFO)
• The run specification files: the file defining IFO and NFO, and from which the individual runs are defined.
• The batches: groups of individual runs.

### Dataset

Dataset. A dataset is a coherent collection of files, meant for processing by a given program, say P.exe in the sequel.

A dataset is identified by the directory containing it.

Database. A database is a list of datasets. There are two ways to list the entries contained within a database:

• Enumeration: one lists the individual datasets in the database.
• Regular Expression: one defines a regular expression identifying all the sub-directories of a given directory, to be included in the database.

### Run Specification

We wish to run the program P.exe on a dataset. Also assume that P.exe has a number of command-line options, complying with the following usual conventions:

• the option is composed of one dash followed by one letter, or two dashes and more than one letter,
• the argument of the option follows directly the option separated only by space characters.

Run specification tuples. A run specification tuple, also called tuple for short, is a set of options fully specifying one execution of P.exe. The tuple is represented as a list of name-value pairs: the name stands for the option, and the value for the argument. When the option does not require any argument, its value is the empty string.

For example: (f, 1vfb.pdb) stands for -f 1vfb.pdb, (number, 15) stands for --number 15, (verbose, "") stands for --verbose, etc.
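The convention can be illustrated with a minimal sketch. It is not part of the package: the helper name `tuple_to_command_line` is hypothetical, and the only rules it applies are the ones stated above (one dash for a one-letter option, two dashes otherwise, empty string for an option without argument).

```python
def tuple_to_command_line(program, run_tuple):
    """Build the argument list for one execution from (name, value) pairs.

    Hypothetical helper written for illustration; not the package's API.
    """
    args = [program]
    for name, value in run_tuple:
        # One dash for a one-letter option, two dashes for longer names.
        dash = "-" if len(name) == 1 else "--"
        args.append(dash + name)
        # An empty string means that the option takes no argument.
        if value != "":
            args.append(str(value))
    return args


run_tuple = [("f", "1vfb.pdb"), ("number", 15), ("verbose", "")]
print(" ".join(tuple_to_command_line("P.exe", run_tuple)))
```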

A run specification ensemble is a list of run specification tuples. The package provides two ways to generate such ensembles:

• Enumeration: one lists the individual run specification tuples.

### Input File Options

When the argument of an option is an input file, the option is termed an Input File Option (IFO). The number of IFO of a program determines its arity.

Specifying an IFO requires three ingredients: (i) the name of the option, (ii) a dataset containing the possible input files of the option, and (iii) a rule specifying how to generate the name-value-pairs of IFO within a run.

For the sequel, we wish to stress the following:

An IFO declaration specifies a single file (simple IFO declaration), or a tuple of files (tuple IFO declaration).

Simple IFO declaration. The IFO are specified in a run specification file formally defined below, with the keyword IFO followed by the option name and possibly a regular expression characterizing the input file names for this option. In the following example, one declares two input files with the option names file-1 and file-2 , the input files being any text file for the former and any PDB file for the latter:

IFO file-1 ".*.txt"
IFO file-2 ".*.pdb"

In the specification of files for IFO, as done above, and also while calling the method SBL::Batch_manager::BM_Batch::load_dataset (see below), one uses Python regular expressions, not file specifications with wildcards as done in a shell. For example, for PDB files with extension .pdb, one writes .*.pdb and not *.pdb.
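The difference between the two notations can be checked with the standard library alone. The sketch below assumes that a file is selected when the regular expression matches a sub-string of its name (`re.search`); the filenames are made up for the illustration.

```python
import fnmatch
import re

names = ["1vfb.pdb", "notes.txt", "1vfb.pdb.bak"]

# Shell-style wildcard, shown for contrast only.
shell_matches = [n for n in names if fnmatch.fnmatch(n, "*.pdb")]

# The regular expression used in the text: the dots match any character
# and nothing anchors the pattern at the end of the name.
loose = re.compile(r".*.pdb")
loose_matches = [n for n in names if loose.search(n)]

# A stricter pattern: literal dot, anchored at the end of the name.
strict = re.compile(r"\.pdb$")
strict_matches = [n for n in names if strict.search(n)]

# The shell pattern is not a valid regular expression.
try:
    re.compile("*.pdb")
    wildcard_is_valid_regex = True
except re.error:
    wildcard_is_valid_regex = False
```

In this sketch the loose pattern also selects 1vfb.pdb.bak, which is why the example specification files use the anchored form.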

Tuple IFO declaration. When using, say, two simple IFO declarations, both input files are totally independent, that is, there is no association between them. Sometimes, one wishes to group input files, for example by pairs of files with a common prefix. This is possible by specifying tuples of IFO:

IFO (file-1, file-2) "common regex" ("regex_1", "regex_2")


In this case, pairs of input files are filtered such that both files match the common regular expression in exactly the same way, and each file is further specified by its particular regular expression. Note that the regular expressions have to be specific enough: if there is an ambiguity on how several files should be associated within a tuple, an error is returned.

Advanced. The tuples are constructed with the following strategy: for each regular expression in parentheses, the list of files having a sub-string matching the corresponding regular expression is filtered. Then, for each element, the sub-list of filtered files having a sub-string matching the common regular expression is computed. At this step, the matching sub-string for each file in the sub-list is stored. The tuples of files are then constructed by associating files having the exact same matching sub-string for each element of the tuple. Note that if there are multiple files with the same matching sub-string for the same element, tuples cannot be constructed in a unique way, and an error is returned. This can be avoided by setting more specific regular expressions.
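The strategy can be sketched as follows. This is a hypothetical re-implementation written from the description above, not the code of the package: the function name `build_ifo_tuples` and the file names are assumptions, and keys for which some element has no file are silently dropped, a case the description does not cover.

```python
import re


def build_ifo_tuples(filenames, common_regex, element_regexes):
    """Group files into tuples sharing the same match of common_regex."""
    common = re.compile(common_regex)
    per_element = []
    for regex in element_regexes:
        element = re.compile(regex)
        keyed = {}
        for name in filenames:
            # Keep the files matching the regex of this element.
            if not element.search(name):
                continue
            match = common.search(name)
            if match is None:
                continue
            # Store the sub-string matched by the common regex.
            key = match.group(0)
            if key in keyed:
                raise ValueError(f"ambiguous association for {key!r}")
            keyed[key] = name
        per_element.append(keyed)
    # Associate the files having the exact same matching sub-string.
    shared = set(per_element[0])
    for keyed in per_element[1:]:
        shared &= set(keyed)
    return [tuple(keyed[k] for keyed in per_element) for k in sorted(shared)]


files = ["run1_points.txt", "run1_weights.txt",
         "run2_points.txt", "run2_weights.txt"]
pairs = build_ifo_tuples(files, r"run\d+", [r"points\.txt", r"weights\.txt"])
```

Here `pairs` associates the points and weights files of run1, then those of run2. Two points files sharing the prefix run1 would raise the error mentioned above.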

Specification: Dataset. Specified by the name of the directory defining the dataset.

Specification: Name-value-pairs. There are five types of rules to specify the name-value-pairs. We illustrate each of them with an example:

IFO-ASSOC-RULE none

• none rule : when a program has no IFO, there is no association rule.
IFO-ASSOC-RULE unary

• unary rule : this case covers two sub-cases:

Case 1: for a simple IFO declaration: the unary rule states that a single file is used as input.

Case 2: for a tuple IFO declaration: the unary rule states that a unique/single tuple of input files is used as input.

IFO-ASSOC-RULE cartesian-product <arity>

• cartesian-product rule : in this case, k>1 IFO specifications (simple or tuple) are provided. The cartesian product between the name-value pairs for these k declarations is taken.
IFO-ASSOC-RULE combinatorial <arity>

• combinatorial rule : this case is identical to the cartesian product, except that the order of the name-value pairs does not matter. Therefore, only the unordered combinations are generated. Note that since the order does not matter, the k IFO declarations must be consistent. To enforce this, the user is forced to use a single IFO declaration (simple or tuple).
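The difference between the last two rules can be illustrated with `itertools`. The sketch below is an illustration, not the package's implementation; the file names are made up, and it assumes that the combinatorial rule never pairs a file with itself, which the text does not state.

```python
from itertools import combinations, product

# cartesian-product with arity 2: two IFO declarations, ordered pairs.
txt_files = ["a.txt", "b.txt"]
pdb_files = ["x.pdb", "y.pdb", "z.pdb"]
cartesian = list(product(txt_files, pdb_files))

# combinatorial with arity 2: a single IFO declaration, unordered pairs.
combinatorial = list(combinations(pdb_files, 2))

print(len(cartesian), len(combinatorial))
```

With these inputs, the cartesian product yields 2 x 3 = 6 runs, while the combinatorial rule yields the 3 unordered pairs of PDB files.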

IFO restrictions. When using the cartesian-product rule, one would like to form only a subset of tuples of input files. For example, only files starting with the same keyword should form a tuple of input files for a given run. To do so, one can use the IFO-REGEX specification:

IFO-REGEX "regex"


This additional specification will constrain all input files of a given run to match identically the input regular expression.
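One possible reading of this constraint is sketched below: a tuple of input files is kept only if the regular expression matches the same sub-string in every file. The helper `same_match`, the file names and the regular expression are hypothetical.

```python
import re
from itertools import product


def same_match(files, regex):
    """Return True when every file yields the same match for regex."""
    pattern = re.compile(regex)
    matches = [pattern.search(name) for name in files]
    if any(m is None for m in matches):
        return False
    return len({m.group(0) for m in matches}) == 1


ligands = ["1vfb_ligand.pdb", "1igt_ligand.pdb"]
receptors = ["1vfb_receptor.pdb", "1igt_receptor.pdb"]

# All pairs produced by the cartesian-product rule.
candidates = list(product(ligands, receptors))
# Pairs kept when the four-character prefix must be identical.
kept = [pair for pair in candidates if same_match(pair, r"^[0-9a-z]{4}")]
```

Out of the four candidate pairs, only the two pairs whose files share the same prefix remain.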

### Non-input File Options

When the argument of an option is not an input file, the option is termed a Non-input File Option (NFO). The following examples illustrate the three ways to declare a NFO:

• by declaring only a name:
NFO verbose

• by declaring a name followed by one or more values separated by spaces:
NFO prefix alice bob

• by declaring one tuple of names followed by tuples of values with the same number of elements, so as to instantiate the options of the first tuples with the values of the remaining ones (the assignment respects the linear ordering):
NFO (numvertices, numedges) (10, 9)  (100, 99)  (1000, 999)


Note that the last two options yield a list of values. If several NFO options are declared, the cartesian product of these lists is taken, therefore defining the list of possible combinations of NFO.
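As an illustration, the three declarations above can be combined as follows. The data layout (lists of alternatives, each alternative being a list of name-value pairs) is an assumption made for this sketch and need not reflect the internal representation of the package.

```python
from itertools import product

# Each NFO declaration yields a list of alternatives; each alternative is
# itself a list of (name, value) pairs.
nfo_verbose = [[("verbose", "")]]
nfo_prefix = [[("prefix", "alice")], [("prefix", "bob")]]
nfo_graph = [
    [("numvertices", 10), ("numedges", 9)],
    [("numvertices", 100), ("numedges", 99)],
    [("numvertices", 1000), ("numedges", 999)],
]

# Cartesian product of the three lists, flattened into one list of pairs
# per combination.
combinations_of_nfo = [
    [pair for alternative in choice for pair in alternative]
    for choice in product(nfo_verbose, nfo_prefix, nfo_graph)
]
```

The three declarations offer 1, 2 and 3 alternatives respectively, hence 6 combinations of NFO.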

### Run Specification File

A run specification file is a file specifying the IFO and NFO options just discussed. Each command line of the specification file starts with a keyword, and there are four such keywords:

• PROGRAM : specifies the path to the program to run.
• IFO-ASSOC-RULE : specifies the rule for associating the IFO (see section Input File Options).

Example run specification files are given in the section Examples.

### Batches

As defined in Introduction, a batch is a set of coherent executions, as specified in the run specification file.

If these executions involve a dataset, the files in the dataset are used to fill input file options values using the association rules specified in the run specification file.

More precisely, given a dataset and a run specification file, a run specification ensemble is constructed as follows:

• First, the association rule is used on the dataset to create the IFO part of all run specifications.
• Second, the values provided for the NFO are used to create the NFO part of all run specifications.
• Finally, the set of all run specification tuples is formed by taking the cartesian product of the previous two ensembles.

A batch yields an output directory containing the output files, the directory name being automatically inferred. The default output directory name is the name of the program without extension, followed by the name of the directory of the dataset. For example, if one wants to run the program P.exe with the dataset data, the name of the output directory will be P-data.

In fact, when running a batch, the output directory is created, and the program is run from this output directory.
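The naming convention can be sketched with `os.path`. The function name `default_output_directory` is hypothetical, and the dash separator is inferred from the example P-data given above.

```python
import os


def default_output_directory(program_path, dataset_directory):
    """Program name without extension, a dash, then the dataset directory name."""
    program = os.path.splitext(os.path.basename(program_path))[0]
    # normpath removes a possible trailing separator before taking the name.
    dataset = os.path.basename(os.path.normpath(dataset_directory))
    return program + "-" + dataset


print(default_output_directory("bin/P.exe", "data/"))
```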

### Splitting Batches

A batch can be split into smaller batches, in four ways:

• split per NFO: each resulting batch is invariant w.r.t all the NFO values.
• split per IFO: each resulting batch is invariant w.r.t all the input files.
• split per option: each resulting batch is invariant w.r.t the value associated to an input option name.
• split per list of options: each resulting batch is invariant w.r.t the combination of values associated to the option names present in the list passed. (That is, as opposed to the previous case, the batch fixes the values for several options.)

When splitting a batch, each resulting batch has its own output directory: the name of the output directory of a split batch is the one of the parent batch, followed by the name-value-pairs of the invariant options. See section Coupling Data Generation and Processing with two Specification Files for an example.
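Conceptually, splitting amounts to grouping the run specification tuples by the values of the selected options. The sketch below illustrates this idea on plain lists; `split_per_options` is a hypothetical helper, not the method of the package.

```python
from collections import defaultdict


def split_per_options(run_tuples, option_names):
    """Group run tuples by the values taken by the selected options."""
    groups = defaultdict(list)
    for run_tuple in run_tuples:
        values = dict(run_tuple)
        # The key holds the name-value pairs that are invariant in the group.
        key = tuple((name, values.get(name)) for name in option_names)
        groups[key].append(run_tuple)
    return dict(groups)


runs = [
    [("f", "1vfb.pdb"), ("radius", "0")],
    [("f", "1vfb.pdb"), ("radius", "1.4")],
    [("f", "1igt.pdb"), ("radius", "0")],
    [("f", "1igt.pdb"), ("radius", "1.4")],
]
per_radius = split_per_options(runs, ["radius"])
```

Each key of `per_radius` lists the invariant name-value pairs, which is the information appended to the output directory name of the corresponding split batch.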

## Input: Dataset and Run Specifications

Creating a batch requires two inputs:

• the input dataset: it is loaded using the method SBL::Batch_manager::BM_Batch::load_dataset. This method takes three arguments: (i) the directory of the dataset, (ii) a regular expression matching all the files to be included in the dataset, and (iii) a flag indicating whether sub-directories should be searched recursively to collect the files. The following command loads a dataset from the directory input-data with all the PDB files it contains (extension .pdb), but without searching in the sub-directories.
batch.load_dataset("input-data", ".*.pdb", False)

• the run specification file: it specifies the path to the program, the list of IFO with an association rule, and the list of NFO and their possible values. A run specification file is loaded with the method SBL::Batch_manager::BM_Batch::load_run_specification. This method takes as unique argument the path to the run specification file. The following command loads a run specification file of name run-specification.txt in the current directory.
batch.load_run_specification("run-specification.txt")


## Output: Runs of Batches

Once a batch is created and associated to a dataset and a run specification ensemble, one can run the batch, i.e. run all the executions listed in the run specification ensemble. This is done using the method SBL::Batch_manager::BM_Batch::run:

batch.run()


If it does not exist, the output directory is created; it is then filled with the output of the different executions. The method SBL::Batch_manager::BM_Batch::print_batch dumps into the console the list of command line executions, without running them:

batch.print_batch()


The method SBL::Batch_manager::BM_Batch::get_lists_of_run_options returns the lists of options for each run in the batch as name-value pairs when a value is specified, and as a simple name when there is no value associated to the option:

nvps = batch.get_lists_of_run_options()


It is also possible to run the batch such that each execution is repeated a number of times. This feature is of special interest to generate several instances of random data, to be used to probe an algorithm. The corresponding method is SBL::Batch_manager::BM_Batch::repeat:

batch.repeat(10)


In this case, each output data directly created in the output directory of the batch is suffixed with _instance_n , where n is replaced by the number of the instance of the execution that generated this data.

If one wants to create scripts for running each execution separately (e.g. when using another system for running the batch, such as QSUB), the method make_scripts creates one Python script for each execution:

batch.make_scripts()


This will print in the output directory of the batch one file per execution with the name run_<output_directory_name>_<index> , where the index identifies the execution among all the executions of the batch.

Finally, it is possible to split a batch as mentioned in section Splitting Batches, and to run the resulting batches separately:

for b in batch.split_per_NFO():
    b.run()

for b in batch.split_per_IFO():
    b.run()

for b in batch.split_per_selected_option("option-name"):
    b.run()

for b in batch.split_per_selected_options(["option-name-1", "option-name-2"]):
    b.run()


In all cases, one output directory is created per batch with the convention of the section Batches.

Consider the case where the specification file contains NFO, e.g. NFO a a-one a-two, where the function split_per_NFO() is not called, and where the program called does not incorporate the values of the NFO options in the output filenames. In that case, the name of the output directory does not contain the values of the NFO, and neither do the names of the files found in that directory. In other words, executions differing only by their NFO values write files with identical names in the same directory, so that the files produced are lost. To fix this issue, there are two options: either use split_per_NFO(), or make sure that the option values appear in the output filenames.

## Examples

In the following, we present four examples using Batch_manager: the first one manually specifies a batch; the second one uses a run specification file to specify calculations; the third one uses two specification files to couple a step of data creation and a step of data processing; the fourth one uses more complex association rules.

### Specifying the Runs of a Batch

This example shows how to specify one run of an application computing the volume of a molecular structure, and run the batch.

• batch-simple-example.py : python script using the Batch_manager for computing the volume of one given molecular structure.

### Specifying a Batch over a Dataset using one Specification File

This example shows how to specify a batch using a run specification file and a dataset, then how to split the batch, and then run the split batches. The first file corresponds to the run specification file of the application computing the volume of a molecular structure. The second file is the python script running the batches.

• batch-vorlume-pdb.spec : specification of the computation of the volume of a molecular structure.
• batch-dataset-example.py : python script using the Batch_manager for computing the volume of molecular structures in a given dataset.

### Coupling Data Generation and Processing with two Specification Files

In this example, Batch_manager is first used to generate data using one program (dumping a collection of balls into a file), and then to process these data using a second program (computing the volume of the union of balls generated).

This pipeline requires two specification files, namely one to generate the input balls, and one to compute the volumes.

The four files below are as follows: first, the python script generating the balls; second and third, the two specification files, and fourth, the main python script.

• generate-random-balls-3.py : python script for generating random 3D balls with centers located in a bounding box.
• batch-generate-balls-3.spec : specification of the execution of the random 3D balls generator.
• batch-vorlume-txt.spec : specification of the execution of the volume computation of an union of 3D balls.
• batch-create-and-run-example.py : python script using the Batch_manager for generating random 3D balls and computing the volume of their union.

### Using more complex association rules

In this example, Batch_manager is first used to generate data using one program (dumping a collection of 2D points into a file), and then to process these data using a second program (clustering those points following their proximity in space). A Python script is then called for visualizing the clusters. Finally, the clusters obtained from the different data are compared two by two.

There are two tricks in this pipeline.

The visualisation script requires three input files :

• the original points file
• the clusters file, each cluster being represented by an index associated to each point in the original points file.
• the centers file, listing points representing each cluster

If one wants to run the clustering several times with different parameter values, then we obtain several clusters and centers files: we need to specify how these files are associated. Moreover, if several original points files are generated, we need to associate each original points file to a set of tuples of clusters and centers.

The comparison program requires two input graphs as input, each graph being represented by three input files :

• the points file specifying the points to be compared–these are vertices of a graph.
• the weights file which assigns a weight to each point (the size of the corresponding cluster).
• the edges file which links points (i.e. clusters).

To compare two clusterings, we therefore need to compare all possible combinations of those tuples.

This pipeline requires four specification files, namely one to generate the input points, one to compute the clusters, one to visualize them, and one to compare the clusters.

The five files below are as follows: the four specification files and the main python script.

• batch-generate-points-2.spec : specification of the execution of the random 2D points generator.
• batch-cluster.spec : specification of the execution of the cluster machine.
• batch-cluster-visu.spec : specification of the execution of the visualization script.
• batch-compare-clusters.spec : specification of the execution of the comparison of the clusters.
• batch-create-run-compare-example.py : python script using the Batch_manager for generating random 2D points, computing clusters and comparing those clusters.

# Programmer's Workflow

In addition to SBL::Batch_manager::BM_Batch, the Batch_manager package involves four main classes:

# FAQ

## My output files are created in the dataset directory

As mentioned in the section Programmer's Workflow, when loading a dataset, the absolute path to each input file is stored. When setting the name of an output file from the name of an input file, make sure to remove the directory prefix, so as to keep the filename only.
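A minimal sketch of this advice, using `os.path` only; the path and the naming scheme of the output file are made up for the illustration.

```python
import os

input_file = "/path/to/dataset/1vfb.pdb"  # hypothetical absolute path

# Keep the filename only, then derive the output name from it.
stem = os.path.splitext(os.path.basename(input_file))[0]
output_file = stem + "-volume.txt"  # hypothetical naming scheme

print(output_file)
```

Since `output_file` carries no directory prefix, it is created relative to the directory from which the program is run, i.e. the output directory of the batch.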

# Jupyter demo

See the following jupyter demos:

• Jupyter notebook file
• Batch_manager

# Batch_manager: Managing multiple runs

## Specifying the Runs of a Batch

This first example shows how to specify one run of an application computing the volume of a molecular structure, and run the batch. In this example, all options are passed directly to the batch manager.

In [1]:
from SBL import Batch_manager
from Batch_manager import *
print("Marker : Calculation Started")
batch = BM_Batch()
batch.run_specifications.run("results", 1)
print("Marker : Calculation Ended")

Marker : Calculation Started
Running : /user/fcazals/home/projects/proj-soft/sbl-install/bin/sbl-vorlume-pdb.exe --output-prefix --log -f/home/fcazals/projects/proj-soft/sbl/Applications/Batch_manager/demos/data/1vfb.pdb
Marker : Calculation Ended


## Specifying a Batch over a Dataset using one Specification File

In general, it is more convenient to store all options in a specification file. This example shows how to specify a batch using a run specification file and a dataset, then how to split the batch, and then run the split batches. The specification file corresponds to the computation of the volume of a molecular structure. The content of the specification file batch-vorlume-pdb.spec is shown below:

#The executable.
EXECUTABLE sbl-vorlume-pdb.exe
#Any .pdb file from the dataset is eligible for the option -f, as specified by the python regex.
#NB: the dataset name is specified when constructing the batch in the python script.
IFO f "\.pdb$"
#There is a unique IFO, with one execution per value i.e. file.
IFO-ASSOC-RULE unary
#These are the Non File Options for the executable.
NFO log
NFO output-prefix
NFO verbose
NFO no-water
NFO radius 0 1.4
NFO p 4

## This specification file is used as follows from python

In [ ]:
from SBL import Batch_manager
from Batch_manager import *
print("Marker : Calculation Started")
batch = BM_Batch()
batch.load_dataset("data", ".*", True)
if batch.load_run_specification("batch-vorlume-pdb.spec"):
    batches = batch.split_per_NFO()
    for b in batches:
        b.run()
print("Marker : Calculation Ended")

Marker : Calculation Started
Loading the run specification from batch-vorlume-pdb.spec
Building one batch per NFO tuple...
Building the batch for sbl-vorlume-pdb.exe...
Building the batch for sbl-vorlume-pdb.exe...
Running : /user/fcazals/home/projects/proj-soft/sbl-install/bin/sbl-vorlume-pdb.exe -f/home/fcazals/projects/proj-soft/sbl/Applications/Batch_manager/demos/data/1vfb.pdb --log --output-prefix --verbose --no-water --radius=0 -p4
Running : /user/fcazals/home/projects/proj-soft/sbl-install/bin/sbl-vorlume-pdb.exe -f/home/fcazals/projects/proj-soft/sbl/Applications/Batch_manager/demos/data/1igt.pdb --log --output-prefix --verbose --no-water --radius=0 -p4
Running : /user/fcazals/home/projects/proj-soft/sbl-install/bin/sbl-vorlume-pdb.exe -f/home/fcazals/projects/proj-soft/sbl/Applications/Batch_manager/demos/data/1urz.pdb --log --output-prefix --verbose --no-water --radius=0 -p4

## Coupling Data Generation and Processing with two Specification Files

In this example, Batch_manager is first used to generate data using one program (dumping a collection of balls into a file), and then to process these data using a second program (computing the volume of the union of balls generated). This pipeline requires two specification files, namely one to generate the input balls, and one to compute the volumes.

The specification file batch-generate-balls-3.spec to generate balls goes as follows:

#The executable.
EXECUTABLE generate-random-balls-3.py
#There is no Input File Option.
IFO-ASSOC-RULE none
#The S and s parameters are specified by pairs of values: five pairs here.
NFO (number, max-radius) (100, 1) (200, 2) (500, 3) (1000, 4) (10000, 5)
#The centers of the 3D balls are generated in [0,10] x [0,10] x [0,10].
NFO box-size 10

The computation of the volumes is specified as follows in batch-vorlume-txt.spec:

#The executable.
EXECUTABLE sbl-vorlume-txt.exe
#Any .txt file from the dataset is eligible for the option -f, as specified by the python regex.
#NB: the dataset name is specified when constructing the batch in the python script.
IFO f "\.txt$"
#There is a unique IFO, with one execution per value i.e. file.
IFO-ASSOC-RULE unary
#These are the Non File Options for the executable.
NFO log
NFO output-prefix
NFO verbose

## As before, these two spec files are used by the batch manager:

In [ ]:
from SBL import Batch_manager
from Batch_manager import *
print("Marker : Calculation Started")
batch_data = BM_Batch()
output_dirs = []

for b in batch_data.split_per_NFO():
    b.repeat(10)
    output_dirs.append(b.get_output_directory())

batches = []
for directory in output_dirs:
    batches.append(BM_Batch())
    batches[-1].run()
print("Marker : Calculation Ended")


## Using more complex association rules

In this example, Batch_manager is first used to generate data using one program (dumping a collection of 2D points into a file), and then to process these data using a second program (clustering those points following their proximity in space). Finally, a data comparison is performed (the clusters of points are compared two by two).

If one wants to run the clustering several times with different parameter values, then we obtain several clusters and centers files: we need to specify how these files are associated. Moreover, if several original points files are generated, we need to associate each original points file to a set of tuples of clusters and centers.

The comparison program requires two input graphs as input, each graph being represented by three input files :

• the points file specifying the points to be compared–these are vertices of a graph.

• the weights file which assigns a weight to each point (the size of the corresponding cluster).

• the edges file which links points (i.e. clusters).

To compare two clusterings, we therefore need to compare all possible combinations of those tuples.

This pipeline requires three specification files:

• specification of the execution of the random 2D points generator batch-generate-points-2.spec

#The executable.
EXECUTABLE generate-random-points-2.py
#There is no Input File Option.
IFO-ASSOC-RULE none
#The S and s parameters are specified by pairs of values: five pairs here.
#NFO (N,d) (10,2) (10,3)
NFO N 1000
NFO d 2 4 6
• specification of the execution of the cluster machine batch-cluster.spec

#The executable.
EXECUTABLE sbl-cluster-MTB-euclid.exe
#There is a unique IFO, with one execution per value i.e. file.
IFO-ASSOC-RULE unary
IFO points-file "\.txt$"
NFO num-neighbors 4
NFO persistence-threshold 0.1
# output files with proper prefix
NFO o
• specification of the execution of the comparison of the clusters batch-compare-clusters.spec

#The executable.
EXECUTABLE sbl-emd-graph-euclid.exe
#We wish to make all combinations of pairs of the next IFO
IFO-ASSOC-RULE combinatorial 2
IFO (vertex-points-file, vertex-weights-file, edges-file) "N\d+-d\d+.*persistence_0dot\d+" ("points\.txt", "weights\.txt", "edges\.txt")
NFO v
NFO u
NFO with-connectivity-constraints 1 0
In [ ]:
from SBL.Batch_manager import *
print("Marker : Calculation Started")
batch = BM_Batch()
batches_d = batch.split_per_selected_option("N")
odirs_data = []
for b in batches_d:
    b.run()
    odirs_data.append(b.get_output_directory())
odirs_clust = []
for directory in odirs_data:
    batch = BM_Batch()
    batch.run()
    odirs_clust.append(batch.get_output_directory())
for directory in odirs_clust:
    batch = BM_Batch()