Structural Bioinformatics Library
Template C++ / Python API for developing structural bioinformatics applications.
User Manual

Module_base

Authors: F. Cazals and T. Dreyfus

Introduction

Rationale. Numerous applications resort to the same data processing. For example, computing the volume of a molecular model merely requires computing the volume of a union of balls, whatever the semantics (atoms, pseudo-atoms) of these balls. To make such routine operations available off the shelf, we use modules. Modules are small pieces of C++ code which can be chained within a workflow to define a whole application. In short, a module is characterized by three main features:

  • A module is a class implementing a transformation between inputs and outputs.
  • A module is a generic independent block of instructions, occurring in an application represented by a directed graph called the workflow. This graph is anchored at a specific vertex called the start vertex, and may contain cycles in case a sequence of operations is repeated. Within an application, the modules are recursively executed by visiting the vertices of the graph following a variant of a depth first algorithm started at the start vertex (see remark below).
  • The modules are connected in the workflow using directed edges, each edge representing a method passing the desired input data from the source module to the target module. These data are attributes of the modules, and can be objects (i.e. class instances) or pointers.
Topological sorting is classically used to execute the tasks of a workflow represented by a directed acyclic graph. The workflows from the SBL are slightly different as they may contain cycles. To overcome this difficulty, we rather use a depth first algorithm initiated at the start vertex and using properties of specific vertices of the workflow (conjunction and conditional vertices), see section Structure and modification of workflows .


Importance of modules and workflow. Connecting modules using a directed graph which defines a workflow yields immediate key functionalities:

  • Options management. Command-line options are grouped, with a distinction between those common to all applications (i.e. available from the workflow itself) and those embedded in specific modules; see section End-user overview: generic options for applications. In particular, the latter options can be grouped by key steps of the application.

Overview. In the following, we first explain how to customize existing applications using modules and workflows in section Customizing Applications. We then explain how to create one's workflow in section Developing Workflows with Existing Modules, and finally how to create new modules in section Developing Modules.

Customizing Applications

Design overview

To go straight to the point, let us review the main steps involved in defining an application, based on modules and a workflow:

Step 1. For each class performing a significant step in the application, one defines the module associated with this class. Existing modules are listed in SBL: existing modules.

Step 2. Because our goal is to handle coherently cases where the objects manipulated are the same, yet with different semantics (see the example of balls in Introduction), we assemble the required types within the workflow traits class and the workflow class:

  • Define template<class WorkflowTraitsTypes> T_Workflow_traits, or plainly Traits if there are no template parameters.
  • Define template<class WorkflowTraits> T_Workflow, or plainly Workflow if there are no template parameters.

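Step 3. Assemble the application by successively registering the modules, connecting them, and initializing them (see section Developing Workflows with Existing Modules).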

See the example below.

End-user overview: generic options for applications

Applications in the SBL intensively use modules to re-use core components involved in several applications, and to easily customize a given application to custom data models. Static versions of the executables are available at the SBL Applications page, or can be compiled directly from the library using CMake .

From the end-user standpoint, modules offer handy features, namely the possibility to :

  • display the workflow of the application, in order to see the main steps (i.e modules) of the algorithm used in the application,

  • display the help which groups the options by steps of the application.

In addition to specific options inherent to a given application, the following generic options—implemented in the workflow data structure—are offered for all applications:

  • --help : prints on the standard output all the command-line options of the application, with a brief description.

  • --config-file : reads the command-line options from a configuration file as specified in the Boost Program Options documentation.

  • --workflow : prints the workflow as a directed graph, namely a .dot file to be visualized with Graphviz. The command for converting the .dot file into a .pdf file is printed on the standard output.

  • --verbose : dumps high level statistics:

    • 0 : default option: no output gets printed on the console.

    • 1 (run level): the program prints intermediate statistics on the tasks performed.

    • 2 (final statistics): the program prints the final statistics.

    • 3 (all, implicit behavior if -v is used): run level + final statistics.

  • --log : redirects the standard output into a log file (except for the help and the workflow). It sets the verbose level to final statistics, unless the verbose level is specified separately.

  • --colored-log : colors the log (red for module and loader names, green for statistics, blue for report).

  • --directory-output : enforces the creation of all output files into the specified directory. NB: the directory must exist.

  • --output-prefix : if specified with an argument, uses the specified prefix for all output files; if no prefix is specified, generates an output prefix by concatenating the specified option names and the associated values.

  • --uid : adds to the end of the output prefix a unique identifier, corresponding to the local date and time, at the micro-second time scale.

  • --report-at-end : reports all the output files after the execution of all modules in the workflow, rather than after each module execution.

  • --report-options : reports an XML archive listing all the options used for the current calculation.
(Advanced.) Consider the case where a collection of items stored in an array / a vector is processed. While in computer science indices usually start at 0, they commonly start at 1 in biology. To accommodate both variants, selected applications of the SBL use the internal data structure SBL::IO::T_Index_wrapper for replacing the array indices used during calculations. Internally, indices always start at 0. But if the option –indices-from-1 is used, indices reported in output statistics and files start at 1.
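
To picture the index convention just described, here is a minimal sketch of such a wrapper. It is not the code of SBL::IO::T_Index_wrapper: the class name Index_wrapper_sketch and its member functions are hypothetical, and the only property taken from the text is that the stored index is always 0-based while the reported one may be shifted by 1.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for an index wrapper (not the SBL class itself):
// the stored value is always 0-based, only the reported value is shifted.
class Index_wrapper_sketch {
public:
    // Offset applied when reporting: 0 by default, 1 when the user asks
    // for indices starting at 1 (the role played by indices-from-1).
    static std::size_t& reporting_offset() {
        static std::size_t offset = 0;
        return offset;
    }

    explicit Index_wrapper_sketch(std::size_t zero_based) : m_index(zero_based) {}

    // Value to use in calculations (array access, etc.).
    std::size_t internal() const { return m_index; }

    // Value to print in statistics and output files.
    std::size_t reported() const { return m_index + reporting_offset(); }

    std::string to_string() const { return std::to_string(reported()); }

private:
    std::size_t m_index;
};
```

Keeping the shift in a single place means that the algorithms never see 1-based indices; only the reporting code calls reported().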


The two options --output-prefix and --uid add a common prefix to all output files of a given execution, identifying the output of a process. The resulting prefixes are as follows:

                               Without uid                                With uid
Without output prefix          <application-name>                         <application-name>__<uid>
With implicit output prefix    <application-name>__<input-option-values>  <application-name>__<input-option-values>__<uid>
With explicit output prefix    <explicit-prefix>                          <explicit-prefix>__<uid>
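
The naming rules above can be summarized by a small function. This is only a sketch: make_output_prefix and its parameters are hypothetical names that do not belong to the SBL, and the "__" separator is taken from the cases listed above.

```cpp
#include <string>

// Hypothetical helper mirroring the six cases listed above.
// - prefix_requested: true if the output prefix option was given;
// - explicit_prefix: its argument, or "" for an implicit prefix;
// - option_values: concatenation of option names and values (implicit case);
// - uid: unique identifier, or "" if none was requested.
std::string make_output_prefix(const std::string& application_name,
                               bool prefix_requested,
                               const std::string& explicit_prefix,
                               const std::string& option_values,
                               const std::string& uid) {
    std::string prefix;
    if (!prefix_requested)
        prefix = application_name;                        // no output prefix
    else if (!explicit_prefix.empty())
        prefix = explicit_prefix;                         // explicit prefix
    else
        prefix = application_name + "__" + option_values; // implicit prefix
    if (!uid.empty())
        prefix += "__" + uid;                             // uid always comes last
    return prefix;
}
```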
Oftentimes, running the same program with different arguments should result in file names containing selected argument values, to avoid overwriting. This is the raison d'être of the output prefix. If the output prefix does not contain the options of interest, proceed as follows to add them:
  • Run the program with --help to see which module declares the option of interest.
  • Locate the get_output_prefix(void) const function of that module, and enrich it.


A module developed for a particular application typically inherits from Module_base. In particular, inheritance gives access to the various generic options discussed above. Consider for example the verbose level. Once its value has been retrieved from Module_base, it can be passed to a specific algorithm so as to selectively dump the relevant information either into the std::cout stream, or in a file if verbose level 0 has been chosen.


Customizing applications: workflow's traits class

In C++, a traits class is a data structure defining types. In the generic programming paradigm, traits are used to group the definitions of types used in template classes.

In our case, template classes are the modules of the SBL, and the traits classes used to instantiate them provide the required types.

All applications in the SBL were developed following the same guidelines :

  • Workflow class. The workflow base class SBL::Modules::T_Module_based_workflow specifies the common features to all workflows (building the workflow, executing the modules within the workflow, adding common command-line options to a program, etc.). This workflow template class is parameterized by a traits class, and the workflow class of any application inherits from it.

  • Traits class. A Workflow class uses a single traits class defining all the types used by the modules in the workflow. The type definitions fall into two categories: the biophysical models and the other types. Generally, all biophysical models are conceptualized, i.e. they are the template parameters of the traits class, so that the traits class is itself a template class.

  • Application source file. There is one source (.cpp) file for each program of the application using different biophysical models; the source file instantiates the traits template class with particular models, and then instantiates the workflow template class with the defined traits class.

Thus, adapting an existing application to a given biophysical model merely requires modifying the biophysical model definition in the source. To ease this process, these biophysical models generally match biophysical concepts in Models, so that the corresponding concept has a full documentation (user manual and reference manual), and a list of predefined models.
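
The three guidelines above can be illustrated on a toy case. The sketch below does not use any SBL class: Atom, Pseudo_atom, T_Toy_workflow_traits and T_Toy_workflow are hypothetical names, chosen only to show how a traits template parameterized by a model feeds a workflow template, and how the source file picks the model.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Two interchangeable "biophysical models" (hypothetical, for illustration).
struct Atom        { double radius; static std::string kind() { return "atom"; } };
struct Pseudo_atom { double radius; static std::string kind() { return "pseudo-atom"; } };

// Traits template: the model is a template parameter, and the other types
// are defined from it.
template <class ParticleModel>
struct T_Toy_workflow_traits {
    typedef ParticleModel              Particle;
    typedef std::vector<ParticleModel> Particles_container;
};

// Workflow template: it only sees the types exported by its traits class.
template <class WorkflowTraits>
class T_Toy_workflow {
public:
    typedef typename WorkflowTraits::Particle            Particle;
    typedef typename WorkflowTraits::Particles_container Particles_container;

    void add(const Particle& p) { m_particles.push_back(p); }

    // Sum of the radii of all particles added so far.
    double total_radius() const {
        double sum = 0.0;
        for (std::size_t i = 0; i < m_particles.size(); ++i)
            sum += m_particles[i].radius;
        return sum;
    }

    std::string describe() const { return Particle::kind(); }

private:
    Particles_container m_particles;
};

// "Application source file": swapping the model is a one-line change.
typedef T_Toy_workflow<T_Toy_workflow_traits<Atom> >        Atomic_workflow;
typedef T_Toy_workflow<T_Toy_workflow_traits<Pseudo_atom> > Coarse_grain_workflow;
```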

As an example of workflow, consider the application Space_filling_model_interface_finder, which computes the binary interfaces between the chains or domains of a molecule, and outputs a graph representing those interfaces. The workflow for this application is displayed just below. The traits class of the workflow is T_Space_filling_model_interface_finder_traits and is templated by four concepts:

  • ParticleTraitsBase : class providing a base representation of a particle (atom or pseudo-atom), see ParticleTraits ;

  • PartnerLabelTraits : class providing a hierarchy of labels to assign the particles to partners. Interfaces are indeed sought between partners identified by these labels. See details in MolecularSystemLabelsTraits ;

  • ParticleAnnotator : class providing an annotation system attached to the particles, either for annotations directly used in the calculations (e.g the radii of particles), or for user defined annotations not used for calculations but delivered in the output files, see ParticleAnnotator ;

  • ParticlesBuilder : a functor (i.e a class providing the operator "()") building the particles from an input data structure to the data structure provided by ParticleTraitsBase , see ParticleTraits .

Each of these concepts has models provided by packages in Models, so that producing programs of the same application in a different context can be a matter of replacing one line in the corresponding source file.

Developing Workflows with Existing Modules

We now discuss how to create an application using modules, which encompasses 3 main steps:

  • registering modules: adding the module to a workflow. Practically, a module object is created and the corresponding node is added to the workflow (a graph).
  • connecting modules: adding edges to the previous graph, and defining how the information circulates along these edges.
  • initializing modules: setting the initial values of the attributes of each module, if necessary.

In the sequel, we provide a simple program using modules, and further discuss the three steps just described.

A simple example

This first example involves a toy program using a workflow with only one module.

The program just creates a 100x100 grid of 2D points, and stores the points in a spatial search engine, to be used for example to report the nearest neighbors of a given point. It uses in particular the module from the package Spatial_search. This module is templated by a Traits class defining two types:

  • Points_container : the type representing the container of 2D points;

  • Distance : a functor (i.e. a class defining the operator "()") computing the distance between two 2D points.

To embed a module into a workflow, we define a class inheriting from SBL::Modules::Module_based_workflow, which

The following example illustrates these steps on our simple application:


The constructor of the base workflow takes a string as argument representing the application name. This name is used to prefix output files, if any.


In the main() method, the possible command-line options are parsed with the method SBL::Modules::Module_based_workflow::parse_command_line, then the method SBL::Modules::Module_based_workflow::start effectively starts the workflow execution with its first module.


Executing this program without any option yields the help of the program. An effective execution requires at least one option to be specified, e.g the –verbose option (which is off by default).

Of particular interest is the --workflow option, which does not execute the workflow but instead creates a Graphviz file representing the workflow. Once processed by Graphviz, an image file of the workflow is obtained, giving an overview of the application at a glance.

In the sequel, we detail the aforementioned steps.

Defining the traits class

The workflow's traits class is described in section Customizing applications: workflow's traits class .

Registering the modules

Registering a module consists in creating the corresponding node in the workflow graph. This is done using the method SBL::Modules::Module_based_workflow::register_module .

Selecting starting modules

This step consists in specifying which module(s) are executed first. This is done using the method SBL::Modules::Module_based_workflow::set_start_module. Note that the workflow can start with multiple modules rather than one : in such a case, the execution order of these modules is arbitrary.

Initializing the input of a module

In the example of section A simple example , the attributes of the module have been initialized by directly setting them. It actually turns out that there are three ways to initialize the input of a module:

  • Initialization by setting attributes. One just calls a function modifying attribute(s) of the module.

  • Initialization via the connection of two modules. In this case, the output of a source module becomes the input of a target module.

  • Initialization from loaded data. In this case, the initialization involves data loaded from a file into the module, the required information (filename, possible options) coming from the command-line options.

We now detail the latter two options.

Initialization via the connection of two modules.

This is done using the method SBL::Modules::Module_based_workflow::make_module_flow . This method takes four arguments : the vertex representing the source module, the vertex representing the target module, a linker function allowing to initialize the target module from the source module, and optionally a name to display over the arc representing the connection in the workflow's graph display.

(Advanced.) In order to be as generic as possible, the linker method to be passed to SBL::Modules::Module_based_workflow::make_module_flow can be a pointer to a static method, or a functor (i.e a class defining the operator "()"). In this latter case, the functor has to be a template specialization of the class SBL::Modules::T_Join_functor . The functors are particularly useful when a linker method requires parameters external to the input and output modules.
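
To make the two kinds of linkers concrete, the following sketch stores either a plain function or a functor behind the same call signature. It is an illustration only: it relies on std::function rather than on SBL::Modules::T_Join_functor, and the names Points_generator, Search_engine, copy_points, Scaled_copy and Flow are hypothetical.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical source and target modules (illustrative names only).
struct Points_generator { std::vector<double> points; };
struct Search_engine    { std::vector<double> input; };

// A linker initializes the target module from the source module.
typedef std::function<void(Points_generator*, Search_engine*)> Linker;

// Variant 1: a plain function.
void copy_points(Points_generator* source, Search_engine* target) {
    target->input = source->points;
}

// Variant 2: a functor carrying a parameter external to both modules.
struct Scaled_copy {
    double factor;
    explicit Scaled_copy(double f) : factor(f) {}
    void operator()(Points_generator* source, Search_engine* target) const {
        target->input.clear();
        for (std::size_t i = 0; i < source->points.size(); ++i)
            target->input.push_back(factor * source->points[i]);
    }
};

// Hypothetical arc of the workflow graph: source, target, linker, and the
// optional name displayed over the arc.
struct Flow {
    Points_generator* source;
    Search_engine*    target;
    Linker            link;
    std::string       label;

    // Passes the data along the arc.
    void transfer() const { link(source, target); }
};
```

The functor variant is the one to prefer when the transfer depends on a value, here the scaling factor, that neither module owns.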


Initialization from loaded data.

Data can be loaded from files using special classes called loaders. The framework of modules provides a number of loader classes loading data from a file into main memory. All loaders specialize the virtual class SBL::IO::Loader_base by possibly re-implementing the method SBL::IO::Loader_base::load . A loader can be independently used, or can be stored as an external property of the workflow graph. In doing so, command-line options related to the data to be loaded are added to all other command-line options of the workflow. Storing a loader as an external property is done using the method SBL::Modules::Module_based_workflow::add_loader

The following program loads the points from a file specified in the command-line options, and fills the search engine :

Two things should be noticed :

There are two types of registration in a workflow: module registration consists in creating a node in the workflow, while loader registration simply means that the loader is called when the method SBL::Modules::Module_based_workflow::load is called.


Note that the help of the executable has been enriched with the options provided by the loader.
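
The loader mechanism can be pictured with a short sketch. It is written under simplifying assumptions: Loader_sketch, Points_2d_loader and load_from_text are hypothetical names, the interface is not that of SBL::IO::Loader_base, and the data are read from a stream rather than from a file named on the command line.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical loader interface: one virtual load method to re-implement.
class Loader_sketch {
public:
    virtual ~Loader_sketch() {}
    virtual bool load(std::istream& in) = 0;
};

// Loader for 2D points stored as whitespace-separated "x y" pairs.
class Points_2d_loader : public Loader_sketch {
public:
    std::vector<std::pair<double, double> > points;

    // Returns false if no point could be read.
    bool load(std::istream& in) {
        points.clear();
        double x, y;
        while (in >> x >> y)
            points.push_back(std::make_pair(x, y));
        return !points.empty();
    }
};

// Convenience for examples: load from an in-memory text.
bool load_from_text(Loader_sketch& loader, const std::string& text) {
    std::istringstream in(text);
    return loader.load(in);
}
```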

Structure and modification of workflows

This section describes the algorithm processing the workflow, the semantics of the vertices and edges of the workflow, and how to alter the default behavior of the workflow processing.

The basic traversal algorithm

The workflow is represented by a directed graph where vertices are modules and arcs are connections between modules. Note that the graph also contains the start vertex, which is not associated with any module: it merely serves as the starting point of the workflow processing. As mentioned in section Introduction, the algorithm used for processing the workflow is a variant of a depth first traversal algorithm. The algorithm uses a recursion stack initialized with the start vertex, and executes the following steps while the stack is not empty:

  • the top element of the stack is popped and inspected : for any vertex which is not the start vertex, its corresponding module is executed – the start vertex has no associated module.
  • for each out-going arc of the current vertex, the target module is initialized with the data from the module of the current vertex, and the target vertex is pushed on the top of the stack.
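
The two steps above can be sketched as follows. This is a toy reimplementation for illustration, not the SBL workflow class: Toy_workflow, register_module, connect and start are hypothetical names, modules are reduced to callables, and vertex 0 plays the role of the start vertex.

```cpp
#include <cstddef>
#include <functional>
#include <stack>
#include <vector>

// Hypothetical miniature workflow; vertex 0 is the start vertex (no module).
struct Toy_workflow {
    struct Arc {
        std::size_t           target;  // target vertex of the arc
        std::function<void()> link;    // initializes the target (may be empty)
    };

    std::vector<std::function<void()> > modules;   // modules[v]: module of vertex v
    std::vector<std::vector<Arc> >      out_arcs;  // out_arcs[v]: arcs leaving v

    Toy_workflow() : modules(1), out_arcs(1) {}    // slot 0 is the start vertex

    // Adds a vertex for the given module and returns its identifier.
    std::size_t register_module(std::function<void()> run) {
        modules.push_back(run);
        out_arcs.push_back(std::vector<Arc>());
        return modules.size() - 1;
    }

    // Adds an arc from u to v, with an optional linker.
    void connect(std::size_t u, std::size_t v,
                 std::function<void()> link = std::function<void()>()) {
        Arc arc = {v, link};
        out_arcs[u].push_back(arc);
    }

    // Stack-based traversal initiated at the start vertex.
    void start() {
        std::stack<std::size_t> todo;
        todo.push(0);
        while (!todo.empty()) {
            const std::size_t u = todo.top();
            todo.pop();
            if (u != 0) modules[u]();        // the start vertex has no module
            for (std::size_t i = 0; i < out_arcs[u].size(); ++i) {
                const Arc& arc = out_arcs[u][i];
                if (arc.link) arc.link();    // initialize the target module
                todo.push(arc.target);
            }
        }
    }
};
```

Because the vertices are handled through a stack, the target pushed last is executed first. Note also that nothing in this sketch prevents an infinite loop on a cyclic graph.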

As we shall see below, this basic algorithm can be tuned using specific modules and the associated operations. To describe them, we first list the data structures used for vertices and edges of the workflow.

Data structures

what are the different vertices (module_base, predefined, user defined, etc...)

same for edges

Operations on workflow

Operations are functionalities of the workflow that alter the base behavior of the traversal algorithm. These operations are necessary for fully representing the SBL applications. There are five operations, represented by keywords:

  • OR : a vertex is pushed on the recursion stack each time one of its ancestors is visited (this is the default behavior).
  • AND : a vertex is pushed on the recursion stack only if its two in-going vertices have been visited.
  • OPT : when visiting a vertex, the associated module is executed iff a user-defined tag has been set to true; if not, the out-going vertices are not pushed on the stack.
  • IF : when visiting a vertex, only one of its out-going vertices is pushed on the stack, depending on a predicate value.
  • FOR : when visiting a vertex, the associated module will be duplicated and executed for a determined set of input data.
The OR operation corresponds to the default behavior of the traversal algorithm : in other words, it is simply represented by multiple in-going edges on a module.


Except the OR operation, all operations modify the algorithm behavior. In the following, we describe how the algorithm is modified for each operation, and how to use these operations.

AND : conjunction module

Rationale.

Recall that when a module has several in-going modules, it is visited immediately after the execution of any one of them, which corresponds to the OR operation. However, some steps of the workflow may require that all the preceding steps have already been executed. To enforce the visit of a module only after all its in-going modules have been visited, one has to resort to conjunction modules.

Use case.

A conjunction module is represented by the class SBL::Modules::T_Module_conjunction . Practically, the AND operation requires

  • (i) creating the conjunction module from its two in-going vertices u and v:
SBL::Modules::Module_based_workflow::Vertex conj = this->make_conjunction(u, v);
  • (ii) creating an edge from the conjunction module to one or several output module(s).

Accessing the in-going modules is done through the methods SBL::Modules::T_Module_conjunction::get_conjunction_1 and SBL::Modules::T_Module_conjunction::get_conjunction_2 .
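
The bookkeeping behind the AND operation can be sketched with two flags. The class Conjunction_gate below is hypothetical and much simpler than SBL::Modules::T_Module_conjunction; it only illustrates the rule that the conjunction vertex is pushed on the stack once both in-going vertices have been visited.

```cpp
// Hypothetical bookkeeping for a conjunction vertex with two in-going vertices.
class Conjunction_gate {
public:
    Conjunction_gate() : m_first_done(false), m_second_done(false) {}

    // To be called when in-going module number `which` (1 or 2) has been
    // visited. Returns true iff both in-going modules have now been visited,
    // i.e. the conjunction vertex may be pushed on the stack.
    bool notify(int which) {
        if (which == 1) m_first_done = true;
        if (which == 2) m_second_done = true;
        return m_first_done && m_second_done;
    }

private:
    bool m_first_done;
    bool m_second_done;
};
```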

OPT : optional module

Rationale.

In a workflow, some of the steps may be optional, i.e some modules should be executed only if a user specifies so. We need a way to add a command-line option for the user to specify if the module should be executed or not.

Use case.

A boolean tag is associated to all vertices of the workflow's graph via an external property. This association is done using a map from the vertices to the tag. Note that vertices are represented by successive integers so that access to the tag of a vertex is done in constant time. When processing the stack of the workflow, if the tag of the top vertex is false, the vertex is simply discarded with no processing.

By default, the tag of all vertices is true. However, when a module is optional, the default value of the tag is false. It is only after the command-line options are parsed that a tag of an optional module can be set to true.

An existing module is made optional using the method SBL::Modules::Module_based_workflow::make_optional_module . This method turns the default tag value to false, and adds a corresponding command-line option. Considering the previous example on spatial search engines, to make the module Spatial_search_module optional, the following line is added to the construction of the workflow in the previous example :

this->make_optional_module(v, "search-engine", "Run the spatial search module.");
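
The tag mechanism can be sketched as follows. The class Optional_tags and its member functions are hypothetical; the sketch only reproduces the behavior described above: tags default to true, an optional module defaults to false, and a command-line option switches the tag back to true.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical tag store: one boolean per vertex, indexed by vertex identifier.
class Optional_tags {
public:
    explicit Optional_tags(std::size_t num_vertices) : m_run(num_vertices, true) {}

    // Makes vertex v optional: off by default, switched on by `option`.
    void make_optional(std::size_t v, const std::string& option) {
        m_run[v] = false;
        m_option_to_vertex[option] = v;
    }

    // To be called for each option found while parsing the command line.
    void enable(const std::string& option) {
        std::map<std::string, std::size_t>::const_iterator it =
            m_option_to_vertex.find(option);
        if (it != m_option_to_vertex.end())
            m_run[it->second] = true;
    }

    // Constant-time test applied to the vertex popped from the stack.
    bool should_run(std::size_t v) const { return m_run[v]; }

private:
    std::vector<bool>                  m_run;
    std::map<std::string, std::size_t> m_option_to_vertex;
};
```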

IF : conditional module

Rationale.

To implement an If-Then-Else on modules, we use a specific module equipped with a predicate, so as to choose which next module should be executed.

Use case.

A conditional module is created with the method SBL::Modules::Module_based_workflow::make_condition.

This method creates an object of type SBL::Modules::T_Module_condition , which has up to two out-going modules (to implement an If or an If-Then-Else).

The method SBL::Modules::Module_based_workflow::make_condition requires up to eight arguments:

  • the vertex u representing the in-going module in the workflow,

  • the predicate P that is a functor taking as argument a pointer to the in-going module, and returning a boolean value;

  • the vertex v corresponding to the action to be executed if the predicate evaluates to True,

  • the updater of the input from u to v (as the functor in the method SBL::Modules::Module_based_workflow::make_module_flow),

  • (optional) the vertex w corresponding to the action to be executed if the predicate evaluates to False,

  • (optional) the updater of the input from u to w (as the functor in the method SBL::Modules::Module_based_workflow::make_module_flow),

  • (optional) a name for the predicate for displaying the condition module in the printed workflow, if any,

  • (optional) a name to display along the arc from u to the vertex of the condition module.

The following example involves a conditional loop to build an approximate spatial search engine for a collection of 2D points (loaded from a file here), meeting specific criteria. It implements a module computing the sum, over all points from the DB, of the distance from each point to its nearest neighbor. If the sum is less than a threshold, the algorithm halts; otherwise, it rebuilds the search engine.

Two of the features used call for comment:

  • The class Is_invalid_engine is the predicate governing the re-construction: its input is the module before the condition, and it uses the output of this module for computing the predicate value.

  • The method SBL::Modules::Module_based_workflow::make_condition creates the conditional module and the arcs linking the different involved vertices. Note that there is no module associated to the Else statement – nothing happens if the predicate evaluates to False. Note also that the method has explicit template parameters which are the input module type, and the output module type. In the case where a second output module exists, its type is added to the template parameters of the method.
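
The logic of such a conditional loop, with a "Then" branch looping back and no "Else" branch, can be sketched independently of the SBL. In the code below, Engine_builder, Is_above_threshold and run_while are hypothetical names; the module merely halves an error measure at each run, standing in for the re-construction of the search engine.

```cpp
#include <cstddef>

// Hypothetical module: each run halves an error measure.
struct Engine_builder {
    double error;
    Engine_builder() : error(8.0) {}
    void run() { error /= 2.0; }
};

// Predicate: a functor taking a pointer to the in-going module.
struct Is_above_threshold {
    double threshold;
    explicit Is_above_threshold(double t) : threshold(t) {}
    bool operator()(const Engine_builder* module) const {
        return module->error > threshold;
    }
};

// Runs the module, then loops back to it as long as the predicate holds
// ("Then" branch); stops otherwise (no "Else" branch).
// Returns the number of executions of the module.
template <class Module, class Predicate>
std::size_t run_while(Module& module, const Predicate& predicate) {
    std::size_t executions = 0;
    do {
        module.run();
        ++executions;
    } while (predicate(&module));
    return executions;
}
```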

FOR : collection module

Rationale.

We wish to repeat the execution of a given module on a collection of input data. This objective is achieved by creating a collection of instances of the module of interest.

Use case.

A collection of modules is an instance of the template class SBL::Modules::T_Modules_collection < Module , SetIndividualInput > . Such a module is registered with the method SBL::Modules::Module_based_workflow::register_module . The template parameters are as follows:

  • Module is the base module to be repeated.
  • SetIndividualInput is a functor taking as input a module of type Module , and an input data structure. This functor sets the input of the module with the given data structure.

When the collection of modules is initialized, the method SBL::Modules::T_Modules_collection::set_individual_inputs calls the previous functor for each of its input data (using an iterator). For each such input, a new instance of the module of interest is created (dynamic instance created in C++ with new), and initialized with this input using the functor SetIndividualInput . Note that the type of the second parameter of the functor must match the value type of the iterator.

When the collection of modules is executed in the workflow, each instance of the modules in the collection is executed.

(Advanced) If the C++ macro SBL_OPENMP is defined, the modules are executed in a parallel for loop using OpenMP. CMake configuration files of the SBL handle all OpenMP dependencies by specifying the CMake variable SBL_OPENMP when configuring a project.


One can iterate on the created modules with the methods SBL::Modules::T_Modules_collection::modules_begin and SBL::Modules::T_Modules_collection::modules_end, or access them directly with the method SBL::Modules::T_Modules_collection::get_module .

The following example loads a set of 2D point collections, and builds a spatial search engine for each such collection.

Two of the features used should be stressed:

  • the class Set_points is the initialization functor. Note that it always has two arguments, the first one being the module, the second one being a user-defined data structure containing the input.

  • the method initialize uses the method SBL::Modules::T_Modules_collection::set_individual_inputs.
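
The collection mechanism can be sketched as follows. The class Modules_collection_sketch is a hypothetical simplification of the collection described above: it stores the module instances by value in a vector instead of allocating them with new, runs them sequentially, and the names Sum_module and Set_values are illustrative.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical base module: sums its input values.
struct Sum_module {
    std::vector<int> input;
    int result;
    Sum_module() : result(0) {}
    void run() {
        result = 0;
        for (std::size_t i = 0; i < input.size(); ++i) result += input[i];
    }
};

// Initialization functor: first argument the module, second one its input.
struct Set_values {
    void operator()(Sum_module& module, const std::vector<int>& data) const {
        module.input = data;
    }
};

template <class Module, class SetIndividualInput>
class Modules_collection_sketch {
public:
    // Creates and initializes one module instance per input in [begin, end).
    // The value type of the iterator must match the second parameter of
    // the functor.
    template <class InputIterator>
    void set_individual_inputs(InputIterator begin, InputIterator end) {
        m_modules.clear();
        for (; begin != end; ++begin) {
            m_modules.push_back(Module());
            SetIndividualInput()(m_modules.back(), *begin);
        }
    }

    // Executes every instance of the collection.
    void run() {
        for (std::size_t i = 0; i < m_modules.size(); ++i) m_modules[i].run();
    }

    std::size_t size() const { return m_modules.size(); }
    const Module& get_module(std::size_t i) const { return m_modules[i]; }

private:
    std::vector<Module> m_modules;
};
```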

Developing Modules

In developing a module, several virtual functions can be customized by developers.

Functions modifying the workflow behavior

Implementing a module requires specializing the base class SBL::Modules::Module_base, which implements all basic functionalities. The class SBL::Modules::Module_base is an abstract class containing the pure virtual method SBL::Modules::Module_base::run. This method is the one called to execute the module within the workflow, and thus needs to be implemented.

In the sequel, we present the remaining functionalities of modules. Note that these functionalities are virtual methods from the class Module_base. Thus, as virtual methods, they may be redefined by the programmer.

SBL::Modules::Module_base::statistics : printing the statistics of the run.

After a module has been executed, it is possible to group all the statistics related to the calculations in the method SBL::Modules::Module_base::statistics . In particular, this method is called only for particular values of the verbose mode (2 and 3). This is particularly useful for time-consuming statistics. By default, this method does nothing.

SBL::Modules::Module_base::is_runnable : checking if a module can be executed.

When particular cases are not handled, the execution of a module may not produce the expected output: in such a case, the modules downstream cannot be executed. The method SBL::Modules::Module_base::is_runnable provides a mechanism to check whether the input of a module is correctly set. By default, this method always returns true.
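
The interplay between run, is_runnable and statistics can be sketched on a toy base class. The code below is an assumption-laden illustration: Module_base_sketch, Mean_module and execute are hypothetical, and the signatures (in particular statistics returning a string) are not those of SBL::Modules::Module_base.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical base class, reduced to three of the hooks discussed above.
class Module_base_sketch {
public:
    virtual ~Module_base_sketch() {}
    virtual void run() = 0;                                     // must be implemented
    virtual bool is_runnable() const { return true; }           // default: always runnable
    virtual std::string statistics() const { return std::string(); }  // default: nothing
};

// Toy module computing the mean of its input values.
class Mean_module : public Module_base_sketch {
public:
    std::vector<double> input;
    double mean;

    Mean_module() : mean(0.0) {}

    // The mean of an empty input is undefined: refuse to run.
    bool is_runnable() const { return !input.empty(); }

    void run() {
        double sum = 0.0;
        for (std::size_t i = 0; i < input.size(); ++i) sum += input[i];
        mean = sum / static_cast<double>(input.size());
    }

    std::string statistics() const {
        return "mean: " + std::to_string(mean) + "\n";
    }
};

// Hypothetical driver: what a workflow might do when visiting a module.
// Statistics are appended to `log` for verbose levels 2 and 3 only.
// Returns false if the module was skipped because it is not runnable.
bool execute(Module_base_sketch& module, unsigned verbose, std::string& log) {
    if (!module.is_runnable()) return false;
    module.run();
    if (verbose == 2 || verbose == 3) log += module.statistics();
    return true;
}
```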

SBL::Modules::Module_base::report : reporting module output into files.

Recording the output data of a module into files is done by re-implementing the method SBL::Modules::Module_base::report. The argument of this method is the prefix generated by the workflow class (see section Customizing Applications). By default, this method does nothing.

Note that the method SBL::Modules::Module_base::report is not called by default in the workflow: to be called, the module has to be set as an "end" module, meaning that it produces final output that should be recorded. To set a module as an "end" module, the workflow method SBL::Modules::T_Module_based_workflow::set_end_module should be used on the vertex representing the module. The command-line option --report-at-end changes when the workflow reports the output into files: either right after the execution of each module (default), or after all modules have been executed.

SBL::Modules::Module_base::set_module_instance_name : naming duplicated modules.

When duplicating a module in the workflow (e.g. within a collection of modules), the output prefix provided by the workflow for reporting the output of a module is the same for every module of the collection: this means that the reported data will be overwritten while reporting the different instances of the same module.

In that case, a specific name can be given to each instance using the method SBL::Modules::Module_base::set_module_instance_name . A given instance can then be retrieved using the method SBL::Modules::Module_base::get_module_instance_name . This is also useful for identifying the instances on the log during the execution of the workflow. By default, this method returns the empty string.

Workflow visualization

SBL::Modules::Module_base::get_name : interacting with modules documentation in the Applications user manuals.

Each user manual from Applications generally provides an interactive image of the workflow of the application concerned. This image is based on Graphviz and corresponds to a ".dot" file produced by the executables of the application itself (see command-line option --workflow).

By interactive, we refer to the fact that the image contains hyper-references for each node (i.e. module) of the workflow. By default, every hyper-link points to the user manual of Module_base. To specify another hyper-reference, e.g the reference manual or the user manual of the module, one needs to redefine the virtual method SBL::Modules::Module_base::get_name, returning the string corresponding to the class name or to the package name. The corresponding manual is inferred from this string.