Template C++ / Python API for developing structural bioinformatics applications.
User Manual
Script_design
Authors: F. Cazals
Python scripting for computer experiments
Goals. This package provides guidelines and simple classes to write Python scripts which are simple, effective, versatile and maintainable.
The typical setup is to handle calculations for an individual file located in a given directory, or for all files contained in a directory, and to store the results coherently.
As examples, one may consider the processing of a set of PDB files, of a set of txt/csv files representing input matrices for numpy, etc.
Wishlist. As suggested by its logo, this package proposes a coherent framework to handle:
Command line options (CLI options): provide a well-documented set of command line options, enabling calculations in a variety of situations ranging from a single file to an entire directory.
Parallelism: enable calculations in parallel to exploit all cores of a machine.
Output files: deliver properly named serialized files / XML files.
User interaction: enable user interactions and a suitable feedback mode, possibly logged in a log file, even if the parallel mode is activated.
Statistical analysis: provide a statistical analysis mode exploiting the computed XML files.
We assume the files generated by the execution of all instances are twofold: one file per individual run, and files (plots, csv files, etc) presenting statistics for all runs.
Tex report: possibly provide a report mode delivering a tex report containing statistics, tables and figures, ready for compilation.
Terminology. The following terms are used in the sequel:
One run and the class Params_my_application: a single execution is managed via a specification encoded in an instance of the class Params_my_application.
My_algorithm: the class implementing the algorithm at hand, whose execution is fully determined by the parameters available in one instance of Params_my_application.
My_application: the class tying all pieces together, in particular the execution and the statistics.
Design
Overview
Our solution hinges on a careful management of the options passed to the script (the aforementioned params_one_run) and of the way they are stored/used by the script.
The goal is to set up a class Params_my_application handling coherently all the requirements mentioned in the previous section.
The design of this class (Fig. fig-params-hierarchy) uses multiple inheritance and two types of class initialization provided by Python:
an explicit initialization with __init__(), taking a list of explicit arguments.
an automatic initialization using a variable number of arguments passed via the **kwargs mechanism. Importantly, the dictionary converted to **kwargs is the one returned by the command line parser.
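The **kwargs route can be illustrated with a minimal sketch. It assumes the command line parser is argparse; the class below is only a stand-in for Params_zero, and the options --patterns and --verbose-level are illustrative, not taken from the package.

```python
import argparse


class Params_zero:
    """Stand-in parameter class (assumption) initialized from keyword arguments."""

    def __init__(self, **kwargs):
        # Store every parsed option as a data member, e.g. self.patterns.
        for name, value in kwargs.items():
            setattr(self, name, value)


def build_parser():
    # Illustrative options only; a real script defines its own.
    parser = argparse.ArgumentParser(description="Toy option parser")
    parser.add_argument("--patterns", nargs="+", default=[])
    parser.add_argument("--verbose-level", type=int, default=0)
    return parser


# vars() converts the argparse Namespace into a plain dictionary,
# which is then unpacked into keyword arguments.
options = vars(build_parser().parse_args(["--patterns", "class", "import"]))
params = Params_zero(**options)
print(params.patterns, params.verbose_level)
```

Adding an option to the parser then makes it available as a data member without touching the constructor.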
SBL::Params_base::Params_IOI (for InputOutput_Interaction): the base class handling (i) paths and filenames for outputs, and (ii) all interaction parameters, i.e. verbose/view/logfile. Note that the latter requires in particular inferring the logfile filename, which depends upon the input parameters and in particular on the input filename processed.
SBL::Params_base::Params_base: the base class to inherit from to enjoy automatic handling of output and interaction modes.
Params_my_application(Params_base, Params_zero): the class used to store all the parameters used for one run. It inherits from Params_base, Params_zero, which are properly initialized as detailed below.
Options of an application, and the hierarchy of classes used to handle parameters of an application.
The rationale to use two different init mechanisms for the classes from Fig. fig-params-hierarchy is as follows:
__init__() is used when a small number of arguments is involved.
**kwargs is used when a potentially larger number of arguments is provided. It also allows setting and accessing the data members with the functions
The class SBL::Params_handler::Params_IOI manages the input filename information and the output file path. These pieces of information are used to provide a unified and consistent mechanism to infer the filenames into which dumps are performed.
As evidenced by the code below, two important mechanisms are:
Output directory path stored in self.m_odpath. Instantiating the class requires an input file (via ifpath) or an explicit (absolute/relative) output directory. The output directory path is set from three options:
case 1: absolute output directory passed: use it,
case 2: relative output directory path passed: use as subdir of self.m_odpath,
case 3: output directory is self.m_idpath + "/results".
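The three cases can be sketched as a standalone function. This is not the package's code: the name resolve_odpath is hypothetical, and the sketch assumes that a relative output directory is interpreted relative to the directory of the input file (or to the current directory when no input file is given).

```python
import os


def resolve_odpath(ifpath=None, odpath=None):
    """Return the output directory following the three cases listed above.

    Assumption: a relative odpath is taken relative to the directory
    containing the input file, or to the current directory otherwise.
    """
    if ifpath is None and odpath is None:
        raise ValueError("need an input file or an output directory")
    idpath = os.path.dirname(os.path.abspath(ifpath)) if ifpath else os.getcwd()
    if odpath and os.path.isabs(odpath):
        return odpath                        # case 1: absolute path, use it
    if odpath:
        return os.path.join(idpath, odpath)  # case 2: relative path
    return os.path.join(idpath, "results")   # case 3: default location


print(resolve_odpath("data/1abc.pdb"))
print(resolve_odpath("data/1abc.pdb", "run1"))
```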
Function get_ofpath() to obtain an output file path. This important function is used to obtain the filepath, i.e. directory + filename, for a given output. This function combines four pieces of information:
the output directory just discussed,
the input filename,
the parameters of the run,
the specific suffix used for this dump, for example -stat.txt, -diagram.png, -features.csv, -graph.odt, etc.
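A minimal sketch of how these four pieces might be combined is given below. The signature and the naming scheme (input stem, then sorted key-value pairs, then suffix) are assumptions made for illustration; the actual get_ofpath() of the package may differ.

```python
import os


def get_ofpath(odpath, ifpath, params, suffix):
    """Illustrative stand-in: build directory + filename for one dump."""
    # Input filename without directory and extension, e.g. "1abc".
    stem = os.path.splitext(os.path.basename(ifpath))[0]
    # Encode the run parameters in the name, in a deterministic order.
    tag = "-".join(f"{key}-{params[key]}" for key in sorted(params))
    name = f"{stem}__{tag}{suffix}" if tag else f"{stem}{suffix}"
    return os.path.join(odpath, name)


print(get_ofpath("results", "data/1abc.pdb", {"radius": 1.4}, "-stat.txt"))
```

Encoding the parameters in the filename keeps the outputs of runs with different settings on the same input file from overwriting each other.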
The SBL::Params_base::Params_IOI class provides a function log_exists() which is very useful in case one would like to relaunch calculations for cases where something went wrong and the logfile was not generated. This function is naturally inherited by the user's class Params_my_application.
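The relaunch scenario can be sketched as follows. The function log_exists() below is a simplified stand-in taking an explicit path, and the log naming convention (input stem followed by -log.txt) is an assumption; the helper files_to_relaunch() is hypothetical.

```python
import os
import tempfile


def log_exists(log_path):
    """Simplified stand-in for log_exists(): True if the log file is present."""
    return os.path.isfile(log_path)


def files_to_relaunch(input_paths, log_dir):
    # Assumed naming convention: <input stem>-log.txt inside log_dir.
    todo = []
    for path in input_paths:
        stem = os.path.splitext(os.path.basename(path))[0]
        if not log_exists(os.path.join(log_dir, stem + "-log.txt")):
            todo.append(path)
    return todo


with tempfile.TemporaryDirectory() as log_dir:
    # Pretend that the run on a.pdb completed and left a log file.
    open(os.path.join(log_dir, "a-log.txt"), "w").close()
    pending = files_to_relaunch(["a.pdb", "b.pdb"], log_dir)
    print(pending)
```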
User interaction
We assume the user interaction is specified as follows:
verbose_level : an integer from 0 to 3:
0: no information dumped whatsoever
1: main steps
2: with intermediate info
3: with full details
The specification of what is high level or not is left at the discretion of the developer.
view : a boolean flag stating whether interactive displays should be activated – e.g. 3D graphics or plots
logflag : a flag indicating whether the information dumped according to the verbose level should go to the terminal or to a log file. This latter functionality is implemented by the write() function, which is compatible with multiprocessing.
These data members are explicitly initialized from the constructor, see below.
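As an illustration, here is a minimal sketch of such an interaction class. The constructor arguments, the level argument of write() and the log_path member are assumptions; the package's own class may be organized differently.

```python
import os
import sys
import tempfile


class Params_interaction:
    """Illustrative sketch of the interaction parameters (assumed layout)."""

    def __init__(self, verbose_level=1, view=False, logflag=False, log_path=None):
        self.verbose_level = verbose_level  # integer from 0 to 3
        self.view = view                    # interactive displays on or off
        self.logflag = logflag              # True: write to the log file
        self.log_path = log_path

    def write(self, message, level=1):
        # Messages above the requested verbosity are dropped.
        if level > self.verbose_level:
            return False
        if self.logflag and self.log_path:
            # Opening in append mode for each message avoids sharing a file
            # handle between worker processes.
            with open(self.log_path, "a") as log:
                log.write(message + "\n")
        else:
            sys.stdout.write(message + "\n")
        return True


with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "run-log.txt")
    inter = Params_interaction(verbose_level=2, logflag=True, log_path=log_path)
    inter.write("main step", level=1)
    inter.write("full details", level=3)  # dropped: level 3 > verbose_level 2
    with open(log_path) as log:
        logged = log.read()
```

With one log file per input file, each worker process only ever appends to its own log.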
Assembling the final class
Let us recap how these pieces are glued together – see snippet below:
Write an options parser and pass the options parsed to the constructor of My_application.
In the function run() of that class, collect the files to be processed from the CLI. Then, create one instance of Params_my_application by passing the arguments as follows:
(i) explicitly the arguments to handle input/output (class Params_IOI) and interaction (class Params_interaction)
(ii) the remaining arguments via options (class Params_my_application)
Run the individual calculations from the class My_application.
Statistical analysis. We assume the results of all executions are available as:
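The steps above can be sketched with toy classes. Everything here is a simplified assumption: the hierarchy is flattened, the option names (ifpaths, odpath, factor, ...) are illustrative, and the runs are sequential.

```python
class Params_IOI:
    def __init__(self, ifpath, odpath):
        self.m_ifpath = ifpath
        self.m_odpath = odpath


class Params_interaction:
    def __init__(self, verbose_level, view, logflag):
        self.verbose_level = verbose_level
        self.view = view
        self.logflag = logflag


class Params_zero:
    def __init__(self, **kwargs):
        for name, value in kwargs.items():
            setattr(self, name, value)


class Params_my_application(Params_IOI, Params_interaction, Params_zero):
    def __init__(self, ifpath, odpath, verbose_level, view, logflag, **options):
        # (i) explicit arguments for input/output and interaction
        Params_IOI.__init__(self, ifpath, odpath)
        Params_interaction.__init__(self, verbose_level, view, logflag)
        # (ii) all remaining options via **kwargs
        Params_zero.__init__(self, **options)


class My_algorithm:
    def __init__(self, params):
        self.params = params

    def run(self):
        # Toy computation: length of the input path times a factor.
        return len(self.params.m_ifpath) * self.params.factor


class My_application:
    def __init__(self, options):
        self.options = dict(options)

    def run(self):
        opts = dict(self.options)
        ifpaths = opts.pop("ifpaths")
        keys = ("odpath", "verbose_level", "view", "logflag")
        explicit = {key: opts.pop(key) for key in keys}
        results = []
        for ifpath in ifpaths:  # one Params_my_application per run
            params = Params_my_application(ifpath, **explicit, **opts)
            results.append(My_algorithm(params).run())
        return results


options = {"ifpaths": ["a.pdb", "bb.pdb"], "odpath": "results",
           "verbose_level": 1, "view": False, "logflag": False, "factor": 2}
results = My_application(options).run()
print(results)
```

Because each run is fully described by its own parameter instance, the loop body could be handed to a pool of workers without further changes to the parameter classes.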
Case 1: a list of instances of a class Results, see Section Parallel execution,
Case 2: a list of files, say xml files, generated by the individual runs. In that case, their content is easily parsed using xml tools, see the PALSE.
In both cases, it is straightforward to compute statistics, plots, etc, and dump them properly into files whose paths are provided by any instance of params_one_run, via inheritance from the class Params_IO.
Tex reporting. Assume the execution of My_application has generated a list of tex files. One can simply invoke the function figs_in_tabular() from the class SBL::SBL_pytools::SBL_pytools_latex to generate complex inclusions of figures within tabular environments, for example.
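For Case 2, a minimal sketch using the standard library module xml.etree.ElementTree (rather than PALSE) could look as follows. The XML layout, with a single score element per run, is an assumption made for illustration, and in-memory strings stand in for the per-run files.

```python
import statistics
import xml.etree.ElementTree as ET

# Two toy XML documents standing in for the per-run result files.
xml_runs = [
    "<result><score>1.5</score></result>",
    "<result><score>2.5</score></result>",
]


def collect_scores(xml_strings, tag="score"):
    """Extract one float per run from the element named tag."""
    values = []
    for text in xml_strings:
        root = ET.fromstring(text)
        node = root.find(tag)
        if node is not None:
            values.append(float(node.text))
    return values


scores = collect_scores(xml_runs)
summary = {"n": len(scores), "mean": statistics.mean(scores), "max": max(scores)}
print(summary)
```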
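To give an idea of what such a helper produces, here is a sketch of a hypothetical function tabular_of_figures(). It is not figs_in_tabular(), whose signature is not reproduced here; it only shows how figures might be arranged in a tabular environment.

```python
def tabular_of_figures(fig_paths, ncols=2, width="0.45\\linewidth"):
    """Hypothetical helper: lay out figures in a LaTeX tabular, ncols per row."""
    lines = ["\\begin{tabular}{" + "c" * ncols + "}"]
    for start in range(0, len(fig_paths), ncols):
        row = fig_paths[start:start + ncols]
        cells = ["\\includegraphics[width=" + width + "]{" + p + "}" for p in row]
        # Pad the last row so that every row has ncols cells.
        cells += [""] * (ncols - len(row))
        lines.append(" & ".join(cells) + " \\\\")
    lines.append("\\end{tabular}")
    return "\n".join(lines)


tex = tabular_of_figures(["a.png", "b.png", "c.png"], ncols=2)
print(tex)
```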
Examples
As illustrations, the reader may consult:
This package. The script $SBL_DIR/Applications/Script_design/scripts/sbl-scriptdesign-test.py is a simple script illustrating the previous machinery. It does the following:
Takes as input a list of patterns, e.g. "class" "import" "static", and searches for them in a list of files with suffixes in {".txt", ".hpp", ".cpp", ".py"}.
Stores the number of occurrences of each pattern in each file in a dictionary.
Serializes the dictionary as xml.
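The three steps can be sketched with the standard library. This is not the code of sbl-scriptdesign-test.py: the function names and the XML layout are assumptions, and an in-memory dictionary of file contents replaces the files found on disk.

```python
import xml.etree.ElementTree as ET


def count_patterns(texts, patterns):
    """Map each file name to {pattern: number of occurrences}."""
    return {name: {p: text.count(p) for p in patterns}
            for name, text in texts.items()}


def counts_to_xml(counts):
    """Serialize the nested dictionary as an XML string."""
    root = ET.Element("counts")
    for name, per_pattern in counts.items():
        file_node = ET.SubElement(root, "file", name=name)
        for pattern, n in per_pattern.items():
            ET.SubElement(file_node, "pattern", name=pattern).text = str(n)
    return ET.tostring(root, encoding="unicode")


# File contents keyed by file name, standing in for files read from disk.
texts = {"a.py": "import os\nclass A: pass\nclass B: pass\n"}
counts = count_patterns(texts, ["class", "import"])
xml_text = counts_to_xml(counts)
print(xml_text)
```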
Here is the self-execution and its output
sbl-scriptdesign-test.py --idpath SBL_DIR/Applications/Script_design --patterns class import elif