Structural Bioinformatics Library
Template C++ / Python API for developing structural bioinformatics applications.
User Manual

Script_design

Authors: F. Cazals

Python scripting for computer experiments

Goals. This package provides guidelines and simple classes to write python scripts which are simple, effective, versatile, and maintainable.

The typical setup is to handle calculations for individual files located in a given directory, or for all files contained in that directory, and to store the results coherently.

As examples, one may consider the processing of a set of PDB files, of a set of txt/csv files representing $n\times d$ input matrices for numpy, etc.

Wishlist. As suggested by its logo, this package proposes a coherent framework to handle:

  • Command line options (CLI options): provide a well-documented set of command line options, enabling calculations in a variety of situations ranging from a unique file to an entire directory.
  • Parallelism: enable calculations in parallel to exploit all cores of a machine.
  • Output files: deliver properly named serialized files / XML files.
  • User interaction: enable user interactions and suitable feedback mode–possibly logged in a log file–even if the parallel mode is activated.
  • Statistical analysis: provide a statistical analysis mode exploiting the computed XML files.

We assume the files generated by the execution of all instances are of two kinds: one file per individual run, and files (plots, csv files, etc.) presenting statistics for all runs.

  • Tex report: possibly provide a report mode delivering a tex report containing statistics, tables and figures, ready for compilation.

Terminology. The following terms are used in the sequel:

  • One run and the class Params_my_application: a single execution is managed via a specification encoded in an instance of the class Params_my_application.
  • My_algorithm: the class implementing the algorithm at hand, whose execution is fully determined by the parameters available in one instance of Params_my_application.
  • My_application: the class tying all pieces together, in particular the execution and the statistics.

Design

Overview

Our solution hinges on a careful management of the options passed to the script – the aforementioned params_one_run – and the way they are stored/used by the script.

The goal is to set up a class Params_my_application handling coherently all the requirements mentioned in the previous section.

The design of this class (Fig. fig-params-hierarchy) uses multiple inheritance, and two types of class initialization provided by python:

  • an explicit initialization with __init__(), taking a list of explicit arguments.
  • an automatic initialization using a variable number of arguments passed via the **kwargs mechanism. Importantly, the dictionary converted to **kwargs is the one returned by the command line parser.
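As a minimal sketch of the two mechanisms – the class Params_demo and the option --num-iterations are illustrative, not part of the SBL:

```python
import argparse

class Params_demo:
    # explicit initialization: a small number of explicit arguments
    def __init__(self, verbose_level=0, view=False):
        self.m_verbose_level = verbose_level
        self.m_view = view

    # automatic initialization: each key/value of **kwargs becomes a data member
    def configure(self, **kwargs):
        for key, value in kwargs.items():
            setattr(self, key, value)

parser = argparse.ArgumentParser()
parser.add_argument("--num-iterations", type=int, default=10)
options = parser.parse_args([])  # parse the defaults only, for the sketch

params = Params_demo(verbose_level=1)
params.configure(**vars(options))  # vars() converts the Namespace to a dict
```

Note that vars(options) is precisely the dictionary returned by the command line parser, so new options are forwarded to the params class without touching its code.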

This said, the main python classes provided are:

  • SBL::Params_base::Params_IOI – for Input_Output_Interaction: the base class handling (i) paths and filenames for outputs, and (ii) all interaction parameters, i.e. verbose/view/logfile. Note that the latter requires in particular the inference of the logfile filename – which depends upon the input parameters, in particular the input filename processed.
  • Params_my_application(Params_base, Params_zero): the class used to store all the parameters used for one run. It inherits from Params_base, Params_zero, which are properly initialized as detailed below.

Options of an application, and the hierarchy of classes used to handle parameters of an application.

The rationale to use two different init mechanisms for the classes from Fig. fig-params-hierarchy is as follows:

  • __init__() is used when a small number of arguments is involved.
  • **kwargs is used when a potentially larger number of arguments is provided. It also relies on setting and accessing the data members via the functions
setattr(self, key, value)
getattr(self, aname, None)

Input output filenames

We handle filenames with the class SBL::Params_base::Params_IOI, see Fig. fig-params-hierarchy.

The class SBL::Params_base::Params_IOI manages the input filename information and the output file path. These pieces of information are used to provide a unified and consistent mechanism to infer the filenames into which dumps are performed.

As evidenced by the code below, two important mechanisms are:

Output directory path stored in self.m_odpath. Instantiating the class requires an input file (via ifpath) or an explicit (absolute/relative) output directory. The output directory path is set according to three cases:

  • case 1: an absolute output directory is passed: use it,
  • case 2: a relative output directory path is passed: use it as a subdirectory of the input directory self.m_idpath,
  • case 3: no output directory is passed: use self.m_idpath + "/results".
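The three cases above can be sketched as a stand-alone function – resolve_odpath() is illustrative, not the actual SBL implementation, and it assumes the relative path of case 2 is taken under the input directory:

```python
import os

def resolve_odpath(idpath, odpath=None):
    """Illustrative sketch of the three cases for the output directory path."""
    if odpath and os.path.isabs(odpath):    # case 1: absolute path passed
        return odpath
    if odpath:                              # case 2: relative path passed
        return os.path.join(idpath, odpath)
    return os.path.join(idpath, "results")  # case 3: default subdirectory
```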

Function get_ofpath() to obtain an output file path. This important function is used to obtain the filepath, i.e. directory + filename, for a given output. This function combines four pieces of information:

  • the output directory just discussed,
  • the input filename,
  • the parameters of the run,
  • the specific suffix used for this dump, for example -stat.txt, -diagram.png, -features.csv, -graph.odt, etc.
The SBL::Params_base::Params_IOI class provides a function log_exists() which is very useful in case one would like to relaunch calculations for cases where something went wrong and the logfile was not generated. This function is naturally inherited by the user's class Params_my_application.
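Relaunching only the failed runs can then be sketched as follows – log_exists() is a hypothetical stand-in for the actual method, and plain dictionaries stand in for Params_my_application instances:

```python
import os
import tempfile

def log_exists(params):
    # hypothetical stand-in for Params_IOI.log_exists():
    # a run is deemed completed iff its logfile was generated
    return os.path.isfile(params["logfile"])

with tempfile.TemporaryDirectory() as tmp:
    done_log = os.path.join(tmp, "run-a.log")
    open(done_log, "w").close()  # run a completed: its logfile exists
    params_all_runs = [
        {"logfile": done_log},                        # completed run
        {"logfile": os.path.join(tmp, "run-b.log")},  # failed: no logfile
    ]
    # relaunch only the runs whose logfile is missing
    todo = [p for p in params_all_runs if not log_exists(p)]
```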


User interaction

We assume the user interaction is specified as follows:

  • verbose_level : an integer from 0 to 3:

    • 0: no information dumped whatsoever
    • 1: main steps
    • 2: with intermediate info
    • 3: with full details

    The specification of what is high level or not is left at the discretion of the developer.

  • view : a boolean flag stating whether interactive displays should be activated – e.g. 3D graphics or plots

  • logflag : to indicate whether the information dumped according to the verbose level should go to the terminal or to a log file. This latter functionality is implemented by the write() function – which is compatible with multiprocessing.

These data members are explicitly initialized from the constructor, see below.
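A minimal sketch of these three members – the class Interaction_demo and the level argument of write() are illustrative, and an in-memory buffer stands in for the real log file:

```python
import io

class Interaction_demo:
    # illustrative: member names follow the manual, the details do not
    def __init__(self, verbose_level=1, logflag=False):
        self.m_verbose_level = verbose_level
        self.m_logflag = logflag
        # an in-memory buffer stands in for the real log file
        self.m_logfile = io.StringIO() if logflag else None

    def write(self, msg, level=1):
        if level > self.m_verbose_level:
            return  # message more detailed than requested: drop it
        if self.m_logflag:
            self.m_logfile.write(msg + "\n")
        else:
            print(msg)

params = Interaction_demo(verbose_level=2, logflag=True)
params.write("main step", level=1)     # kept: level <= verbose_level
params.write("full details", level=3)  # dropped: level > verbose_level
```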

Assembling the final class

Let us recap how these pieces are glued together – see snippet below:

  1. Write an options parser and pass the parsed options to the constructor of My_application.

  2. In the function run() of that class, collect the files to be processed from the CLI. Then, create one instance of Params_my_application by passing the arguments as follows:

    1. explicitly, the arguments handling input/output (class Params_IOI) and interaction (class Params_interaction);
    2. the remaining arguments via options (class Params_my_application).

  3. Run the individual calculations from the class My_application.
parser = argparse.ArgumentParser(description='My parser',formatter_class=argparse.ArgumentDefaultsHelpFormatter, allow_abbrev=False)
parser.add_argument("--ifpaths", nargs='*', default=[], dest="ifpaths", help="Input file paths")
...
options = parser.parse_args()
proj = My_application(options)
proj.run()
The class SBL::Params_base::SBL_base_parser can be used as a base parser or to cherry pick selected options.


# Using Params_base::SBL_base_parser, case 1: use as base parser
base_parser = SBL_base_parser.get_base_parser()
parser = argparse.ArgumentParser(description="My specific parser", parents=[base_parser], formatter_class=argparse.ArgumentDefaultsHelpFormatter)

# Using Params_base::SBL_base_parser, case 2: cherry pick selected options
base_parser = SBL_base_parser.get_base_parser()
parser = argparse.ArgumentParser(description="My specific parser", formatter_class=argparse.ArgumentDefaultsHelpFormatter)
# pick up the options of interest, skipping those replaced or unused
for action in base_parser._actions:
    if '--idpath' in action.option_strings:  # replaced by --idpaths
        continue
    if '--repeats' in action.option_strings:
        continue
    parser._add_action(action)

Classes: selected details

For reference, we provide the main pieces of code of the aforementioned classes.

class Params_zero:
    def configure(self, needed, **kwargs):
        for key, value in kwargs.items():
            if key in needed:
                setattr(self, key, value)
        # create with None the attributes absent from **kwargs
        for k in needed:
            if not hasattr(self, k):
                setattr(self, k, None)

    def get_attr(self, aname):
        return getattr(self, aname, None)
from dataclasses import dataclass

@dataclass
class IOI_tuple:
    ifpath: str = ""
    odpath: str = ""
    verbose_level: int = 0
    viewflag: bool = False
    logflag: bool = False
import os
import re

class Params_IOI:
    def __init__(self, ioi_tuple: IOI_tuple):
        ...
        self.configure_paths(ioi_tuple.ifpath, ioi_tuple.odpath)
        ...

    def _get_ofpath(self, params_signature, suffix):
        """
        Compose the output file path from the four-tuple (odpath, ofname, params_sig, suffix).
        Note that params_sig is assembled by a derived class to exploit all relevant params in the file naming.
        """
        # output directory: create it if need be
        os.makedirs(self.m_odpath, exist_ok=True)
        # input params: replace dots, which would interfere with the file suffix
        params_sig = re.sub(r"\.", "dot", params_signature)
        # finally, the output file path
        ofpath = "%s/%s-%s-%s" % (self.m_odpath, self.m_ifname_nosuffix, params_sig, suffix)
        return ofpath

    # write on sys.stdout or into the log file
    # nb: the file handle TextIOWrapper will be closed automatically
    def write(self, msg):
        if self.m_logflag is False:
            print(msg)
        else:
            if self.m_logfile is None:
                self.open()
            self.m_logfile.write(msg + "\n")
class Params_base(Params_IOI):
    def __init__(self, ioi_tuple: IOI_tuple):
        Params_IOI.__init__(self, ioi_tuple)

    def get_params_signature(self):
        return ""

    def get_ofpath(self, suffix):
        return self._get_ofpath(self.get_params_signature(), suffix)

Parallel execution

To enable parallel calculations, we require:

  • a list params_all_runs of instances of the class Params_my_application, each corresponding to one run;
  • a static function performing one run, say run_one_calculation(), returning the result of the execution.

With this the code is as simple as:

def run_one_calculation(params_one_run):
    my_algo = My_algorithm(params_one_run)
    return my_algo.run()  # or return my_algo() if __call__() is implemented

which is invoked with:

pool = multiprocessing.Pool() # Create a multiprocessing Pool
results = pool.map(run_one_calculation, params_all_runs)

Statistical analysis and latex reporting


Statistical analysis. We assume the results of all executions are available as:

  • Case 1: the list of results returned by pool.map(), see the previous section. In that case, the results are directly available in memory.
  • Case 2: a list of files, say xml files, generated by the individual runs. In that case, their content is easily parsed using xml tools, see the package PALSE.

In both cases, it is straightforward to compute statistics, plots, etc., and dump them properly into files whose paths are provided by any instance of params_one_run – via inheritance from the class Params_IOI.
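As a sketch of the file-based case, assuming each run dumped an XML file containing a score – in-memory strings stand in for the files, and the standard xml.etree module for the PALSE tools:

```python
import statistics
import xml.etree.ElementTree as ET

# illustrative: each run is assumed to have dumped an XML file with a <score>
# tag; in-memory strings stand in for the files generated by the runs
xml_outputs = ["<run><score>1.0</score></run>",
               "<run><score>3.0</score></run>"]

# parse each document and collect the scores of all runs
scores = [float(ET.fromstring(doc).find("score").text) for doc in xml_outputs]
mean_score = statistics.mean(scores)
```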


Tex reporting. Assume the execution of My_application has generated a list of tex files. One can simply invoke the function figs_in_tabular() from the class SBL::SBL_pytools::SBL_pytools_latex to generate, for example, complex inclusions of figures within tabular environments.

Examples

As illustrations, the reader may consult:

This package. The script $SBL_DIR/Applications/Script_design/scripts/sbl-scriptdesign-test.py is a simple script illustrating the previous machinery. It does the following:

  • Takes as input a list of patterns, e.g. "class" "import" "static", and searches for them in a list of files with suffixes in {".txt", ".hpp", ".cpp", ".py"}.
  • Stores the number of occurrences of each pattern in each file in a dictionary.
  • Serializes the dictionary as xml.
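These three steps can be sketched as follows – count_patterns() and to_xml() are illustrative helpers, not the actual script's functions:

```python
import xml.etree.ElementTree as ET

def count_patterns(text, patterns):
    # count the non-overlapping occurrences of each pattern in the text
    return {p: text.count(p) for p in patterns}

def to_xml(counts):
    # serialize the dictionary as a small XML document
    root = ET.Element("occurrences")
    for pattern, n in counts.items():
        ET.SubElement(root, "pattern", name=pattern).text = str(n)
    return ET.tostring(root, encoding="unicode")

counts = count_patterns("import os\nimport re\nclass A: pass\n",
                        ["class", "import"])
doc = to_xml(counts)
```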

Here is the self-execution – the script processing its own package directory – and its output:

sbl-scriptdesign-test.py --idpath $SBL_DIR/Applications/Script_design --patterns class import elif
Stats analysis for  sbl-scriptdesign-test.py {'class': 4, 'import': 8, 'elif': 1}
...
Stats analysis for  Params_base.py {'class': 29, 'import': 7, 'elif': 0}

Other packages. $SBL_DIR/Core/Cluster_ksubspace/scripts/sbl-sc-model.py : a script to fit so-called spherical clusters.

$SBL_DIR/Applications/PDB_utilities/scripts/sbl-alphafold-dbrun.py : a script to investigate properties of AlphaFold predictions.