Structural Bioinformatics Library
Template C++ / Python API for developing structural bioinformatics applications.
End-User: Tutorial

Tutorial guiding end-users through the use of the SBL programs.

Introduction

This tutorial aims to explain how to use the programs developed within SBL. It is organized as follows:

  • the section Running Programs from the SBL covers the basics of running programs, from the location of the programs to the meaning of the options common to all programs.
  • the section Visualization presents the visualization tools provided to inspect the results of calculations.

Note that it is assumed that the SBL library was installed as specified in the Installation Guide.

Running Programs from the SBL

In the following, we describe where the SBL programs are located and the basic options common to all of them.

Programs: Location

If the CMake variable SBL_APPLICATIONS was set to ON during the installation, all the SBL programs have been compiled and copied to the target bin directory. If no target was specified, the standard path is assumed (e.g., /usr/bin), so that the SBL programs are automatically on your path. Otherwise, add the target bin directory to the environment variable PATH.
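To check that the installed programs are reachable, one can scan the directories listed in PATH for executables whose name starts with sbl-. The following is a minimal sketch using only the Python standard library; the function name find_sbl_programs is ours and is not part of the SBL.

```python
import os


def find_sbl_programs(path=None):
    """Return the sorted names of executables starting with 'sbl-' found on PATH.

    `path` is a PATH-like string; by default the PATH environment variable is used.
    """
    if path is None:
        path = os.environ.get("PATH", "")
    found = set()
    for directory in path.split(os.pathsep):
        if not directory or not os.path.isdir(directory):
            continue
        try:
            names = os.listdir(directory)
        except OSError:
            continue  # unreadable directory: skip it
        for name in names:
            full = os.path.join(directory, name)
            if name.startswith("sbl-") and os.path.isfile(full) and os.access(full, os.X_OK):
                found.add(name)
    return sorted(found)


if __name__ == "__main__":
    programs = find_sbl_programs()
    if programs:
        print("\n".join(programs))
    else:
        print("No sbl-* program found: add the target bin directory to PATH.")
```

An empty result suggests that the target bin directory is missing from PATH.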

In the SBL, recall that an application is a package for end-users, providing programs that solve a specific biophysical problem, using inputs of different kinds yet carrying the same semantics.

Programs: Naming Convention

Program names in the SBL consist of three parts:

  • First, the prefix sbl- makes it easy to locate all programs in a shell using tab completion.
  • Second, the name after sbl- corresponds to the generic name of the application: all the programs starting with this same prefix solve the same problem. For example, sbl-vorlume-txt and sbl-vorlume-pdb both compute the surface area and the volume of the input.
  • Third, the suffix indicates the C++ model (intuitively: the data type, see Models for more details) the program is made for: for example, -txt for a list of 3D balls in a plain text file, or -pdb for a molecule in a PDB file. As a further example, the program sbl-intervor-ABW-atomic computes the interfaces between two partners mediated by water molecules (ABW), where the input is a family of atoms (atomic), while the program sbl-intervor-IGAgW-atomic computes the interfaces found in antibody-antigen complexes mediated by water molecules (IGAgW), again with a family of atoms as input (atomic).
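The naming convention above can be illustrated by splitting a program name into its three parts. This is only a sketch: it assumes that the model suffix is the last hyphen-separated token and that everything between sbl- and that suffix names the application. The helper split_program_name is hypothetical and not provided by the SBL.

```python
def split_program_name(name):
    """Split 'sbl-<application>-<model>' into (prefix, application, model).

    Assumption: the model suffix is the last hyphen-separated token.
    """
    parts = name.split("-")
    if len(parts) < 3 or parts[0] != "sbl":
        raise ValueError(f"not an SBL-style program name: {name!r}")
    return parts[0], "-".join(parts[1:-1]), parts[-1]


if __name__ == "__main__":
    for program in ("sbl-vorlume-txt", "sbl-vorlume-pdb"):
        prefix, application, model = split_program_name(program)
        print(f"{program}: application={application}, model={model}")
```

Both names share the application part, which reflects the fact that the two programs solve the same problem on different input types.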

Programs: General Options

All the programs use a workflow which consists of interconnected Modules. These modules provide command-line options.

The first and possibly most important option is --help, which prints the list of all options of the program on the standard output. All the options are grouped by module, with two exceptions:

  • General Options: lists all the options that are not contextual, and that are also common to all programs,
  • Optional Modules: lists tags for running modules that are optional, if any (an optional module will not be run by default).

We now describe the general options common to all programs:

  • --config-file : all options available for a given program can be specified in a text file whose format is as follows:
option-name-1 = option_value_1
option-name-2 = option_value_2
...
  • --workflow : a tag triggering the dump of the workflow into a file in .dot format. Once fed to the Graphviz software, as explained in the section Graphviz (Graph visualization), this file yields an image of the workflow graph of the program. For example, the JPEG image file example_workflow.jpg is generated from the workflow dot file of the program sbl-vorlume-pdb as follows (see Fig. fig-example-vorlume-workflow for the output graph):
> dot -Tjpg sbl-vorlume-pdb__workflow.dot -o example_workflow.jpg
  • --log : a tag for redirecting the standard output of the program into a plain text file. The name of this file always ends with __log.
  • --verbose : a tag to obtain various pieces of information (high-level statistics associated with the different calculations, details on the run of the program, in particular the sequence of modules called and the time spent in each of them, etc.). Usually, if verbose is off, the program is silent.
  • --colored-log : a tag for coloring the log in the standard output, making it much more readable. Note that the coloring only works on the standard output (e.g., it does not work with the option --log). The different steps of the application are shown in red, the statistics in green, and the report for the various steps in blue.
  • --directory-output </path/to/output/directory> : an option indicating the directory hosting the output files of the program. By default, it is the current directory. Note that the given path must correspond to an existing directory.
  • --output-prefix [prefix] : an option for prefixing all the output files in a personalized and informative manner. If not specified, all the output files are prefixed only by the name of the program. However, when the same program is run multiple times, it may be necessary to identify the run a file is associated with. When the option --output-prefix is used without any argument, a prefix composed of the name of the program together with the options used is automatically generated. If a prefix is given, all the output files are prefixed only by this prefix. Note that if the given prefix is "IMPLICIT", the effect is the same as using the option without any argument.
  • --uid : the option --output-prefix may not be enough to specify a unique prefix for the files. For example, if a program has to be run several times with the exact same options (e.g., to assess the incidence of randomness), the output files will be overwritten. This option triggers the addition of a suffix that uniquely identifies a run of a program. It is composed of the date and the time (at millisecond resolution) of the execution of the program.
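As an illustration of the general options, the sketch below formats a configuration file in the name = value layout described above and assembles a command line. Several points are assumptions rather than documented behavior: that options are spelled with two leading hyphens, that names in the configuration file are written without them, and that the configuration file option takes the file name as its argument. The helpers format_config and build_command are ours, and nothing is executed.

```python
import shlex


def format_config(options):
    """Return the text of a configuration file, one 'name = value' line per option."""
    return "".join(f"{name} = {value}\n" for name, value in options.items())


def build_command(program, config_file=None, tags=()):
    """Assemble an argument list; `tags` are argument-less options such as 'verbose'."""
    command = [program]
    if config_file is not None:
        # Assumption: the option expects the path of the configuration file.
        command += ["--config-file", config_file]
    command += [f"--{tag}" for tag in tags]
    return command


if __name__ == "__main__":
    # Option names taken from the list above; the values are made up.
    print(format_config({"directory-output": "results", "output-prefix": "run1"}), end="")
    print(shlex.join(build_command("sbl-vorlume-pdb", "run.cfg", tags=("verbose", "log"))))
```

Running the program with --help remains the reference for the exact spelling and arguments of each option.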

(Bio-)molecules: data and format

Biomolecules: PDB and mmCIF formats


PDB format. Quoting the PDB, « the PDB format provides a standard representation for macromolecular structure data derived from X-ray diffraction and NMR studies » .

See full details in the Protein Data Bank Contents Guide, or an executive summary to deal with coordinates.

One difficulty with PDB files is the lack of topological information, that is, the chemical bonds between the atoms whose coordinates are listed. For standard amino acids in proteins, the connectivity is naturally known from the chemical structure of each side chain. Non-standard connectivity information is provided in the CONECT records of the PDB file.

PDB files from simulations sometimes merely contain the list of atoms. The PDB parser provided in the SBL, which uses libcifpp, requires a one-line PDB header such as:
HEADER    DNA BINDING PROTEIN/DNA                 30-APR-01   1IJW   
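For files lacking such a line, a header can be prepended before parsing. The sketch below formats a HEADER record using the fixed columns of the PDB format (classification in columns 11-50, deposition date in columns 51-59, identifier in columns 63-66). The helper names and the placeholder default values are ours; whether the parser accepts placeholder values has not been checked here.

```python
def make_header(classification, date, id_code):
    """Format a PDB HEADER record using the fixed-column layout."""
    return f"HEADER    {classification[:40]:<40}{date[:9]:<9}   {id_code[:4]:<4}"


def ensure_header(pdb_text, classification="UNKNOWN", date="01-JAN-00", id_code="XXXX"):
    """Return pdb_text unchanged if it starts with a HEADER record, else prepend one.

    The default values are placeholders chosen for illustration only.
    """
    if pdb_text.startswith("HEADER"):
        return pdb_text
    return make_header(classification, date, id_code) + "\n" + pdb_text


if __name__ == "__main__":
    print(make_header("DNA BINDING PROTEIN/DNA", "30-APR-01", "1IJW"))
```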



Macromolecular Crystallographic Information File (mmCIF) format. Quoting the mmCIF Wikipedia page, « mmCIF was designed to address limitations of the PDB format in terms of capacity and flexibility, especially with the increasing size and complexity of macromolecular structures being determined. »

See also Beginner’s Guide to PDB Structures and the PDBx/mmCIF Format.

The SBL uses libcifpp to parse PDB and mmCIF files.

Beyond the PDB format

There is a vast array of molecular formats, and a reference conversion tool is Open Babel.

The focus of the SBL being on biomolecules rather than chemicals, the MOL format is used despite a number of shortcomings.

A precise specification of MOL can be found here.
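In contrast with PDB files, a MOL file stores the bonds explicitly. The sketch below reads the atom and bond blocks of a small MOL file in the common V2000 flavor, using its fixed-column layout: the fourth line holds the atom and bond counts, followed by one line per atom and one line per bond. The reader is a minimal illustration written for this tutorial, not the parser used by the SBL, and the water coordinates are approximate.

```python
def atom_line(x, y, z, symbol):
    """Format one V2000 atom line: three 10.4 coordinates, then the element symbol."""
    return f"{x:10.4f}{y:10.4f}{z:10.4f} {symbol:<3} 0" + "  0" * 11


EXAMPLE_MOL = "\n".join([
    "water (illustrative coordinates)",        # line 1: title
    "  hand-written example",                   # line 2: program / timestamp
    "",                                         # line 3: comment
    "  3  2  0  0  0  0  0  0  0  0999 V2000",  # line 4: counts line
    atom_line(0.0, 0.0, 0.0, "O"),
    atom_line(0.9572, 0.0, 0.0, "H"),
    atom_line(-0.24, 0.9266, 0.0, "H"),
    "  1  2  1  0",                             # bond: atom 1, atom 2, order 1
    "  1  3  1  0",
    "M  END",
])


def read_mol_v2000(text):
    """Return (atoms, bonds) with atoms as (symbol, x, y, z) and bonds as (i, j, order)."""
    lines = text.splitlines()
    n_atoms, n_bonds = int(lines[3][0:3]), int(lines[3][3:6])
    atoms = [
        (line[31:34].strip(), float(line[0:10]), float(line[10:20]), float(line[20:30]))
        for line in lines[4:4 + n_atoms]
    ]
    bonds = [
        (int(line[0:3]), int(line[3:6]), int(line[6:9]))
        for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]
    ]
    return atoms, bonds


if __name__ == "__main__":
    atoms, bonds = read_mol_v2000(EXAMPLE_MOL)
    print(atoms)
    print(bonds)
```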

Creating molecules with the PyMOL builder. A very convenient strategy to create molecules consists in using the builder from PyMOL.

Creating molecules using JSME and Jmol. Another simple strategy to create 3D molecular models of simple molecules operates in two steps. (One can for example follow the following tutorial.)

  • First, the JSME editor can be used to design the topology of the molecule. A 3D model can then be exported in MOL file format (mouse right click).

  • Second, the model can be copy-pasted into a 3D molecular viewer / editor such as Jmol and edited there. Within Jmol, hydrogen atoms can be added and the structure minimized, and the result finally exported into the MOL format.

    Note also that the File > Get Mol functionality of Jmol allows one to pull molecules, as retrieved by the PubChem structure search engine or as resolved by the NCI/NIH chemical structure resolver. As an example, one may download aspirin or caffeine.

Post-processing and Analysis of Results

Most of the programs from the Applications produce XML files. This section shows how to process these files.

The XML output files of the SBL programs can be parsed easily using the package PALSE, which provides Python functions to load XML files having the same hierarchy of tags and to collect selected data, with the possibility to use some graphical and/or statistical tools (2D plots, histograms, etc.). Using PALSE reduces a possibly long Python script for analyzing the results to a few lines.

For a detailed description of the package and a presentation of the main functionalities, the reader is referred to the user manual of PALSE.

Getting the common hierarchy of elements in a dataset of XML files

The output XML files of the SBL programs may have complicated hierarchies and/or large numbers of tags. For example, the program sbl-vorlume-pdb outputs an XML file listing the volume on a per-atom basis: for a molecule with thousands of atoms, each atom being annotated with tens of properties, the output XML file is heavy to read.

PALSE offers a simple utility for summarizing the common hierarchy of a dataset of XML files. More precisely, given a dataset of XML files, PALSE loads all the XML files, computes for each XML file a tree representing its hierarchy, and computes the largest common subtree of all constructed trees. Two methods give access to this common hierarchy in two different ways:

  • PALSE::PALSE_xml_DB::get_common_hierarchy_as_xml : this method will return a string that represents the common hierarchy as an XML file,
  • PALSE::PALSE_xml_DB::get_common_hierarchy_as_list : this method will return a string that lists all the possible paths from the root tag in the common hierarchy, with one path per line.

Hence, visualizing the common hierarchy is simply done by printing the output of one of these methods.
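To make the notion of common hierarchy concrete, the sketch below computes the tag paths shared by several XML documents with the standard module xml.etree.ElementTree. It is a simplified stand-in for the idea, not the PALSE implementation: it intersects sets of root-to-tag paths, and the tag names in the example are made up.

```python
import xml.etree.ElementTree as ET


def tag_paths(root):
    """Return the set of tag paths (root/child/...) present in one XML tree."""
    paths = set()
    stack = [(root, root.tag)]
    while stack:
        node, path = stack.pop()
        paths.add(path)
        for child in node:
            stack.append((child, path + "/" + child.tag))
    return paths


def common_paths(xml_documents):
    """Return the sorted tag paths present in every document (given as strings)."""
    sets = [tag_paths(ET.fromstring(document)) for document in xml_documents]
    return sorted(set.intersection(*sets)) if sets else []


if __name__ == "__main__":
    # Hypothetical documents: the tags do not come from an actual SBL output.
    first = "<run><atom><volume>1.0</volume><area>2.0</area></atom></run>"
    second = "<run><atom><volume>3.0</volume></atom><timing/></run>"
    print("\n".join(common_paths([first, second])))
```

Printing one path per line mirrors the list-style view of the common hierarchy described above.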

Analyzing a dataset of XML files

Once the hierarchy is obtained, the dataset of XML files can be analyzed. PALSE offers a simple three-step methodology:

  • loading the dataset of XML files using the method PALSE::PALSE_xml_DB::load_from_directory ,
  • finding the information in the common hierarchy using all the accessors provided by the class PALSE::PALSE_xml_DB ,
  • analyzing the information using statistics and graph tools provided by the class PALSE::PALSE_statistic_handle .
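The three steps can be mimicked with the Python standard library, which may help to see what each step does before turning to PALSE. The functions below are stand-ins written for this tutorial and do not reproduce the PALSE API; the tag names atom and volume are hypothetical.

```python
import glob
import os
import statistics
import tempfile
import xml.etree.ElementTree as ET


def load_xml_directory(directory):
    """Step 1: parse every *.xml file of a directory, in sorted file-name order."""
    pattern = os.path.join(directory, "*.xml")
    return [ET.parse(path).getroot() for path in sorted(glob.glob(pattern))]


def collect_floats(roots, path):
    """Step 2: gather as floats the text of all elements matching `path` under each root."""
    return [float(node.text) for root in roots for node in root.findall(path)]


def summarize(values):
    """Step 3: compute a few summary statistics."""
    return {"count": len(values), "mean": statistics.mean(values),
            "min": min(values), "max": max(values)}


if __name__ == "__main__":
    # Build a tiny throwaway dataset with made-up tags and values.
    with tempfile.TemporaryDirectory() as directory:
        for name, volumes in (("a.xml", (1.0, 3.0)), ("b.xml", (2.0,))):
            atoms = "".join(f"<atom><volume>{v}</volume></atom>" for v in volumes)
            with open(os.path.join(directory, name), "w") as handle:
                handle.write(f"<run>{atoms}</run>")
        roots = load_xml_directory(directory)
        print(summarize(collect_floats(roots, "atom/volume")))
```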

Visualization

Most of the programs from the Applications produce visualization files for standard software. In the following, we comment on various visualization facilities:

Molecular Visualization Software

Generalities

Visualization software is used to inspect various output results, be they collections of atoms extracted from PDB files or geometric constructions related to the particular problem being solved (molecular interfaces, binding patches, coarse-grain approximations, etc.). When a calculation generates an output to be visualized, an option containing the keyword viewer is available. Such an option can be set to none (no output), vmd (VMD) or pymol (PyMOL). Since visualization files may be heavy, these options are set to none by default. If one of the aforementioned systems is selected, a visualization file in the corresponding format is produced when reporting the results.

When visualizing the results with visualization software, two fundamentally different kinds of objects may be seen:

  • the selection of atoms of the input molecule(s), requiring a preloaded molecule in the visualization software from which the selection is performed,
  • the geometric objects created during the run of the program, requiring that the visualization software creates new objects rather than selecting existing ones. As a simple example, one may consider the Voronoi interface in a complex, which consists of polygons separating the partners (see the application Space_filling_model_interface).

While the selections are dynamic (once a selection is loaded, the user can modify it), the geometric objects are static (once loaded, they cannot be modified).

VMD (Visual Molecular Dynamics)

VMD is a molecular visualization program for displaying molecules using 3-D graphics. In addition to the visualization files in VMD format, the SBL also provides a number of plugins for VMD that can be automatically installed as specified in section Installing VMD plugins. Two important plugins are:

  • Fast Load : a plugin to load visualization state files faster than with the default load option. This is mandatory for the VMD files created by programs of the SBL, which typically contain detailed geometric descriptors.
  • Atomic Group Radii : a plugin to modify the radii of the atoms of the current VMD molecule. To see why, recall that the SBL uses specific radii for the atoms, in particular so-called group radii. If no particular set of group radii is specified, the default one of the SBL is used, which may differ from the one used in VMD (see Atomic Radii and Group Radii for more details on the group radii in the SBL). It is also possible to add a constant value to all radii, e.g., 1.4 Å for the SAS model.
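The effect of adding a constant to all radii is easy to reproduce outside VMD. The sketch below expands a table of radii by a probe radius of 1.4 Å, as in the SAS model. The radii in the example are generic van der Waals values chosen for illustration; they are not the group radii of the SBL.

```python
def expand_radii(radii, probe=1.4):
    """Return a new table in which every radius (in angstroms) is increased by `probe`."""
    return {name: radius + probe for name, radius in radii.items()}


if __name__ == "__main__":
    # Illustrative values only, not the SBL group radii.
    example_radii = {"C": 1.70, "N": 1.55, "O": 1.52}
    for name, radius in expand_radii(example_radii).items():
        print(f"{name}: {radius:.2f}")
```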

PyMOL (Python Based Molecular Visualization System)

The program PyMOL is a molecular visualization system on an open-source foundation. See the PyMOL user manuals for more details. The visualization files for PyMOL are Python scripts that can be run directly from PyMOL.

Graphviz (Graph visualization)

Graphviz is a graph visualization software especially useful to plot various types of diagrams. Input files follow the dot file format, and different programs generate pictures of the graphs, such as dot (for a hierarchical embedding), circo (for a circular embedding), etc. The dot file format is used in our Applications when an output is a graph, or for visualizing the workflow of an application.
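Since the Graphviz layout programs share the same command-line interface, switching from a hierarchical to a circular embedding only changes the program name. The sketch below builds the command for a given engine and runs it only if that engine is found on the path. The helper names are ours, and the file names reuse the workflow example of the section on general options.

```python
import shutil
import subprocess


def graphviz_command(dot_file, image_file, engine="dot", fmt="jpg"):
    """Return the argument list rendering `dot_file` into `image_file`."""
    return [engine, f"-T{fmt}", dot_file, "-o", image_file]


def render(dot_file, image_file, engine="dot", fmt="jpg"):
    """Run the layout engine if it is installed; return True if it was run."""
    if shutil.which(engine) is None:
        return False
    subprocess.run(graphviz_command(dot_file, image_file, engine, fmt), check=True)
    return True


if __name__ == "__main__":
    for engine in ("dot", "circo"):
        command = graphviz_command("sbl-vorlume-pdb__workflow.dot",
                                   f"example_workflow_{engine}.jpg", engine)
        print(" ".join(command))
```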