Structural Bioinformatics Library Template C++ / Python API for developping structural bioinformatics applications.
User Manual

Authors: F. Cazals and T. Dreyfus and A. Lheritier and N. Malod-Dognin

# Goals: Alleviating Data Parsing and Analysis

## On Data Perenniality, Availability, and Analysis

A tenet of Science is the ability to reproduce the results, and a related issue is the possibility to archive and interpret the raw results of (computer) experiments. This package presents an elementary python framework focusing on the archival and the analysis of raw data, with three goals in mind:

• Raw data perenniality. Make the raw data perennial by ensuring that anyone can make sense out of them even once decoupled from the program that generated them. This is non trivial since the program and scripts / other programs parsing the output need to co-evolve.
• Raw data availability. Make the perennial raw data accessible, to allow novel analysis by scientists equipped with different methodological tools.
• Raw data parsing and analysis. Ease the parsing of the raw data, so as to prepare the graphical and statistical analysis.

To see how these goals are met, consider a computing pipeline consisting of raw data generation, raw data parsing, and data analysis i.e. graphical and statistical analysis. PALSE addresses these last two steps by leveraging the hierarchical structure of XML documents.

More precisely, assume that the raw results of a program are stored in XML format, possibly generated by the serialization mechanism of the boost C++ libraries.

For raw data parsing, PALSE imports the raw data as XML documents, and exploits the tree structure of the XML together with the XML Path Language to access and select specific values. For graphical and statistical analysis, PALSE gives direct access to ScientificPython, R, and gnuplot.

In a nutshell, PALSE combines standards languages (python, XML, XML Path Language) and tools (Boost serialization, ScientificPython, R, gnuplot) in such a way that once the raw data have been generated, graphical plots and statistical analysis just require a handful of lines of python code.

The framework applies to virtually any type of data, and may find a broad class of applications.

# Prerequisites

## The XML Hierarchical Representation

XML provides a hierarchical representation of a document, which may be seen as a rooted ordered tree of nodes called Elements – see Fig. fig-palse-xml-tree. Each Element is characterized by a tag-name, and possibly contains attributes, a text field and a number of child Elements — there is also an optional tail string which is not supported by PALSE):

The tag-name of an Element corresponds to the name of the start-tag (<tag-name>) and the end-tag (</tag-name>) of in its XML representation.

An attribute of an Element is a variable having a particular value defined in its XML representation with the start-tag of (<tag-name\ att='value'>).

The text field of an Element is defined as any text in-between the start-tag and the end-tag of in its XML representation, which is not embedded in-between another (start-tag, end-tag) pair (<tag-name> text-field </tag-name>).

The children of are the Elements which are defined in its XML representation in-between the start-tag and the end-tag of (<tag-name> <child-tag-name>\ </child-tag-name> </tag-name>).

In the sequel, by data value, we refer either to the value of an attribute or of a text field.

An Element may be also embedded in a namespace, that is represented by a pair (prefix, uri). The uri is a string of characters identifying a name or a resource, and the prefix is a simple name associated to the uri. Thus, if an Element having a tag-name tag is embedded in a namespace (prefix, uri), its tag-name is represented in the XML file by prefix:tag.

However, in the XML tree, all prefixes are replaced by their corresponding uri, so that the tag name of the Element is represented in the XML tree by {uri}tag.

The XML trees handled by PALSE Note that a tree $T$ is recursively defined from Elements

## The XPath Query Language

Given an XML document, the XPath language has been developed to address specific parts of this document – see the XPath documentation page. Recall that an XML document is represented as a rooted ordered tree: a XPath query always specifies a path to one or more of its Element(s).

An XPath is specified with sequences of tags such as to select elements with the name that are children of elements with the name . This sequence can be enriched to target specific elements in the XML document (see the python documentation page and XPath documentation page for a full specification).

To define relative searches, let the contextual element be the current element in a path:

• refers to elements named that are children of the contextual element (Nb: one descends one level.)
• refers to elements named that are children of any children of the contextual element (Nb: one descends two levels.);

The following are absolute searches i.e. start the search at the root:

• refers to elements named that are children of the root element;
• refers to elements named that are descendants of elements named ;
• refers to the attribute named of any element;
• refers to all elements named having an attribute named ;
• refers to all elements named having an attribute named with a value ;
For example, gets all elements named that are descendants of the current element, and that have an attribute

PALSE assumes that a given (computer) experiment yields one file storing the results. To make these data perennial and foster their availability, PALSE uses XML, since this language is a standard one, and most importantly, provides an abstract structure amenable to high-level querying and filtering operations.

# Using PALSE : example

In this section, we present a short example in which one computes various statistics from a Database storing surface areas and volumes of atoms in molecules, computed using tools from the package Space_filling_model_surface_volume.

## Representation used

In PALSE, each result file is loaded into an XML ETree, so that one gather data by traversing paths in this tree.

Consider for example a case where one has generated an XML file to describe the volume of the restrictions of the atoms to their Voronoi cells. First, to inspect the hierarchy embedded in an XML file, one can use the method SBL::PALSE::PALSE_xml_DB::get_common_hierarchy_as_xml for generating a string representation of this hierarchy:

<boost_serialization>
<restrictions>
<count></count>
<item>
<hierarchical_archive_index></hierarchical_archive_index>
<total_area></total_area>
<is_buried></is_buried>
<particle>
<insertion_code></insertion_code>
<atom_serial_number></atom_serial_number>
<residue_name></residue_name>
<atom_name></atom_name>
<chain_identifier></chain_identifier>
<residue_sequence_number></residue_sequence_number>
<annotations>
<dynamic_annotations>
<count></count>
</dynamic_annotations>
<name></name>
</annotations>
</particle>
<total_volume></total_volume>
</item>
</restrictions>
<exposed_particles></exposed_particles>
<total_area></total_area>
<total_volume></total_volume>
<total_particles></total_particles>
<buried_particles></buried_particles>
</boost_serialization>
Note that a database has to be created first, even with only one XML file. If the database contains several XML files, only the common hierarchy from the root of the XML files will be shown.

It is also possible to access to the list of all paths from the root of XML files rather than the XML format, using the method SBL::PALSE::PALSE_xml_DB::get_common_hierarchy_as_list :

/boost_serialization
/boost_serialization/restrictions
/boost_serialization/restrictions/count
/boost_serialization/restrictions/item
/boost_serialization/restrictions/item/hierarchical_archive_index
/boost_serialization/restrictions/item/total_area
/boost_serialization/restrictions/item/is_buried
/boost_serialization/restrictions/item/particle
/boost_serialization/restrictions/item/particle/insertion_code
/boost_serialization/restrictions/item/particle/atom_serial_number
/boost_serialization/restrictions/item/particle/residue_name
/boost_serialization/restrictions/item/particle/atom_name
/boost_serialization/restrictions/item/particle/chain_identifier
/boost_serialization/restrictions/item/particle/residue_sequence_number
/boost_serialization/restrictions/item/particle/annotations
/boost_serialization/restrictions/item/particle/annotations/dynamic_annotations
/boost_serialization/restrictions/item/particle/annotations/dynamic_annotations/count
/boost_serialization/restrictions/item/particle/annotations/name
/boost_serialization/restrictions/item/total_volume
/boost_serialization/exposed_particles
/boost_serialization/total_area
/boost_serialization/total_volume
/boost_serialization/total_particles
/boost_serialization/buried_particles

## Values versus elements

The previous also shows that values can be accessed in two ways from an XML file, that is:

• Directly via a specification of the tag leading to the values
• In two steps via (1) the specification of the tag leading to elements, and (2) the extraction of the desired values from the elements.

This duality is especially relevant for serialized containers such as those from the STD library. On the previous example, one gets:

# Accessing values directly
tmp = database.get_all_data_values_from_database("restrictions/item/total_volume")
volumes = PALSE_DS_manipulator.convert_listoflists_to_list( tmp )
# Accessing elements, and then values within elements
elements = PALSE_DS_manipulator.convert_listoflists_to_list(database.get_all_elements_from_database("restrictions/item"))
volumes = []
for el in elements:
volumes.append( PALSE_DS_manipulator.convert_listoflists_to_list(database.get_all_data_values_from_elements([el], "total_volume")) )

# Algorithms and Methods

This section describes more precisely the five different steps used in the PALSE methodology. These five steps are summarized on Fig. fig-palse-workflow.

The PALSE workflow The five steps: (1) the native data types of the application(s) are first serialized into XML archives, stored into XML files. It is assumed that one file is generated for each (computer) experiment (2) \palse\ loads these XML files into databases of XML trees (3) Raw data are extracted from the databases using XPath queries (4) these data are prepared (sorted, filtered, etc) (5) the manipulated data are used to perform various data analyses (statistics, etc).

## Step 1: Raw Data Generation

By raw data we refer to the results of some (computer) experiment—PALSE can be used to handle data generated by any device, or even manually archived. Archives and their (de-)construction is a tenet of PALSE following the Boost library serialization engines. As specified in the Boost documentation,

we use the term "serialization" to mean the reversible deconstruction of an arbitrary set of C++ data structures to a sequence of bytes. Such a system can be used to reconstitute an equivalent structure in another program context.

That is, an archive refers to a specific rendering of the aforementioned sequence of byte, and PALSE takes as input XML archives.

For computer experiments, the generation of XML data naturally depends on the programming language used, two of them being of particular interest: C++ and python.

In C++, the Boost Serialization tools accommodate the native C++ types and the data structures of the Standard Template Library.

In python, the recursive structure of dictionaries makes conversion of dictionary into an XML tree (and vice-versa) a trivial task.

## Step 2: Raw Data Parsing and Database Creation

Consider a set of XML archives, one per (computer) experiment. PALSE imports each such file as an XML tree. A database corresponds to a set of isomorphic trees corresponding to experiments generated with varying parameters. Therefore, several databases may be used to accommodate several sets of parameters, or to store results from different sources. The creation of the trees from the files is delegated to the lxml library (lxml is an easy-to-use library for handling XML and HTML in Python, see the documentation page.

## Step 3: Raw Data Querying

Given the XML trees stored in database(s), the retrieval of the values of interest for graphical and statistical analysis combines features of the XML and XPath, as discussed in section The XPath Query Language .

Functions offered in PALSE . The three XPath modes discussed in section The XPath Query Language are used by PALSE functions to query the database, in order to retrieve selected Elements or their data values (either from text fields or attributes):

For data values, if the XPath query ends with a tag, it is a text field. Otherwise, it ends with an attribute and the data value is the value of this attribute.

While queries on data values return directly a list that can be treated with simple statistical tools, queries on Elements return a list that can be used to filter even more the Elements. Given a list of Elements , a XPath query , a data value and a comparator , the method SBL::PALSE::PALSE_xml_DB::filter_elements_by_data_values_compare_to ( , , , ) selects the Elements of such that the specified Xpath from has a data value positively compared with using , a functionality also provided for basic comparators (e.g, SBL::PALSE::PALSE_xml_DB::filter_elements_by_data_values_lower_than ( , , )).

We also note that upon calling a function returning Elements, the selection can be converted into strings thanks to the function SBL::PALSE::PALSE_xml_DB::get_data_values_from_elements ( , ), with and the arguments described above.

PALSE using BioPython : BioPALSE. PALSE offers also functionality handling PDB files treated by the BioPython module. There are three main functionality allowing to convert the PDB files into XML trees:

• SBL::PALSE::BioPDB_vs_XML_Etree::load_PDB_files_from_directory (input_dir, regex = ""): it returns a database of XML trees corresponding to the list of PDB files in the input directory that matches the input regular expression. Note that if no regular expression is given, all the PDB files in the input directory are treated.

## Step 4: Data Manipulation

The data collected by the previous step typically need to undergo processing before being amenable to analysis. Since the previous step supplies python lists of strings, PALSE provides mechanisms to select, sort, and convert these lists into lists of elementary types. Furthermore, PALSE provides tools for combining lists into dictionaries, for further filtering using regular expressions (for strings) or lists of allowed keys or values.

## Step 5: Data Analysis

The lists just produced are ready for graphical and statistical analysis. PALSE encapsulates functionalities from SciPython and R for computing Pearson, Spearman or Kendall's tau correlations between the collected values, as well as Mann-Whitney rank-sum test. More generally, any package interface with python can be used, e.g. gmpy if unlimited-precision integers / rationals / floats must be used to ensure numerical correctness, or gnuplot to generate eye-candy plots and charts, etc.

section sec-code Programming
Here is a small example of how to serialize a simple data structure into an archive:
//nvp for name-value-pair: allows to attribute a name to a
//serializable value. In this context, it is used for decomposing the
//serialization of a data structure
#include <boost/serialization/nvp.hpp>

//type of the archive file of the serialization: here, it is an output
//xml file (we will save the serialization in this file)
#include <boost/archive/xml_oarchive.hpp>

#include <fstream>
#include <iostream>

//Small data structure that we want to serialize
struct Data_type
{
int index;
};

//Definition of the serialize method for our small data structure. It
//takes an archive and load or save the serializable fields of the data
//structure, as described in the method. The symbol "&" means that the
namespace boost
{
namespace serialization
{
template <class Archive>
void serialize(Archive& ar, Data_type& data, const unsigned int version)
{
ar & boost::serialization::make_nvp("index", data.index);
}
}
}

//The main method creates an object of type Data_type, and then save
//it in a serialized xml archive.
int main()
{

Data_type data; data.index = 1;

//first, open the file, then open the file as an output xml archive.
std::ofstream out("test.xml");
boost::archive::xml_oarchive* ar = new boost::archive::xml_oarchive(out);

//save the data in the archive
*ar << boost::serialization::make_nvp("data", data);

//first close the archive, then close the file (the order is
//important, since some additional information are dumped in the file
//when closing the archive)

delete ar;
out.close();

return 0;
}

# Jupyter demo

For our application, each XML file generated contains, amongst other things, the surface area ( <total_area></total_area>) and the volume ( <total_volume></total_volume>) associated with each atomic restriction.

The attached jupyter notebooks illustrates various features of PALSE to extract the following statistics:

• Average volume of atomic restrictions by increasing expansion radius
• Average volume of atomic restrictions by increasing expansion radius
• Average area of specific residue type
• Average Volume by Atom Property
• Distribution of Residue Volumes

For the details, check out:

• Jupyter notebook file
• PALSE

# PALSE: Data Parsing and Analysis¶

## Run calculations - prepare the data to be analyzed¶

We use the application Space_filling_model_surface_volume, see https://sbl.inria.fr/doc/Space_filling_model_surface_volume-user-manual.html, to compute surface areas and volumes of molecules.

Note that for each input PDB file, we define one space filling model for two different expansion radii:

• r=0 : the usual van der Waals model
• r=1.2 : the usual solvent accessible (SAS) model.

The output files for r=0 and r=1.4 a dumped into two distinct directories. While this is done manually here, the process can be automated using the Batch_manager, see https://sbl.inria.fr/doc/Batch_manager-user-manual.html

In [5]:
import re  #regular expressions
import sys #misc system
import os
import pdb
import shutil # python 3 only

exe  = "sbl-vorlume-pdb.exe"
odir = "results"

idir = "data"
files = ["1avw.pdb", "1brs.pdb"] #, "1dfj.pdb"]

def compute_surfaces_volumes_of_pdb_files():
os.system( ("mkdir %s" % odir))
for ifile in files:
cmd = "%s -f %s/%s --directory %s --verbose --output-prefix --log --no-water --radius %s --report-options" % (exe, idir, ifile, odir,radius)
print
os.system(cmd)

compute_surfaces_volumes_of_pdb_files()


## Average volume of atomic restrictions¶

As a first application of PALSE, we compute the average volume of atomic restrictions for the SAS model. This involves the following steps:

• creating an empty database,
• looking for the volume of each restriction,
• making the average of all volumes.
In [7]:
from SBL import PALSE
from PALSE import *

def ave_volume_atoms():
database = PALSE_xml_DB()
volumes = database.get_all_data_values_from_database("restrictions/item/total_volume", float)
volumes = PALSE_DS_manipulator.convert_listoflists_to_list(volumes)
return sum(volumes)/float(len(volumes))

print("Average volume of atomic restrictions:",   ave_volume_atoms())

XML: 2 / 2 files were loaded

Average volume of atomic restrictions: 23.504931290964475


The following comments are in order:

• the search path does not contain the global tag , since by default, the XML tree is represented by its root.

• If the database is partitioned over several subdirectories, it is possible to recursively load all the XML files using the method load_from_directory(input_dir, regex, True)

• One XML file listing several volumes, and a database possibly containing several XML files, the method get_all_data_values_from_database always returns a list of lists. If one does not want to discriminate in between the different XML files, the method convert_listoflists_to_list returns a unique concatenated list from an input list of lists.

• In the example above, the files involved in the database do not obey any specific ordering. As explained below, the function database.sort_databases can be used to sort these files based on a regular expression acting on their names.

## Average volume of atomic restrictions by increasing expansion radius¶

As a second example, we replicate the previous analysis, distinguishing the two radii used.

This requires two features:

• adding the option --report-options to the run of the SBL executable, allowing to report an XML file listing all the options used for the calculations;

• sorting the database of loaded XML files w.r.t a given option using the method PALSE_xml_DB::sort_databases.

In [8]:
def exple_average_volume_by_increasing_radius():

database = PALSE_xml_DB()
volumes = database.get_all_data_values_from_database("restrictions/item/total_volume", float)
volumes = PALSE_DS_manipulator.convert_listoflists_to_list(volumes)

database = PALSE_xml_DB()
volumes = database.get_all_data_values_from_database("restrictions/item/total_volume", float)
volumes = PALSE_DS_manipulator.convert_listoflists_to_list(volumes)

XML: 2 / 2 files were loaded

XML: 2 / 2 files were loaded



@Tom: we do not sort anymore !???

The following comments are in order:

• The sort is done over strings by default. To change this behaviour, the second positional argument (or the type argument) can be set to any comparable type.

• The XML archives containing all the options of a run of a SBL executable always terminate by "__options.xml". However, if for any reason the files listing the options are terminated differently, the third positional argument (or the options_suffix argument) can be set to any file suffix.

## Average area of specific residue type¶

It is also possible to analyze specific elements amongst the data, such as the volume of a specific residue type, or the volume of an residue identified by its sequence number. To do so, three steps are required :

• gathering all elements in the database at the desired level; for example, if the analysis undertaken focuses on residues, one has to gather all elements corresponding to residues; this is done with the method PALSE_xml_DB::get_all_elements_from_database that returns one list of elements for each dataset in the database;

• filtering elements matching a given requirement; for example, assume that one wishes to select elements corresponding to residues of type "GLY", or only in the chain "A"; this is done through the method PALSE_xml_DB::filter_elements_by_data_values_equal_to that returns a filtered sublist of elements from the input list of elements;

• accessing desired values from the filtered elements; for example, for each filtered residue, one may want to access its identifiers, volume and area; this is done with the method PALSE_xml_DB::get_all_data_values_from_elements that returns a list of values matching the input XPath query for each element in the input list of elements;

The following script shows how to print the average area for all non-buried glycines using the three methods:

In [9]:
def exple_average_area_specific_residue_type():

database = PALSE_xml_DB()
residues = database.get_all_elements_from_database("residues/item")[0]
gly_residues = database.filter_elements_by_data_values_equal_to(residues, "residue_name", "GLY")
areas = database.get_all_data_values_from_elements(gly_residues, "area", float)
areas = [x for x in PALSE_DS_manipulator.convert_listoflists_to_list(areas) if x > 0]
print("Average residue free surface area of GLY with added radius 1.4 A: ", sum(areas)/float(len(areas)))

exple_average_area_specific_residue_type()

XML: 2 / 2 files were loaded

Average residue free surface area of GLY with added radius 1.4 A:  28.34496145848905


## Average volume by atom_property¶

It is also possible to compute the volume statistics of atoms sharing a common property. For example, one wants to know the average volume of carbons depending on the residue containing it. The python script is similar to the one presented in section Example Python Script : Average Volume of Atoms, but one has to collect separately all the carbons of each residue type instead of just collecting all the atoms together. This is done in three steps:

• collecting all the particles in the database,
• collecting all the existing residues from the particles,
• for each such residue, filtering the particles having the same residue name and that are carbons.

This is done with the following python script:

In [10]:
def exple_average_volume_atom_property():

database = PALSE_xml_DB()
particles = database.get_all_elements_from_database("restrictions/item")
particles = PALSE_DS_manipulator.convert_listoflists_to_list(particles)
residue_names = database.get_all_data_values_from_elements(particles, "particle/residue_name")
residue_names = sorted(set(PALSE_DS_manipulator.convert_listoflists_to_list(residue_names)))
for resname in residue_names:
particles_of_residue = database.filter_elements_by_data_values_equal_to(particles, "particle/residue_name", resname)
carbons_of_residue = database.filter_elements_by_data_values_equal_to(particles_of_residue, "particle/atom_name", "C")
volumes = database.get_all_data_values_from_elements(carbons_of_residue, "total_volume", float)
volumes = PALSE_DS_manipulator.convert_listoflists_to_list(volumes)
if len(volumes) > 0:
print("Average volume of carbons in %s" % resname, sum(volumes)/float(len(volumes)))
else:
print("Average volume of carbons in %s" % resname, 0)

exple_average_volume_atom_property()

XML: 2 / 2 files were loaded

Average volume of carbons in ALA 16.368102941176474
Average volume of carbons in ARG 17.742399999999996
Average volume of carbons in ASN 16.992057692307696
Average volume of carbons in ASP 17.447016949152548
Average volume of carbons in CYS 16.6221875
Average volume of carbons in GLN 17.622068181818182
Average volume of carbons in GLU 17.067339285714286
Average volume of carbons in GLY 18.512360465116274
Average volume of carbons in HIS 15.875799999999998
Average volume of carbons in ILE 15.723208333333332
Average volume of carbons in LEU 16.553563218390803
Average volume of carbons in LYS 16.868150000000004
Average volume of carbons in MET 17.078249999999997
Average volume of carbons in PHE 16.27151612903226
Average volume of carbons in PRO 16.87785294117647
Average volume of carbons in SER 18.09426582278481
Average volume of carbons in THR 17.31942857142857
Average volume of carbons in TRP 16.125375000000002
Average volume of carbons in TYR 16.373214285714283
Average volume of carbons in VAL 15.844982456140343


## Distribution of residue volumes¶

PALSE offers also the possibility to plot simple diagrams as 2D plots or histograms. For example, it is possible to plot the distribution of the volumes of the residues. To do so, the following steps are required:

collecting all the particles in the database per input XML file,

for each XML file, (i) collecting all the existing residues ids from the particles, (ii) classifying the particles per residue id, and (iii) summing the volumes of the particles of each residue,

plotting the histogram of the volumes of the residues.



This is done with the following python script:

In [11]:
def distribution_residue_volumes():

database = PALSE_xml_DB()
residue_volumes = []
particles = database.get_all_elements_from_database("restrictions/item")
for particles_per_file in particles:
chain_identifiers = database.get_all_data_values_from_elements(particles_per_file, "particle/chain_identifier")
chain_identifiers = sorted(set(PALSE_DS_manipulator.convert_listoflists_to_list(chain_identifiers)))
for chain in chain_identifiers:
particles_per_chain = database.filter_elements_by_data_values_equal_to(particles_per_file, "particle/chain_identifier", chain)
residue_sequence_numbers = database.get_all_data_values_from_elements(particles_per_chain, "particle/residue_sequence_number", int)
residue_sequence_numbers = sorted(set(PALSE_DS_manipulator.convert_listoflists_to_list(residue_sequence_numbers)))
for i in residue_sequence_numbers:
particles_of_residue = database.filter_elements_by_data_values_equal_to(particles_per_chain, "particle/residue_sequence_number", str(i))
volumes_of_particles = database.get_all_data_values_from_elements(particles_of_residue, "total_volume", float)
volumes_of_particles = PALSE_DS_manipulator.convert_listoflists_to_list(volumes_of_particles)
residue_volumes.append(sum(volumes_of_particles))

PALSE_statistic_handle.hist2d(residue_volumes, "residue_volumes_histogram.ps", 'Volume_of_Residues', 16, 20)
print("Histogram stored in residue_volumes_histogram.ps")

distribution_residue_volumes()

XML: 2 / 2 files were loaded

Histogram stored in residue_volumes_histogram.ps