Structural Bioinformatics Library
Template C++ / Python API for developing structural bioinformatics applications.
Authors: R. Tetley and F. Cazals
When working with protein databases such as UniProt, it is useful to link a given protein to its corresponding species in a taxonomic tree, e.g. that from the NCBI Taxonomy database. We define:
As an example, the EFF1 fusogen protein has accession code G5ECA1, and taxid 6239 which corresponds to Caenorhabditis elegans.
To map proteins (accession codes) to taxa, a key difficulty is to rely on unambiguous information, that is, to avoid redundancies and to determine the species reliably. Taxonomy information from UniProt fails on both counts. For example, different common names (say the Belterra virus and the Icoaraci virus) are used for the same species (both are Rift Valley fever phleboviruses).
Fortunately, hits usually contain an NCBI identifier which links to the NCBI Taxonomy database. This package provides users with the script sbl_database_creator.py to build the corresponding databases locally.
The UniProt Knowledge Base is a database which aims at collecting data on proteins and their functional annotation. In this context, each protein sequence is associated with its so-called accession code, as defined above.
Assume that the XML dump of UniProt has been obtained. The script sbl_database_creator.py described below instantiates the following fields in a SQLite table:
The class SBL::DB_manipulators::Uniprot_wrapper is provided in the DB_manipulators.py module. Initializing a SBL::DB_manipulators::Uniprot_wrapper object with the path to the SQLite database opens a connection to the database, which is closed when the destructor is called. From this wrapper, a user can access the FASTA sequence as well as the NCBI identifier of a protein from its accession code.
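The pattern described above (open a connection on construction, close it on destruction, look up a protein by accession code) can be sketched with the standard sqlite3 module. This is a minimal illustration, not the SBL class: the class name UniprotLikeWrapper, its methods get_taxid and get_fasta, the table name proteins, its columns, and the placeholder sequence are all assumptions made for the example.

```python
import os
import sqlite3
import tempfile


def create_demo_database(db_path):
    """Write a one-row demo table (hypothetical schema, not the SBL one)."""
    conn = sqlite3.connect(db_path)
    with conn:  # commits on success
        conn.execute(
            "CREATE TABLE proteins ("
            "accession TEXT PRIMARY KEY, taxid INTEGER, sequence TEXT)"
        )
        # Accession code and taxid come from the text above; the sequence
        # is a placeholder, NOT the real EFF1 sequence.
        conn.execute(
            "INSERT INTO proteins VALUES (?, ?, ?)",
            ("G5ECA1", 6239, "MPLACEHOLDERSEQ"),
        )
    conn.close()


class UniprotLikeWrapper:
    """Hypothetical stand-in for a wrapper around a UniProt SQLite table."""

    def __init__(self, db_path):
        # The connection is opened once, when the wrapper is created.
        self._conn = sqlite3.connect(db_path)

    def __del__(self):
        # The connection is closed when the wrapper is destroyed.
        conn = getattr(self, "_conn", None)
        if conn is not None:
            conn.close()

    def _fetch(self, column, accession):
        # The column name is fixed by the callers below, never user input.
        row = self._conn.execute(
            "SELECT %s FROM proteins WHERE accession = ?" % column,
            (accession,),
        ).fetchone()
        return row[0] if row else None

    def get_taxid(self, accession):
        """Return the NCBI taxid of the protein, or None if unknown."""
        return self._fetch("taxid", accession)

    def get_fasta(self, accession):
        """Return a FASTA-formatted record, or None if unknown."""
        sequence = self._fetch("sequence", accession)
        if sequence is None:
            return None
        return ">%s\n%s" % (accession, sequence)


demo_path = os.path.join(tempfile.mkdtemp(), "uniprot_demo.db")
create_demo_database(demo_path)
wrapper = UniprotLikeWrapper(demo_path)
print(wrapper.get_taxid("G5ECA1"))
print(wrapper.get_fasta("G5ECA1"))
```

Relying on the destructor mirrors the behaviour described for the SBL wrapper; in new Python code an explicit close method or a context manager is often preferred, since the moment at which a destructor runs is not guaranteed.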
The NCBI Taxonomy database is a nomenclature of all the organisms in the public sequence databases.
In the field of taxonomy, a taxon is a conceptual entity which groups all organisms that share common traits. A taxon is associated with a taxonomic rank. This package uses the following non-exhaustive list of ranks, sorted by ancestry:
NCBI provides a custom dump (ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/). Download the taxdump.tar.gz archive. The database is designed as a tree. For each node, the script instantiates:
The class SBL::DB_manipulators::NCBI_wrapper is provided in the DB_manipulators.py module. Initializing a SBL::DB_manipulators::NCBI_wrapper object with the path to the SQLite database opens a connection to the database, which is closed when the destructor is called. From this wrapper, a user can access the taxonomic information of a protein through an NCBI taxid. Note that the NCBI Taxonomy database is built as a tree, so each entry has a parent field which is also an NCBI taxid.
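Because each entry stores the taxid of its parent, the lineage of a taxon is obtained by following parent links up to the root. The sketch below illustrates this on a toy tree loaded from lines in the layout of the NCBI dump files (fields separated by a tab, a vertical bar and a tab). It is not the SBL implementation: the function names, the table name nodes and its columns are assumptions, and the toy tree skips most intermediate nodes.

```python
import sqlite3

# Toy excerpts in the dump layout: fields separated by "\t|\t", each line
# terminated by "\t|". The taxids are real, but the tree is heavily
# abbreviated and the rank labels are only illustrative.
NODES_DMP = [
    "1\t|\t1\t|\tno rank\t|",
    "2759\t|\t1\t|\tsuperkingdom\t|",
    "33208\t|\t2759\t|\tkingdom\t|",
    "6231\t|\t33208\t|\tphylum\t|",
    "6237\t|\t6231\t|\tgenus\t|",
    "6239\t|\t6237\t|\tspecies\t|",
]
NAMES_DMP = [
    "1\t|\troot\t|\t\t|\tscientific name\t|",
    "1\t|\tall\t|\t\t|\tsynonym\t|",
    "2759\t|\tEukaryota\t|\t\t|\tscientific name\t|",
    "33208\t|\tMetazoa\t|\t\t|\tscientific name\t|",
    "6231\t|\tNematoda\t|\t\t|\tscientific name\t|",
    "6237\t|\tCaenorhabditis\t|\t\t|\tscientific name\t|",
    "6239\t|\tCaenorhabditis elegans\t|\t\t|\tscientific name\t|",
]


def parse_dmp_line(line):
    """Split one dump line into its fields."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]  # drop the end-of-line terminator
    return line.split("\t|\t")


def build_taxonomy(node_lines, name_lines):
    """Load the nodes and their scientific names into an in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE nodes (taxid INTEGER PRIMARY KEY, parent INTEGER, "
        "taxon_rank TEXT, name TEXT)"
    )
    for line in node_lines:
        taxid, parent, rank = parse_dmp_line(line)[:3]
        conn.execute(
            "INSERT INTO nodes (taxid, parent, taxon_rank) VALUES (?, ?, ?)",
            (int(taxid), int(parent), rank),
        )
    for line in name_lines:
        taxid, name, _unique_name, name_class = parse_dmp_line(line)[:4]
        if name_class == "scientific name":  # skip synonyms, common names
            conn.execute(
                "UPDATE nodes SET name = ? WHERE taxid = ?", (name, int(taxid))
            )
    return conn


def lineage(conn, taxid):
    """Return [(taxid, rank, name), ...] from the taxon up to the root."""
    path, seen = [], set()
    while taxid not in seen:  # the root is its own parent, which stops the walk
        seen.add(taxid)
        row = conn.execute(
            "SELECT parent, taxon_rank, name FROM nodes WHERE taxid = ?",
            (taxid,),
        ).fetchone()
        if row is None:
            raise KeyError(taxid)
        parent, rank, name = row
        path.append((taxid, rank, name))
        taxid = parent
    return path


taxonomy = build_taxonomy(NODES_DMP, NAMES_DMP)
for taxid, rank, name in lineage(taxonomy, 6239):
    print(taxid, rank, name)
```

The set of visited taxids guards against an endless loop both at the root, which points to itself, and in a corrupted table containing a cycle.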
We provide the script sbl-database-creator.py to build the databases locally. This script assumes the user has downloaded the dump of the database they wish to create. The creation of all databases is launched as follows:
> sbl-database-creator.py -d Database -n NCBI_dmp -u Uniprot_dmp
We additionally provide a class to handle the result of a query:
The class SBL::DB_hit_manager::Hits_manager is provided in the DB_hit_manager.py module. It takes a set of accession codes as argument and instantiates the following fields for each hit:
We have previously defined filters for annotated sequences (see Protein_sequence_annotator). We extend this principle: a hit filter allows the user to filter a list of hits using criteria on their annotated sequence as well as on their taxonomic information. A hit filter is a functor containing a member function, which takes a hit as argument and returns true if the hit satisfies the defined restrictions.
For example, the following filter will look for a protein sequence, from an organism which belongs to the "Chordata" taxon, which contains a "Transmembrane" feature:
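A minimal sketch of such a filter, written as a Python callable object, is given here. It does not use the SBL classes: the Hit record, its attributes lineage_names and features, the class TaxonAndFeatureFilter, the helper filter_hits and the demo accession codes are all hypothetical and only illustrate the functor idea.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical hit record (the SBL fields may differ)."""
    accession: str
    lineage_names: list  # names of the taxa from the species up to the root
    features: list       # feature types annotated on the sequence


class TaxonAndFeatureFilter:
    """Functor accepting hits from a given taxon that carry a given feature."""

    def __init__(self, taxon_name, feature_type):
        self.taxon_name = taxon_name
        self.feature_type = feature_type

    def __call__(self, hit):
        # Both conditions must hold for the hit to pass the filter.
        in_taxon = self.taxon_name in hit.lineage_names
        has_feature = self.feature_type in hit.features
        return in_taxon and has_feature


def filter_hits(hits, hit_filter):
    """Return the hits for which the filter returns True."""
    return [hit for hit in hits if hit_filter(hit)]


# Made-up hits, for illustration only.
hits = [
    Hit("DEMO_A", ["Homo sapiens", "Chordata", "Metazoa"],
        ["Transmembrane", "Signal peptide"]),
    Hit("DEMO_B", ["Caenorhabditis elegans", "Nematoda", "Metazoa"],
        ["Transmembrane"]),
    Hit("DEMO_C", ["Mus musculus", "Chordata", "Metazoa"],
        ["Domain"]),
]

chordata_tm = TaxonAndFeatureFilter("Chordata", "Transmembrane")
selected = filter_hits(hits, chordata_tm)
print([hit.accession for hit in selected])
```

Storing the criteria in the constructor and the test in the call operator keeps filters reusable: the same object can be applied to any list of hits, and any other callable taking a hit and returning a boolean could be passed to filter_hits in its place.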
The class SBL::DB_hit_manager::Hits_manager provides a function which returns all the hits, as well as a function that takes as argument a hit filter and returns filtered hits.
In the following example, we parse a file containing a list of accession codes. For each code, we display its taxonomic rank and its name.
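The workflow can be sketched as follows, under simplifying assumptions: the input is taken to hold one accession code per line, and two small dictionaries stand in for the UniProt and NCBI databases. The function names read_accession_codes and describe are hypothetical and do not belong to the SBL API; only the pair G5ECA1 / 6239 comes from the text above.

```python
import io

# Stand-ins for the two databases (illustrative content only).
ACCESSION_TO_TAXID = {"G5ECA1": 6239}
TAXID_TO_RANK_AND_NAME = {6239: ("species", "Caenorhabditis elegans")}


def read_accession_codes(handle):
    """Read one accession code per line, skipping blank and '#' lines."""
    codes = []
    for line in handle:
        code = line.strip()
        if code and not code.startswith("#"):
            codes.append(code)
    return codes


def describe(codes):
    """Return (accession, taxid, rank, name) for each code.

    Codes missing from the stand-in tables yield None entries.
    """
    report = []
    for code in codes:
        taxid = ACCESSION_TO_TAXID.get(code)
        rank, name = TAXID_TO_RANK_AND_NAME.get(taxid, (None, None))
        report.append((code, taxid, rank, name))
    return report


# An in-memory file replaces a file on disk to keep the sketch self-contained.
listing = io.StringIO("# accession codes\nG5ECA1\n\nUNKNOWN\n")
for code, taxid, rank, name in describe(read_accession_codes(listing)):
    print(code, taxid, rank, name)
```

With a file on disk, the io.StringIO object would simply be replaced by an open file handle.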