Structural Bioinformatics Library
Template C++ / Python API for developing structural bioinformatics applications.
User Manual

DB_manipulator

Authors: R. Tetley and F. Cazals

Introduction

When working with protein databases such as UniProtKB, it is useful to link a given protein to its corresponding species in a taxonomic tree, e.g. that from NCBI Taxonomy. We define:

In UniProtKB, the accession code of a protein is the alphanumeric string uniquely identifying this protein.


In NCBI Taxonomy, the taxid of a taxon is the positive integer uniquely identifying this taxon.


As an example, the EFF1 fusogen protein has accession code G5ECA1, and taxid 6239 which corresponds to Caenorhabditis elegans.

To map proteins (accession codes) to taxa, a key difficulty is to use unambiguous information, that is, to avoid redundancies and to determine species reliably. Taxonomy information from UniProtKB fails on both counts. For example, different common names (say the Belterra virus and the Icoaraci virus) are used for the same species (both are Rift Valley fever phleboviruses).

Fortunately, UniProtKB hits usually contain an NCBI identifier which links to the NCBI Taxonomy database. This package provides users with the script sbl-database-creator.py to

  • build an SQLite database for UniProtKB,
  • build an SQLite database for NCBI Taxonomy,
  • query the created databases and incorporate these queries into Python scripts.

UniprotKB

Database

The UniProt Knowledge Base (UniProtKB) is a database which aims at collecting data on proteins and their functional annotation. In this context, each protein sequence is associated with its so-called accession code, as defined above.

Assume that the XML dump of UniProtKB has been obtained. The script sbl-database-creator.py described below instantiates the following fields in an SQLite table:

  • The UniProtKB accession code.
  • The FASTA sequence associated with the given accession code.
  • The NCBI Taxonomy taxid: this characterizes an organism in the NCBI Taxonomy database.

As of November 2017, the UniProtKB dump is quite large, weighing approximately 65 GB.


Wrapper

The class SBL::DB_manipulators::Uniprot_wrapper is provided in the DB_manipulators.py module. Initializing a SBL::DB_manipulators::Uniprot_wrapper object with the path to the SQLite database opens a connection to the database, which is closed when the destructor is called. From this wrapper, a user can access the FASTA sequence as well as the NCBI identifier of a protein from its UniProtKB accession code.
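
The table layout and the wrapper behavior described above can be pictured with a minimal sketch based on Python's standard sqlite3 module. This is not the SBL implementation: the table name proteins, its column names, the class name Toy_uniprot_wrapper and the method get_sequence are assumptions made for illustration (only get_ncbi_taxid appears in the example at the end of this page), and the stored sequence is a placeholder string.

```python
import os
import sqlite3
import tempfile


def build_toy_database(db_path):
    """Create a one-row table holding the three fields listed above (names are assumed)."""
    connection = sqlite3.connect(db_path)
    connection.execute(
        "CREATE TABLE proteins (accession TEXT PRIMARY KEY, sequence TEXT, taxid INTEGER)"
    )
    # The sequence is a placeholder, not the real sequence of G5ECA1.
    connection.execute(
        "INSERT INTO proteins VALUES (?, ?, ?)", ("G5ECA1", "PLACEHOLDER", 6239)
    )
    connection.commit()
    connection.close()


class Toy_uniprot_wrapper:
    """Hypothetical stand-in for a wrapper around the UniProtKB table."""

    def __init__(self, db_path):
        # The connection is opened when the object is initialized.
        self._connection = sqlite3.connect(db_path)

    def close(self):
        # SBL closes the connection in the destructor; an explicit call is used here.
        self._connection.close()

    def _lookup(self, query, accession):
        row = self._connection.execute(query, (accession,)).fetchone()
        return row[0] if row is not None else None

    def get_ncbi_taxid(self, accession):
        return self._lookup("SELECT taxid FROM proteins WHERE accession = ?", accession)

    def get_sequence(self, accession):
        return self._lookup("SELECT sequence FROM proteins WHERE accession = ?", accession)


db_path = os.path.join(tempfile.mkdtemp(), "toy_uniprot.db")
build_toy_database(db_path)
wrapper = Toy_uniprot_wrapper(db_path)
taxid = wrapper.get_ncbi_taxid("G5ECA1")
unknown = wrapper.get_ncbi_taxid("NOT_A_CODE")
wrapper.close()
print(taxid, unknown)
```

Declaring the accession code as the primary key gives an indexed lookup, which matters for a table built from a dump of this size.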

NCBI taxonomy

Database

NCBI Taxonomy is a nomenclature of all the organisms in the public sequence databases.

In the field of taxonomy, a taxon is a conceptual entity which groups all organisms that share common traits. A taxon is associated with a taxonomic rank. This package uses the following non-exhaustive list of ranks, sorted by ancestry:

(Canonical set of ranks) superkingdom, kingdom, phylum, class, order, family, genus, species.


NCBI Taxonomy provides a custom dump (ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/): download the taxdump.tar.gz archive. The database is designed as a tree. For each node, the script instantiates:

  • The NCBI Taxonomy taxid: this integer uniquely characterizes a taxon in the NCBI Taxonomy database.
  • The name of the taxon (e.g. Vertebrates, Bacteria).
  • The NCBI Taxonomy taxid of its parent node. If the current node is the root, this is the same as its own taxid.
  • The name of the taxonomic rank (e.g. Superkingdom, Kingdom, Phylum).
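
As an illustration of where these four fields come from, the sketch below parses lines laid out like those of the nodes.dmp and names.dmp files found in the unpacked taxdump archive, whose fields are separated by a tab, a vertical bar and a tab. It is a simplified reading aid, not the parser used by the script: the function names are hypothetical, and the sample lines are hand-written in the dump's layout and truncated to their first fields.

```python
def split_dmp_line(line):
    """Split one line of an NCBI .dmp file into its fields."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]  # drop the trailing field terminator
    return [field.strip() for field in line.split("\t|\t")]


def parse_node(line):
    """Read taxid, parent taxid and rank from a nodes.dmp-style line."""
    fields = split_dmp_line(line)
    return {"taxid": int(fields[0]), "parent_taxid": int(fields[1]), "rank": fields[2]}


def parse_name(line):
    """Return (taxid, name) for scientific names, None for other name classes."""
    fields = split_dmp_line(line)
    if fields[3] != "scientific name":
        return None
    return int(fields[0]), fields[1]


# Hand-written sample lines (illustrative, truncated after the rank field).
root_node = parse_node("1\t|\t1\t|\tno rank\t|\n")
species_name = parse_name("6239\t|\tCaenorhabditis elegans\t|\t\t|\tscientific name\t|\n")
synonym = parse_name("1\t|\tall\t|\t\t|\tsynonym\t|\n")

print(root_node)
print(species_name)
```

In this layout the root is recognizable because its parent taxid equals its own taxid, which matches the convention stated in the list above.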

Wrapper

The class SBL::DB_manipulators::NCBI_wrapper is provided in the DB_manipulators.py module. Initializing a SBL::DB_manipulators::NCBI_wrapper object with the path to the SQLite database opens a connection to the database, which is closed when the destructor is called. From this wrapper, a user can access the taxonomic information of a protein through an NCBI taxid. Note that the NCBI Taxonomy database is built as a tree, so each entry has a parent field which is also an NCBI taxid.
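
Because each entry stores the taxid of its parent, the canonical ranks of a taxon can be recovered by walking up the tree until the root is reached. The sketch below shows this traversal on an in-memory dictionary rather than on the SQLite database. The function name canonical_lineage and the dictionary layout are assumptions, and the toy tree is a hand-written, heavily pruned excerpt: intermediate nodes are omitted, so its parent links are not those of the real dump.

```python
CANONICAL_RANKS = (
    "superkingdom", "kingdom", "phylum", "class",
    "order", "family", "genus", "species",
)

# taxid -> (parent taxid, rank, name); pruned toy tree for illustration only.
TOY_NODES = {
    1: (1, "no rank", "root"),
    2759: (1, "superkingdom", "Eukaryota"),
    33208: (2759, "kingdom", "Metazoa"),
    6231: (33208, "phylum", "Nematoda"),
    6237: (6231, "genus", "Caenorhabditis"),
    6239: (6237, "species", "Caenorhabditis elegans"),
}


def canonical_lineage(taxid, nodes):
    """Map each canonical rank found between taxid and the root to its taxon name."""
    lineage = {}
    while True:
        parent_taxid, rank, name = nodes[taxid]
        if rank in CANONICAL_RANKS:
            lineage[rank] = name
        if parent_taxid == taxid:  # the root is its own parent
            break
        taxid = parent_taxid
    return lineage


lineage = canonical_lineage(6239, TOY_NODES)
for rank in CANONICAL_RANKS:
    if rank in lineage:
        print(rank, lineage[rank])
```

Ranks that are absent from the path (class, order and family in this pruned tree) are simply left out of the result, so a caller can tell a missing rank from an empty name.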

Script

We provide the script sbl-database-creator.py to build the databases locally. This script assumes that the user has downloaded the dump of each database to be created. The creation of all databases is launched as follows:

> sbl-database-creator.py -d Database -n NCBI_dmp -u Uniprot_dmp

The main options of the program sbl-database-creator.py are:
-d string: Directory in which to store the SQLite databases
-n string: Location of the unpacked NCBI dump
-u string: Location of the .xml UniProtKB dump


Hit manager

We additionally provide a class to handle the result of a query:

When querying Uniprot, we recover a list of protein sequences in the form of accession codes. Each protein sequence is called a hit.


The class SBL::DB_hit_manager::Hits_manager is provided in the DB_hit_manager.py module. It takes a set of accession codes as argument and instantiates the following fields for each hit:

  • An annotated sequence (using the Phobius annotator by default), see Protein_sequence_annotator
  • The taxonomic information from NCBI. Since some taxonomic ranks are unique to a given branch in the tree of life, we only instantiate taxon names for the aforementioned canonical set of ranks.

We have previously defined filters for annotated sequences (see Protein_sequence_annotator). We extend this principle: a hit filter allows the user to filter a list of hits by using criteria on their annotated sequence as well as on their taxonomic information. A hit filter is a functor containing a member function, which takes a hit as argument and returns true if the hit satisfies the defined restrictions.

For example, the following filter will look for a protein sequence, from an organism which belongs to the "Chordata" taxon, which contains a "Transmembrane" feature:

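# Chordata_transmembrane_filter.py
import Sequence_annotators  # annotated sequences, see Protein_sequence_annotator

# The class name is assumed from the file name: the listing only shows the two methods.
class Chordata_transmembrane_filter:
    def filter(self, hit):
        # Keep hits from the Chordata phylum which carry a transmembrane feature.
        return self.is_transmembrane(hit.annotated_sequence) and hit.taxonomy["phylum"] == "Chordata"

    def is_transmembrane(self, annotated_sequence):
        # True if at least one "Transmembrane" feature is reported for the sequence.
        return bool(annotated_sequence.get_features("Transmembrane"))
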
As for sequence filters (see Protein_sequence_annotator), users can define their own filters.


The class SBL::DB_hit_manager::Hits_manager provides a function which returns all the hits, as well as a function that takes as argument a hit filter and returns filtered hits.
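
The interplay between hits and hit filters can be sketched with plain Python objects. Everything in this block is a stand-in written for illustration and does not use SBL: the classes Toy_hit and Phylum_filter, the function filtered_hits and the accession code TOY0001 are hypothetical, and only the taxonomy dictionary keyed by rank mirrors the filter shown above.

```python
class Toy_hit:
    """Hypothetical hit: an accession code plus taxon names keyed by rank."""

    def __init__(self, accession, taxonomy):
        self.accession = accession
        self.taxonomy = taxonomy


class Phylum_filter:
    """Functor accepting the hits whose phylum matches the requested one."""

    def __init__(self, phylum):
        self.phylum = phylum

    def filter(self, hit):
        # A hit without a phylum entry is rejected rather than raising an error.
        return hit.taxonomy.get("phylum") == self.phylum


def filtered_hits(hits, hit_filter):
    """Return the hits accepted by the filter, preserving their order."""
    return [hit for hit in hits if hit_filter.filter(hit)]


hits = [
    Toy_hit("G5ECA1", {"phylum": "Nematoda", "species": "Caenorhabditis elegans"}),
    Toy_hit("TOY0001", {"phylum": "Chordata"}),  # made-up accession code
    Toy_hit("TOY0002", {}),                      # made-up hit lacking taxonomy
]
selected = filtered_hits(hits, Phylum_filter("Nematoda"))
print([hit.accession for hit in selected])
```

Keeping the predicate in an object rather than in a bare function lets a filter carry its own parameters, here the phylum, while exposing a single filter method to the caller.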

Example

In the following example, we parse a file containing a list of UniProtKB accession codes. For each code, we display the rank and the name of the associated taxon.

#!/usr/bin/python3
# Example code for the DB_manipulators package.
# Parses an input file and prints the taxon corresponding to each accession code.
import sys
from optparse import OptionParser

import SBL
from SBL.DB_manipulators import Uniprot_wrapper, NCBI_wrapper

parser = OptionParser()
# Fetch the accession codes from a file.
parser.add_option("-f", "--file", dest="file_name", type="string",
                  help="Input a file containing a list of Uniprot accession codes.")
(options, args) = parser.parse_args()
if not options.file_name:
    sys.exit("You must provide an input file containing accession codes...")

# This script assumes that the databases have been created and that the
# environment variables pointing to them have been set.
uniprot_dir = SBL.get_env_or_die("UNIPROT_DIR")
ncbi_dir = SBL.get_env_or_die("NCBI_DIR")

uni = Uniprot_wrapper(uniprot_dir)
ncbi = NCBI_wrapper(ncbi_dir)
with open(options.file_name) as accession_file:
    for line in accession_file:
        accession = line.strip()  # drop the trailing newline
        if not accession:
            continue  # skip blank lines
        taxid = uni.get_ncbi_taxid(accession)
        name = ncbi.get_name(taxid)
        rank = ncbi.get_rank(taxid)
        print("Prot: %s, Rank: %s, Name: %s" % (accession, rank, name))