Structural Bioinformatics Library
Template C++ / Python API for developing structural bioinformatics applications.
User Manual

Protein_sequence_annotator

Authors: R. Tetley and F. Cazals

Introduction

When processing large sets of protein sequences, it is useful to flag sites of interest such as transmembrane regions or binding sites. For example, one may want to find the proteins which contain a transmembrane region at least 30 amino acids long. Sequence annotations serve this purpose: they allow users to search annotated sequences for given characteristics. We provide two Python modules: one defines annotated sequences and provides some standard annotators; the other filters a set of protein sequences using properties of these annotations.

Pre-requisites

An annotation on a protein sequence is a triplet composed of a feature key, a residue sequence number range (or list), and a description. UniProt provides a list of standard features: http://www.uniprot.org/help/sequence_annotation


  • The feature key defines the type of feature
  • The range locates the feature on the sequence
  • The description (optional) gives additional information on the feature

As an example, the EFF1 protein (http://www.uniprot.org/uniprot/G5ECA1) contains:

  • A Transmembrane region (feature key)
  • Between residues 556 and 576 (residue range)
  • Which is Helical (optional description)
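The triplet can be pictured as a small record. The sketch below is only an illustration: the Annotation name and its field names are assumptions made here, not the library's own types.

```python
from collections import namedtuple

# Hypothetical record mirroring the (feature key, range, description) triplet.
Annotation = namedtuple("Annotation", ["feature_key", "residue_range", "description"])

# The EFF1 transmembrane region described above.
eff1_tm = Annotation("Transmembrane", (556, 576), "Helical")

# Length of the region, counting both end residues.
begin, end = eff1_tm.residue_range
region_length = end - begin + 1
print(eff1_tm.feature_key, region_length)  # prints: Transmembrane 21
```

With a length of 21 residues, this region would not satisfy the "at least 30 amino acids" query mentioned in the introduction.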

Python modules

Sequence annotations

The Python module SBL/Sequence_annotators.py provides the SBL::Sequence_annotators::Annotated_sequence class. Such an object should be initialized with a name as well as a fasta sequence. An annotator should then be used to add sequence annotations.
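The members used on this page (name, fasta_seq, the features field, add_feature and get_features) suggest the following shape for an annotated sequence. This is a simplified stand-in written for illustration, not the SBL implementation; the Feature class and its attribute names are assumptions.

```python
class Feature:
    """Hypothetical feature record: key, (begin, end) range, description."""

    def __init__(self, key, residue_range, description=""):
        self.key = key
        self.residue_range = residue_range
        self.description = description


class Annotated_sequence:
    """Simplified stand-in: a named FASTA sequence plus its features."""

    def __init__(self, name, fasta_seq):
        self.name = name
        self.fasta_seq = fasta_seq
        self.features = []

    def add_feature(self, feature):
        self.features.append(feature)

    def get_features(self, key):
        # Return every feature whose key matches, possibly an empty list.
        return [f for f in self.features if f.key == key]


# A freshly initialized object carries no annotation until an annotator runs.
seq = Annotated_sequence("toy", "MKTLLVAAGL")
seq.add_feature(Feature("Signal", (1, 5)))
print(len(seq.get_features("Signal")), len(seq.get_features("Transmembrane")))
```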

Annotations using Phobius

Phobius is a combined transmembrane topology and signal peptide prediction method [102], based upon profile Hidden Markov models.

The Python module SBL/Sequence_annotators.py provides the class SBL::Sequence_annotators::Phobius_annotator, which uses the Phobius executable to annotate a sequence.

An annotator has a single function, annotate, which takes an annotated sequence object as argument and produces annotations by exploiting its fasta sequence.

For example, the Phobius annotator, when given an annotated sequence object, writes its fasta sequence to a file, runs the Phobius executable, and parses the results. These results are then used to annotate the sequence.

import os
import re  # regular expressions

# This class runs the Phobius executable to annotate a sequence.
# Note: sf is not defined in this excerpt; it is presumably the module providing the Feature class.
class Phobius_annotator:
    __result_line_re__ = re.compile(r"(FT)\s+(DOMAIN|TRANSMEM|SIGNAL|TOPO_DOM|REGION)\s+(\d+)\s+(\d+)\s+(CYTOPLASMIC|NON\sCYTOPLASMIC|N-REGION|H-REGION|C-REGION)?")
    __standard_feature_keys__ = {"DOMAIN" : "Domain", "TOPO_DOM" : "TDomain", "SIGNAL" : "Signal", "REGION" : "Region", "TRANSMEM" : "Transmembrane"}
    __standard_descriptions__ = {"CYTOPLASMIC" : "Cyto", "NON CYTOPLASMIC" : "Non-cyto", "N-REGION" : "N", "H-REGION" : "H", "C-REGION" : "C", "" : ""}

    def annotate(self, annotated_sequence):
        # Send stdout and stderr to tmp.phobius (POSIX sh syntax, since os.popen uses /bin/sh).
        phobius_cmd = "phobius.pl tmp.fasta > tmp.phobius 2>&1"
        rm_cmd = "rm tmp.fasta tmp.phobius"
        # Write the sequence to a temporary fasta file.
        with open("tmp.fasta", "w") as tmp_fasta:
            tmp_fasta.write(">%s\n%s\n" % (annotated_sequence.name, annotated_sequence.fasta_seq))
        # Run Phobius and wait for it to finish.
        os.popen(phobius_cmd).read()
        # Parse the output: each matching line yields one feature.
        with open("tmp.phobius") as phobius_file:
            for line in phobius_file:
                result = self.__result_line_re__.search(line)
                if result:
                    feature_key = result.group(2)
                    range_begin = int(result.group(3))
                    range_end = int(result.group(4))
                    # The description is optional in the Phobius output.
                    feature_description = result.group(5) or ""
                    feature = sf.Feature(self.__standard_feature_keys__[feature_key], (range_begin, range_end), self.__standard_descriptions__[feature_description])
                    annotated_sequence.add_feature(feature)
        # Remove the temporary files.
        os.popen(rm_cmd).read()
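To see what the parsing step extracts, the pattern from the listing above can be applied to a few lines in isolation. The sample lines below are hand-written to follow the layout the pattern expects; they are invented for illustration and are not actual Phobius output.

```python
import re

# Same pattern as in the listing, split over two raw strings for readability.
result_line_re = re.compile(
    r"(FT)\s+(DOMAIN|TRANSMEM|SIGNAL|TOPO_DOM|REGION)\s+(\d+)\s+(\d+)\s+"
    r"(CYTOPLASMIC|NON\sCYTOPLASMIC|N-REGION|H-REGION|C-REGION)?"
)

# Invented sample lines. The trailing newlines matter: the pattern requires
# whitespace after the end position, which the newline supplies when the
# description is absent.
sample_lines = [
    "FT   SIGNAL        1     23\n",
    "FT   TOPO_DOM     24    555       NON CYTOPLASMIC.\n",
    "FT   TRANSMEM    556    576\n",
    "FT   TOPO_DOM    577    600       CYTOPLASMIC.\n",
    "//\n",
]

parsed = []
for line in sample_lines:
    result = result_line_re.search(line)
    if result:
        # (feature key, range begin, range end, optional description)
        parsed.append((result.group(2), int(result.group(3)),
                       int(result.group(4)), result.group(5) or ""))

for entry in parsed:
    print(entry)
```

Lines that do not start a feature record, such as the final separator, are simply skipped.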
Users can design their own annotator by creating a functor object whose annotate member function fills the features field of the annotated sequence.
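As a sketch of such a user-defined annotator, the toy class below flags runs of hydrophobic residues. Everything in it is an assumption made for illustration: the Hydrophobic_run_annotator name, the residue alphabet, the default threshold, and the stand-in Feature and Annotated_sequence classes that keep the example self-contained. It is a crude heuristic, not a transmembrane predictor.

```python
import re


class Feature:
    """Hypothetical stand-in for the library's feature record."""

    def __init__(self, key, residue_range, description=""):
        self.key = key
        self.residue_range = residue_range
        self.description = description


class Annotated_sequence:
    """Simplified stand-in exposing only what the annotator needs."""

    def __init__(self, name, fasta_seq):
        self.name = name
        self.fasta_seq = fasta_seq
        self.features = []

    def add_feature(self, feature):
        self.features.append(feature)


class Hydrophobic_run_annotator:
    """Toy annotator: marks runs of at least min_length hydrophobic residues."""

    def __init__(self, min_length=8):
        self._run_re = re.compile("[AILMFVW]{%d,}" % min_length)

    def annotate(self, annotated_sequence):
        for match in self._run_re.finditer(annotated_sequence.fasta_seq):
            # Residue numbers are 1-based and inclusive.
            residue_range = (match.start() + 1, match.end())
            annotated_sequence.add_feature(
                Feature("Region", residue_range, "Hydrophobic"))


seq = Annotated_sequence("toy", "MKKSAILLVVAAFFGGDE")
Hydrophobic_run_annotator(min_length=8).annotate(seq)
ranges = [f.residue_range for f in seq.features]
print(ranges)  # prints: [(5, 14)]
```

The only contract the rest of the pipeline relies on is the annotate member function, so the annotator can keep any internal state it needs in its constructor.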


Sequence filters

Through the SBL/Sequence_filters.py module, we provide a set of filters to select annotated sequences using criteria on their annotations.

A sequence filter object is a functor containing the filter member function, which takes an annotated sequence as argument and returns True if the given sequence satisfies the criteria defined by the filter.

For example, the SBL::Sequence_filters::Transmembrane_filter class will simply look for a "Transmembrane" feature in the annotated sequence.

import Sequence_annotators

class Transmembrane_filter:
    def filter(self, annotated_sequence):
        return self.is_transmembrane(annotated_sequence)

    def is_transmembrane(self, annotated_sequence):
        # True if the sequence carries at least one "Transmembrane" feature.
        if annotated_sequence.get_features("Transmembrane"):
            return True
        return False
Users can define their own filters.
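As an illustration, the query from the introduction (a transmembrane region at least 30 amino acids long) could be written as a filter along the following lines. The Long_transmembrane_filter class is hypothetical, and so are the stand-in Feature and Annotated_sequence classes and the residue_range attribute name; only the filter and get_features member functions come from this page.

```python
class Feature:
    """Hypothetical stand-in for the library's feature record."""

    def __init__(self, key, residue_range, description=""):
        self.key = key
        self.residue_range = residue_range
        self.description = description


class Annotated_sequence:
    """Simplified stand-in exposing only what the filter needs."""

    def __init__(self, name, fasta_seq):
        self.name = name
        self.fasta_seq = fasta_seq
        self.features = []

    def add_feature(self, feature):
        self.features.append(feature)

    def get_features(self, key):
        return [f for f in self.features if f.key == key]


class Long_transmembrane_filter:
    """Accepts sequences with a transmembrane region of at least min_length residues."""

    def __init__(self, min_length=30):
        self.min_length = min_length

    def filter(self, annotated_sequence):
        for feature in annotated_sequence.get_features("Transmembrane"):
            begin, end = feature.residue_range
            # Both end residues belong to the region.
            if end - begin + 1 >= self.min_length:
                return True
        return False


short_tm = Annotated_sequence("short_tm", "")
short_tm.add_feature(Feature("Transmembrane", (556, 576), "Helical"))
long_tm = Annotated_sequence("long_tm", "")
long_tm.add_feature(Feature("Transmembrane", (100, 131)))

tm_filter = Long_transmembrane_filter(min_length=30)
print(tm_filter.filter(short_tm), tm_filter.filter(long_tm))  # prints: False True
```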


Example

In this example, we parse a fasta file and annotate each listed fasta sequence using the Phobius annotator.

We then use the SBL::Sequence_filters::Class_II_filter class to assess whether the given sequence is a viable class II fusion protein candidate.

#! /usr/bin/python3
import re  # regular expressions
import sys  # misc system
from optparse import OptionParser

from SBL.Sequence_annotators import Annotated_sequence, Phobius_annotator
from SBL.Sequence_filters import *

name_line_re = re.compile(r">([A-Za-z0-9\.]+)")
fasta_line_re = re.compile(r"^[A-Za-z]+")

parser = OptionParser()
# parse the results for a unique file
parser.add_option("-f", "--fasta", dest="fasta", type="string", help="Input the fasta file")
(options, args) = parser.parse_args()

if not options.fasta:
    sys.exit("You must provide a valid fasta file")

# read the fasta file; each sequence is expected to occupy a single line
fasta_sequences = []
name = "No name"
with open(options.fasta) as fasta_file:
    for line in fasta_file:
        name_line = name_line_re.search(line)
        fasta_line = fasta_line_re.search(line)
        if name_line:
            name = name_line.group(1)
        if fasta_line:
            # drop the trailing newline before storing the sequence
            fasta_sequences.append((name, line.strip()))
            name = "No name"

# create annotator and filter
phobius = Phobius_annotator()
class_II = Class_II_filter()

# analyse each sequence contained in the file
for name, fasta in fasta_sequences:
    ann_seq = Annotated_sequence(name, fasta)
    phobius.annotate(ann_seq)
    if class_II.filter(ann_seq):
        print("Probable class II candidate: %s" % (ann_seq.name))
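The script above treats every sequence line as a separate record, so it suits files where each sequence sits on a single line. For fasta files whose sequences wrap over several lines, a reader along the following lines could be substituted. This is a minimal sketch: read_fasta is a hypothetical helper, not part of the library, and it keeps the first whitespace-delimited token of the header as the name, which differs slightly from the pattern used in the script.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of fasta lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                yield name, "".join(chunks)
            tokens = line[1:].split()
            name = tokens[0] if tokens else "No name"
            chunks = []
        else:
            # Sequence data found before any header gets a default name.
            if name is None:
                name = "No name"
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


example_lines = [
    ">seqA some description\n",
    "MKT\n",
    "LLV\n",
    "\n",
    ">seqB\n",
    "AAGG\n",
]
records = list(read_fasta(example_lines))
print(records)  # prints: [('seqA', 'MKTLLV'), ('seqB', 'AAGG')]
```

Since an open file is itself an iterable of lines, the helper can be called directly on the file object returned by open.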