FunChaT

Authors: R. Tetley and F. Cazals

Goals

Goals. In this package, we consider the problem of protein function annotation and prediction from structural and sequence information. In short: given a set of polypeptide chains involved in the same function, we wish to:

extract sub-sequences accounting for this function,

use these sequences to query a database of sequences, to ideally find proteins with the same function.

These questions are especially challenging when the sequences associated with the structures provided have low sequence identity – say < 20%. Our solution to this problem uses the identification of structurally conserved motifs, see [41] and Structural_motifs.

In a nutshell, given two proteins, a structural motif consists of two sets of amino acids, one on each structure, such that the $\lrmsd$ of the motif is significantly smaller than that associated with a global alignment between the two structures.

Overview of the method. Given a (small) set of structures, our method consists of three main steps:

(Step 1) The first step collects structural motifs for all pairs of structures. See the package Structural_motifs.

(Step 2) Second, the sequences of the proteins and the sub-sequences of the motifs are used to compute multiple sequence alignments (MSA), from which profile hidden Markov models (HMM) are built. See also the package HMMER_Wrapper.

Note that combining sub-sequences and whole sequences is meant to bias the MSA towards the important regions, while still retaining information on the linkers connecting these regions.

(Step 3) The HMMs are used to query UniProt, the database of protein sequences and functional annotations. The hits obtained are filtered so as to retain the sequences with properties related to the function studied. See also the package DB_manipulator.

Pre-requisites

In the following, we detail the three steps just sketched.

Step 1: Structural motifs

Given two proteins, a structural motif consists of two sets of amino acids, one on each structure, such that the $\lrmsd$ of the motif is significantly smaller than that associated with a global alignment between the two structures.

Note that two proteins typically generate several motifs, which can be filtered using motif inclusion and statistical significance. See the package Structural_motifs for the details.

Step 2: Multiple sequence alignments, profile HMM

Parameterized Consensus sequences

Suppose the previous step yields a set $\calM$ of motifs. To each structural motif, we associate its $\lrmsd$ ratio. We build $\calR$, the set of all $\lrmsd$ ratios.

Consider a structure $i \in \calS$ together with a $\lrmsd$ ratio $r \in \calR$. The parameterized consensus sequence $\pconsSeq{i}{r}$ is defined as the sequence of this structure, in which every amino-acid position not involved in any motif with $\lrmsd$ ratio less than $r$ is replaced by a gap. The set of all parameterized consensus sequences is denoted $\pconsSeq$.
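The definition above can be illustrated with a minimal Python sketch. It is not the FunChaT implementation: the `Motif` record, the 0-based positions, the `-` gap character and the toy sequence are all assumptions made for illustration. The threshold test is taken here as "at most $r$", following the $r \leq 0.8$ convention of the figure caption.

```python
from typing import FrozenSet, Iterable, NamedTuple


class Motif(NamedTuple):
    """Hypothetical record: the residues of ONE structure involved in a motif."""
    positions: FrozenSet[int]  # 0-based indices into the structure's sequence
    ratio: float               # lRMSD ratio associated with the motif


def parameterized_consensus(sequence: str, motifs: Iterable[Motif],
                            r: float, gap: str = "-") -> str:
    """Keep residues covered by at least one motif with ratio <= r; gap the rest."""
    kept = set()
    for motif in motifs:
        if motif.ratio <= r:  # assumption: inclusive threshold
            kept.update(motif.positions)
    return "".join(aa if i in kept else gap for i, aa in enumerate(sequence))


# Toy data, invented for the example.
toy_sequence = "MKTAYIAKQR"
toy_motifs = [Motif(frozenset({1, 2, 3}), 0.6), Motif(frozenset({6, 7}), 0.9)]

pcs_strict = parameterized_consensus(toy_sequence, toy_motifs, r=0.8)
pcs_loose = parameterized_consensus(toy_sequence, toy_motifs, r=0.9)
print(pcs_strict)
print(pcs_loose)
```

Raising the threshold admits more motifs, so the consensus sequence retains more residues and fewer gaps.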

Motifs and PCS for the Rift Valley Fever virus class II fusion protein

a) The structure of the Rift Valley Fever virus class II fusion protein. Residues belonging to a motif with $\lrmsd$ ratio $r \leq 0.8$ are displayed as red spheres.

b) The sequence associated with the structure on the left. Residues belonging to a motif with $\lrmsd$ ratio $r \leq 0.8$ are highlighted in red.

c) The previously defined $\pconsSeq{i}{r}$, with $i$ being the class II fusion protein of RVFV and $r = 0.8$.

Hybrid Multiple Sequence alignments and hidden Markov models

From the previously defined parameterized consensus sequences, we compute hybrid multiple sequence alignments. To do so, for a given ratio $r$ and $\forall i \in \calS$, we build $PCS_{\leq r}^i$. We then instantiate the multiple sequence alignment with these sequences as well as the full sequence of each structure.

The multiple sequence alignment is performed using the Clustal Omega algorithm ([160]), or alternatively, the Muscle algorithm ([81]).

From the hybrid multiple sequence alignment, we build hidden Markov models. These models are used to query databases (notably the $\text{\uniprot}$ DB). We use HMMER.

This step is performed by using the HMMER_Wrapper package.

Step 3: Database queries and filtering

Querying $\text{\uniprot}$ with an HMM yields a list of hits identified by their accession codes (as defined in DB_manipulator).

For a given hit, we use the DB_manipulator package and its DB_hits_manager.py module to recover the taxonomic information from the NCBI Taxonomy database and its fasta sequence. Each hit can be filtered by an e-value which assesses its significance.

To filter results, we annotate the sequence through the Protein_sequence_annotator package.

We are now able to filter the results through a selection of taxonomic or sequence criteria (see Protein_sequence_annotator and DB_manipulator for more details).
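The filtering just described can be sketched as follows. This is only an illustration under stated assumptions: the `Hit` record, its fields, the accession codes and the taxon names are hypothetical, and none of this is the API of DB_manipulator or Protein_sequence_annotator.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Hit:
    """Hypothetical record for one database hit."""
    accession: str
    e_value: float
    lineage: Tuple[str, ...]  # taxonomic lineage, root first
    sequence: str


def filter_hits(hits: Iterable[Hit], max_e_value: float,
                required_taxon: Optional[str] = None,
                min_length: int = 0) -> List[Hit]:
    """Keep hits that pass an e-value cutoff, a taxonomic test and a length test."""
    kept = []
    for hit in hits:
        if hit.e_value > max_e_value:
            continue  # not significant enough
        if required_taxon is not None and required_taxon not in hit.lineage:
            continue  # taxonomic criterion not met
        if len(hit.sequence) < min_length:
            continue  # sequence criterion not met
        kept.append(hit)
    return kept


# Invented hits, for illustration only.
toy_hits = [
    Hit("ACC0001", 1e-30, ("Viruses", "GroupA"), "MKTAYIAKQRQISFVK"),
    Hit("ACC0002", 1e-2, ("Viruses", "GroupA"), "MKTAYIAKQRQISFVK"),
    Hit("ACC0003", 1e-25, ("Bacteria", "GroupB"), "MKTAYIAKQRQISFVK"),
    Hit("ACC0004", 1e-40, ("Viruses", "GroupC"), "MKT"),
]
selected = filter_hits(toy_hits, max_e_value=1e-5,
                       required_taxon="Viruses", min_length=10)
print([hit.accession for hit in selected])
```

Applying the cheapest test first (the e-value comparison) avoids evaluating the taxonomic and sequence criteria on hits that would be discarded anyway.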

Bootstrapping

After performing Step 3, and filtering the hits, we obtain a set of protein sequences. This set can be used to bootstrap the process: we add these new sequences to the initial set and build a new HMM. We use this new HMM to perform a new query.
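The control flow of bootstrapping can be sketched as a short loop. The sketch below is an assumption-laden toy: the profile HMM is replaced by a set of 3-mers and the database by a small list, so that the example runs without HMMER. Only the loop structure (build a model, query, filter, enlarge the set, repeat) mirrors the description above.

```python
from typing import Callable, List, Sequence

# Toy "database", invented for the example.
DATABASE = ["AAAB", "AABC", "ABCD", "XYZ"]


def kmers(seq: str, k: int = 3) -> set:
    """All substrings of length k of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_model(sequences: Sequence[str]) -> set:
    """Stand-in for HMM construction: the union of the 3-mers of the sequences."""
    model = set()
    for seq in sequences:
        model |= kmers(seq)
    return model


def query(model: set) -> List[str]:
    """Stand-in for a database search: entries sharing a 3-mer with the model."""
    return [entry for entry in DATABASE if kmers(entry) & model]


def run_with_bootstrap(initial: Sequence[str], bootstrap_steps: int,
                       keep: Callable[[str], bool] = lambda hit: True) -> List[str]:
    """Run one initial pass, then up to `bootstrap_steps` additional passes."""
    sequences = list(dict.fromkeys(initial))  # de-duplicate, keep order
    hits: List[str] = []
    for _ in range(bootstrap_steps + 1):
        model = build_model(sequences)
        hits = [hit for hit in query(model) if keep(hit)]
        new = [hit for hit in hits if hit not in sequences]
        if not new:
            break  # nothing to add: further passes would give the same hits
        sequences.extend(new)
    return hits


hits_no_bootstrap = run_with_bootstrap(["AAAB"], bootstrap_steps=0)
hits_one_step = run_with_bootstrap(["AAAB"], bootstrap_steps=1)
print(hits_no_bootstrap, hits_one_step)
```

In the toy run, the first pass recovers one new sequence, and the model rebuilt from the enlarged set reaches an entry that the initial model could not.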

Using FunChaT

We provide two executables:

The first one, sbl-FunChaT-step-one.py, corresponds to the first step (described above).
The second one, sbl-FunChaT-step-two-and-three.py, packages both steps two and three.

Step One Input: Specifications and File Types

The main input of sbl-FunChaT-step-one.py is a folder containing the pdb files for each of the scrutinized structures.

Note that the executable makes use of the Batch_manager package. Default options are instantiated for Structural_motifs in the FunChaT-default.spec file. The user should take the time to fine tune these options.

A computation is launched as follows:

> sbl-FunChaT-step-one.py -p data/pdb -s data/FunChaT-default.spec -o results --parallel 4

Because of the Apurva computation time (see Apurva), the example run takes on the order of two hours on a laptop computer, regardless of the size of the input molecules.

The main options of the program sbl-FunChaT-step-one.py are:
-p string: PDB files directory
-s string: Specification file for Structural_motifs options
-o string: Output directory
--parallel (Optional) string: Create specified number of parallel batches

File Name	Description
PDB File	A .pdb file containing one of the processed structures. The script should be directed to a folder containing all PDB files of analyzed structures.
Spec File	A .spec file containing the specification of the run.

Input files for the run described in section Step One Input: Specifications and File Types.

Step One Output: Specifications and File Types

The output consists of a list of folders each containing the output files of the Structural_motifs package.

Step Two and Three Input: Specifications and File Types

The main input of sbl-FunChaT-step-two-and-three.py is a folder containing the results of Structural_motifs. The results of each pairwise comparison are bundled in a .xml file. The file for each of these comparisons should be contained in an independent folder with the standard name sbl-structural-motifs-*-pdb-fileStructure1-pdb-fileStructure2 (as per the Batch_manager default nomenclature).
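Before launching steps two and three, it can be useful to check that the results directory follows this layout. The sketch below is a hypothetical helper, not part of FunChaT: it assumes only that each comparison folder name starts with `sbl-structural-motifs-` and contains at least one `.xml` file. The folder and file names created in the demonstration are invented.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def find_comparison_xml(results_dir) -> Dict[str, List[Path]]:
    """Map each comparison folder name to the .xml files it contains."""
    found = {}
    for folder in sorted(Path(results_dir).glob("sbl-structural-motifs-*")):
        if folder.is_dir():
            found[folder.name] = sorted(folder.glob("*.xml"))
    return found


# Demonstration on a throwaway directory with invented names.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("sbl-structural-motifs-run-pdb-A-pdb-B",
                 "sbl-structural-motifs-run-pdb-A-pdb-C",
                 "unrelated"):
        (root / name).mkdir()
        (root / name / "comparison.xml").write_text("<results/>")
    found = find_comparison_xml(root)
    found_names = sorted(found)
    xml_counts = [len(found[name]) for name in found_names]

print(found_names, xml_counts)
```

A folder reported with an empty list of `.xml` files would indicate a pairwise comparison whose results are missing.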

Additionally, the user should provide a file containing, for each structure: its full fasta sequence, its name, a residue sequence number range corresponding to the residues contained in the sequence. (See example provided in the second line of the table below.)

A computation involving steps 2 and 3 is launched as follows:

> sbl-FunChaT-step-two-and-three.py -r results -f data/sequences.fasta -d ALL -t 0.8 -o results/hmm_results -a $UNIPROT_FASTA_DIR -u $UNIPROT_DIR -n $NCBI_DIR

Or if a user would prefer to only perform step two:

> sbl-FunChaT-step-two-and-three.py -r results -f data/sequences.fasta -d ALL -t 0.8 -o results/hmm_results -a $UNIPROT_FASTA_DIR -u $UNIPROT_DIR -n $NCBI_DIR --no-step-3

And later step three:

> sbl-FunChaT-step-two-and-three.py --hmm_res_file data/sequences.hmm  -a $UNIPROT_FASTA_DIR -u $UNIPROT_DIR -n $NCBI_DIR

Note that the process can be bootstrapped:

> sbl-FunChaT-step-two-and-three.py -r results -f data/sequences.fasta -d D2 -t 0.7 -o results/hmm_results -a $UNIPROT_FASTA_DIR -u $UNIPROT_DIR -n $NCBI_DIR --bootstrap 1 -m

The main options of the program sbl-FunChaT-step-two-and-three.py are:
-r string: Structural motifs results directory
-f string: Fasta file containing the full sequence of all partners
-d string: (Optional) The user can further decompose the sequence per domain (note that the previous step must have been run using domain specification files, see Structural_motifs)
-t float: The lRMSD ratio threshold for the PCS
-o string: The directory in which the output should be dumped
-a string: The directory containing the fasta dump of the Uniprot DB
-u string: The directory containing the Sqlite Uniprot DB
-n string: The directory containing the Sqlite NCBI Taxonomy DB
--no-step-3: (Optional) Do not perform step three of FunChaT
--hmm_res_file string: (Optional) Parse a previously computed HMM and perform step three of FunChaT
--bootstrap int: Perform the specified number of bootstrap steps
-c int: Use the specified number of cores for a parallel computation
-m: Use the MUSCLE aligner instead of Clustal Omega
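When the executable is driven from a script, the command line can be assembled programmatically. The sketch below is a hypothetical convenience wrapper that uses only the options listed above. The function name, its parameter names and the directory values in the example are assumptions. The resulting list can be passed to `subprocess.run`.

```python
import shlex
from typing import List, Optional


def step_two_three_command(results_dir: str, fasta: str, threshold: float,
                           out_dir: str, uniprot_fasta_dir: str,
                           uniprot_dir: str, ncbi_dir: str,
                           domain: Optional[str] = None,
                           bootstrap: Optional[int] = None,
                           cores: Optional[int] = None,
                           use_muscle: bool = False,
                           skip_step_three: bool = False) -> List[str]:
    """Build the argument list for sbl-FunChaT-step-two-and-three.py."""
    cmd = ["sbl-FunChaT-step-two-and-three.py", "-r", results_dir, "-f", fasta]
    if domain is not None:
        cmd += ["-d", domain]
    cmd += ["-t", str(threshold), "-o", out_dir,
            "-a", uniprot_fasta_dir, "-u", uniprot_dir, "-n", ncbi_dir]
    if skip_step_three:
        cmd.append("--no-step-3")
    if bootstrap is not None:
        cmd += ["--bootstrap", str(bootstrap)]
    if cores is not None:
        cmd += ["-c", str(cores)]
    if use_muscle:
        cmd.append("-m")
    return cmd


# Placeholder directories, invented for the example.
example_cmd = step_two_three_command(
    "results", "data/sequences.fasta", 0.8, "results/hmm_results",
    "uniprot_fasta", "uniprot_db", "ncbi_db", domain="ALL")
print(shlex.join(example_cmd))
```

Building a list rather than a single string avoids quoting problems when a path contains spaces.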

File Name	Description
XML file	An .xml file containing the results of a rigid blocks run between two structures (PDB accession codes: 1OK8 and 1RER). The script should be directed to a folder containing all such comparisons.
FASTA file	A fasta file containing all the sequences in fasta format
HMM file	A previously computed HMM

Input files for the run described in section Step Two and Three Input: Specifications and File Types.

Step Two and Three Output: Specifications and File Types

File Name	Description
Plain text file	A .txt dump containing the list of accession codes obtained upon querying UniProtKB with the HMM.
Plain text file	A .txt dump containing the list of FILTERED accession codes obtained upon querying UniProtKB with the HMM.
Fasta file	A .fasta file containing the sequences used to perform the MSA.
Alignment	A .aln file containing the Clustal alignment used to build the HMM.
HMM	A file containing the built HMM.

Output files for the run described in section Step Two and Three Input: Specifications and File Types.

Programmer's Workflow

FunChaT: programmer's workflow

External dependencies

This package uses binaries from the following packages:

Muscle: a multiple sequence alignment program
Clustal-Omega: a multiple sequence alignment program
HMMER: a biosequence analysis tool using profile hidden Markov models
Phobius: a combined transmembrane topology and signal peptide predictor
