Alphafold_analysis

Authors: F. Cazals and E. Sarti

Alphafold analysis

This section presents the analysis described in [44].

The following is taken for granted in the sequel:

The environment variable $afdb points towards a directory containing the AlphaFold Protein Structure Database (AF-DB for short).

For a predicted structure, as is customary in the realm of AlphaFold, pLDDT values are stored in the B-factor column.
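
As an illustration of this convention, the following minimal sketch reads one pLDDT value per residue from the B-factor field of the CA atoms of a PDB file. The function names read_plddt and atom_line are hypothetical helpers, not part of the SBL; the column positions are those of the standard fixed-width PDB ATOM record.

```python
def atom_line(serial, name, resname, resseq, bfactor):
    """Build a fixed-width PDB ATOM record (dummy coordinates), for the demo only."""
    return "ATOM  %5d %-4s %3s A%4d    %8.3f%8.3f%8.3f%6.2f%6.2f" % (
        serial, name, resname, resseq, 0.0, 0.0, 0.0, 1.0, bfactor)


def read_plddt(pdb_lines):
    """Return one pLDDT value per residue, read from the B-factor field of CA atoms."""
    values = []
    for line in pdb_lines:
        # PDB fixed-width layout: atom name in columns 13-16, B-factor in columns 61-66.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            values.append(float(line[60:66]))
    return values


# Hypothetical records, for illustration only.
demo = [
    atom_line(1, " N", "MET", 1, 85.5),
    atom_line(2, " CA", "MET", 1, 85.5),
    atom_line(3, " CA", "ALA", 2, 42.25),
]
print(read_plddt(demo))
```

On a real file, one would pass the lines of the opened PDB file instead of the demo list.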

Overview

The script sbl-alphafold-dbrun.py provides the analysis described in the papers [44] and [173]; see also [43].

The main analysis methods offered are:

mode pLDDT: elementary statistics on the pLDDT values of all amino acids of all predicted structures in a directory. (NB: pLDDT values are stored in the B-factor column.)

mode filtrations: analysis based on the pLDDT and arity filtrations, for one structure or all structures in a directory.

mode null: analysis based on the pLDDT filtration for a random/null model as specified in the paper.

mode null-pvalue: permutation p-value calculation as specified in the paper, using the null model to specify H0.

Filtrations using the arity and the pLDDT: example on ordered, disordered, and mixed proteins. For a prediction (i.e. for every column), the calculation provides:
(i) Arity based filtration: evolution of the number of connected components in the filtration.
(ii) Arity based filtration: persistence diagram for connected components.
(iii) pLDDT based filtration: evolution of the number of connected components in the filtration.
(iv) pLDDT based filtration: persistence diagram for connected components.
See [44] and [173] for details.
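
To make the notion of arity more concrete, here is a minimal sketch. It assumes that the arity of a residue is the number of other CA atoms lying within a ball of radius r centered on its own CA atom, and that the r10 tag found in the output file names stands for r = 10 Angstrom; both points are assumptions, and the authoritative definition is the one given in [44]. The function name arities is hypothetical.

```python
import math


def arities(ca_coords, radius=10.0):
    """For each CA position, count the other CA positions within `radius` (Angstrom)."""
    counts = [0] * len(ca_coords)
    for i in range(len(ca_coords)):
        for j in range(i + 1, len(ca_coords)):
            if math.dist(ca_coords[i], ca_coords[j]) <= radius:
                counts[i] += 1
                counts[j] += 1
    return counts


# Toy example: five CA atoms on a straight line, 3.8 Angstrom apart.
line_coords = [(3.8 * i, 0.0, 0.0) for i in range(5)]
print(arities(line_coords))
```

The quadratic loop is enough for a single protein; a spatial index would be preferable for large-scale runs.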

Analysis: filtrations

Calculation on a single file. Invoke the script sbl-alphafold-dbrun.py on the file to be processed. The call generates a variety of analyses, dumped into the subdirectory perfile-gen-arity of the directory containing the model processed.

Example:

sbl-alphafold-dbrun.py -f $afdb/tests/human-AF-Q96MU5-F1-model_v4.pdb -a filtration
Dumped /user/fcazals/home/mols-archives/AFDB/tests/perfile-gen-arity/human-AF-Q96MU5-F1-model_v4-struct.png
Dumped /user/fcazals/home/mols-archives/AFDB/tests/perfile-gen-arity/human-AF-Q96MU5-F1-model_v4-arity-r10-x-pLDDT.png
Dumped /user/fcazals/home/mols-archives/AFDB/tests/perfile-gen-arity/human-AF-Q96MU5-F1-model_v4-UFcc–arity-r10.pdf
Dumped /user/fcazals/home/mols-archives/AFDB/tests/perfile-gen-arity/human-AF-Q96MU5-F1-model_v4-UFpd–arity.pdf
Dumped /user/fcazals/home/mols-archives/AFDB/tests/perfile-gen-arity/human-AF-Q96MU5-F1-model_v4-UFpd–pLDDT.pdf
Dumped /user/fcazals/home/mols-archives/AFDB/tests/perfile-gen-arity/human-AF-Q96MU5-F1-model_v4-UFcc–pLDDT.pdf
Dumped /user/fcazals/home/mols-archives/AFDB/tests/perfile-gen-arity/human-AF-Q96MU5-F1-model_v4-afstats-histo.png
Dumped /user/fcazals/home/mols-archives/AFDB/tests/perfile-gen-arity/human-AF-Q96MU5-F1-model_v4-afstats.csv

The files generated are as follows:

Filtration using the arity as parameter: files tagged –arity.
Filtration using the pLDDT as parameter: files tagged –pLDDT.
Statistics: the csv file (suffix afstats.csv) containing all statistics for the pLDDT and arity based filtrations (format detailed below).

Calculation for the null model. The null model associated with Union-Find can be invoked as follows:

sbl-alphafold-dbrun.py -a null --nm-N 1000

Calculation on a directory. Invoke the script sbl-alphafold-dbrun.py on the directory to be processed. The call generates a summary and scatter plots for the main statistics, and stores them into the subdirectory DIRNAME-gen-arity.

The call also generates the database DIRNAME-afstats.csv, see below.

It is also possible to generate all the analyses seen above for a single file using the option --dpm.

Example:

sbl-alphafold-dbrun.py -d $afdb/MJannaschii -a filtration
Starting // calculation for /user/fcazals/home/mols-archives/AFDB/MJannaschii
PDB files found
No gz file found
...
Dumped /user/fcazals/home/mols-archives/AFDB/MJannaschii/MJannaschii-gen-arity/MJannaschii-arity.csv
Dumped /user/fcazals/home/mols-archives/AFDB/MJannaschii/MJannaschii-gen-arity/MJannaschii-arity-r10-scatter.png
Dumped /user/fcazals/home/mols-archives/AFDB/MJannaschii/MJannaschii-gen-arity/MJannaschii-arity-hmul-r10.png
Dumped /user/fcazals/home/mols-archives/AFDB/MJannaschii/MJannaschii-gen-arity/MJannaschii-arity-analysis-r10.tex
Dumped /user/fcazals/home/mols-archives/AFDB/MJannaschii/MJannaschii-gen-arity/MJannaschii-afstats.csv
Dumped /user/fcazals/home/mols-archives/AFDB/MJannaschii/MJannaschii-gen-arity/MJannaschii-afstats-fcp-x-H-hmap.png
Dumped /user/fcazals/home/mols-archives/AFDB/MJannaschii/MJannaschii-gen-arity/MJannaschii-afstats-cpf-x-mp-hmap.png
Dumped /user/fcazals/home/mols-archives/AFDB/MJannaschii/MJannaschii-gen-arity/MJannaschii-afstats-mp-x-ent-hmap.png

Database. The previous calculation on a whole directory also generates DIRNAME-afstats.csv, which contains 19 columns as follows:

Columns 0:name,1: num_aa : model name and number of amino acids.

Columns 2:q25,3:ar25,4:cdf25,5:q75,6:ar75,7:cdf75 : a sequence of q triples, with q the number of quantiles used in the arity signature (here q = 2, for the quantiles 0.25 and 0.75). Each triple consists of (i) the quantile value, (ii) the arity value, and (iii) the value of the CDF at the arity value.

Columns 8:FPCP,9:meanP,10:persH : a triple consisting of the fraction of positive critical points, the mean persistence, and the normalized persistence entropy.

Columns 11:t0dot025,12:PLM0dot025,13:t0dot05,14:PLM0dot05,15:t0dot15,16:PLM0dot15 : a sequence of pairs consisting of the persistence threshold used and the corresponding number of persistent local maxima for the number of connected components of the pLDDT curve. In the example, three thresholds are used, namely 0.025, 0.05, and 0.15.

Columns 17:NCC_max_arity,18:NCC_max_pLDDT : the maximum number of connected components observed in the filtrations using the arity and the pLDDT, respectively.
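
For readers unfamiliar with the persistence statistics of columns 9 and 10, the sketch below computes a mean persistence and a normalized persistence entropy from a list of (birth, death) pairs. It uses the textbook definitions (Shannon entropy of the persistence values rescaled to sum to one, divided by the logarithm of the number of pairs); the exact normalization used by sbl-alphafold-dbrun.py is the one specified in [44] and may differ. The function name persistence_summary is hypothetical.

```python
import math


def persistence_summary(diagram):
    """Mean persistence and normalized persistence entropy of (birth, death) pairs."""
    pers = [abs(b - d) for b, d in diagram if b != d]
    if not pers:
        return 0.0, 0.0
    total = sum(pers)
    entropy = -sum((p / total) * math.log(p / total) for p in pers)
    # Normalize by log(n), the maximum entropy, reached when all pairs have equal persistence.
    normalized = entropy / math.log(len(pers)) if len(pers) > 1 else 0.0
    return total / len(pers), normalized


# Hypothetical diagram with two pairs of equal persistence.
mean_p, pers_h = persistence_summary([(80.0, 50.0), (70.0, 40.0)])
print(mean_p, pers_h)
```

With this convention, a diagram dominated by a single long-lived component has an entropy close to 0, while a diagram made of many components of similar persistence has an entropy close to 1.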

Here are the first few lines of MJannaschii-afstats.csv:

#0:name,1:num_aa,2:q25,3:ar25,4:cdf25,5:q75,6:ar75,7:cdf75,8:FPCP,9:meanP,10:persH,11:t0dot025,12:PLM0dot025,13:t0dot05,14:PLM0dot05,15:t0dot15,16:PLM0dot15,17:NCC_max_arity,18:NCC_max_pLDDT
name,num_aa,q25,ar25,cdf25,q75,ar75,cdf75,FPCP,meanP,persH,t0dot025,PLM0dot025,t0dot05,PLM0dot05,t0dot15,PLM0dot15,NCC_max_arity,NCC_max_pLDDT
AF-P0CL56-F1-model_v4,53,0.25,8,0.26,0.75,13,0.98,0.1698,0.3654,0.1399,0.025,5,0.05,2,0.15,1,-1,6
AF-P0CW38-F1-model_v4,57,0.25,11,0.32,0.75,16,0.75,0.1404,0.9643,0.1267,0.025,5,0.05,2,0.15,1,-1,6
AF-P54013-F1-model_v4,98,0.25,11,0.28,0.75,20,0.77,0.1735,0.4639,0.157,0.025,5,0.05,1,0.15,1,-1,9
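
Such a database can be loaded with the Python standard library. The sketch below is not part of the SBL: it assumes, from the excerpt above, that the first line (starting with a pound sign) only repeats the column indices and can be skipped, and that the second line is the actual header. The function name load_afstats is hypothetical.

```python
import csv
import io


def load_afstats(handle):
    """Read an afstats csv, skipping the '#'-prefixed line carrying the column indices."""
    rows = (line for line in handle if not line.startswith("#"))
    return list(csv.DictReader(rows))


# Lines copied from the MJannaschii excerpt.
SAMPLE = (
    "#0:name,1:num_aa,2:q25,3:ar25,4:cdf25,5:q75,6:ar75,7:cdf75,8:FPCP,9:meanP,10:persH,"
    "11:t0dot025,12:PLM0dot025,13:t0dot05,14:PLM0dot05,15:t0dot15,16:PLM0dot15,"
    "17:NCC_max_arity,18:NCC_max_pLDDT\n"
    "name,num_aa,q25,ar25,cdf25,q75,ar75,cdf75,FPCP,meanP,persH,t0dot025,PLM0dot025,"
    "t0dot05,PLM0dot05,t0dot15,PLM0dot15,NCC_max_arity,NCC_max_pLDDT\n"
    "AF-P0CL56-F1-model_v4,53,0.25,8,0.26,0.75,13,0.98,0.1698,0.3654,0.1399,0.025,5,0.05,2,0.15,1,-1,6\n"
    "AF-P0CW38-F1-model_v4,57,0.25,11,0.32,0.75,16,0.75,0.1404,0.9643,0.1267,0.025,5,0.05,2,0.15,1,-1,6\n"
)

rows = load_afstats(io.StringIO(SAMPLE))
for row in rows:
    # csv yields strings: convert explicitly where numbers are needed.
    print(row["name"], int(row["num_aa"]), float(row["persH"]))
```

On disk, the same function applies to an opened DIRNAME-afstats.csv file.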

Database queries using filters. A filter is a text file such that each line contains three pieces of information: the variable name, and the min and max values (vmin and vmax respectively). Lines starting with a pound sign (#) are ignored.

Each line is used to define a filter, which is then invoked to decide whether a given entry (protein) of the genome processed is selected or not.

#num_aa vmin:1125 vmax:1150
#NCC_max_pLDDT vmin:230 vmax:250
#persH vmin:0.05 vmax:0.1
#PLM0dot025 vmin:10
#fpcp
#persE
num_aa vmin:1560 vmax:1700
NCC_max_pLDDT vmin:335 vmax:360
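
The filter mechanism can be sketched as follows. This is an illustration under stated assumptions, not the parser used by the script: bounds are taken as inclusive, a missing vmin or vmax is treated as unbounded, and an entry is selected when it passes all the filters. The function names parse_filters and is_selected are hypothetical.

```python
def parse_filters(lines):
    """Turn filter lines 'name vmin:X vmax:Y' into (name, vmin, vmax) triples."""
    filters = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):   # blank lines and comments are ignored
            continue
        name, *bounds = line.split()
        vmin, vmax = float("-inf"), float("inf")
        for token in bounds:
            key, _, value = token.partition(":")
            if key == "vmin":
                vmin = float(value)
            elif key == "vmax":
                vmax = float(value)
        filters.append((name, vmin, vmax))
    return filters


def is_selected(row, filters):
    """True if the entry satisfies every filter (assumed conjunction, inclusive bounds)."""
    return all(vmin <= float(row[name]) <= vmax for name, vmin, vmax in filters)


filters = parse_filters([
    "#num_aa vmin:1125 vmax:1150",
    "num_aa vmin:1560 vmax:1700",
    "NCC_max_pLDDT vmin:335 vmax:360",
])
print(is_selected({"num_aa": "1600", "NCC_max_pLDDT": "340"}, filters))
```

Rows are expected as dictionaries keyed by column name, for instance as produced by csv.DictReader on the afstats database.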

Analysis: pLDDT statistics

The pLDDT analysis aims at providing pLDDT statistics on all the AlphaFold reconstructions in a genome, plotting the distribution of all pLDDT values, of mean values (one mean value per protein), and median values (one median value per protein).

Calculation on a directory. Invoke the script sbl-alphafold-dbrun.py on the directory to be processed. The call generates a summary and histograms for the main statistics, and stores them into the subdirectory DIRNAME-gen-pLDDT.

sbl-alphafold-dbrun.py -d $afdb/MJannaschii -a plddt
=== Processing /user/fcazals/home/mols-archives/AFDB/MJannaschii
PDB files found
No gz file found
Starting // calculation
...
Calculation done
All pLDDT values for MJannaschii 1773 497291
Dumped /user/fcazals/home/mols-archives/AFDB/MJannaschii/MJannaschii-gen-pLDDT/MJannaschii-pLDDT-all-histogram.png
Dumped /user/fcazals/home/mols-archives/AFDB/MJannaschii/MJannaschii-gen-pLDDT/MJannaschii-pLDDT-median-histogram.png
Dumped /user/fcazals/home/mols-archives/AFDB/MJannaschii/MJannaschii-gen-pLDDT/MJannaschii-pLDDT-mean-histogram.png
Written /user/fcazals/home/mols-archives/AFDB/MJannaschii/MJannaschii-gen-pLDDT/MJannaschii-pLDDT-analysis.tex
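
The three distributions behind these histograms (all values, one mean per protein, one median per protein) can be sketched in a few lines. The function name plddt_summary and the input dictionary are hypothetical; the script itself may organize the computation differently.

```python
from statistics import mean, median


def plddt_summary(plddt_by_protein):
    """Return all pLDDT values pooled, plus the per-protein means and medians."""
    all_values = [v for values in plddt_by_protein.values() for v in values]
    means = {name: mean(values) for name, values in plddt_by_protein.items()}
    medians = {name: median(values) for name, values in plddt_by_protein.items()}
    return all_values, means, medians


# Hypothetical per-residue pLDDT values for two proteins.
toy = {"protA": [90.0, 80.0, 70.0], "protB": [50.0, 30.0]}
all_values, means, medians = plddt_summary(toy)
print(len(all_values), means, medians)
```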

Analysis: ABSTRAQT confidence score

Despite its outstanding accuracy, AlphaFold2 still presents a bias towards certain types of local conformations. Notably, it has the tendency to create very long and unphysically bent alpha helices, and unlikely and unstable "beta-helix" structures. Moreover, our analysis on AlphaFoldDB hints at errors in the prediction of intrinsically disordered regions.

Frequently, these errors are not reflected by low pLDDT values. ABSTRAQT is an SVM-powered, arity-based scoring function conceived to detect unphysical local arrangements and unlikely intrinsically disordered regions. Its global and per-residue values are automatically calculated together with the pLDDT statistics (option -a plddt).

Importantly, even when a protein has a high per-residue score, the global score can be lower: indeed, incorrect assemblies of perfectly normal local arrangements can occur.

Implementation notes

The module SBL::Union_find provides two classes. The class SBL::Union_find::Union_find_DS implements the usual Union-find algorithm, and also builds the persistence diagram for connected components. The class SBL::Union_find::Union_find_wrapper wraps the latter, taking as input a list of pairs (index, associated value), with value used to build the filtration.
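
The principle can be illustrated by the following minimal sketch, which is not the SBL implementation. It makes two assumptions that should be checked against [44]: elements are inserted by decreasing value, and two elements are adjacent when their indices differ by one (neighbors along the sequence). The function name filtration_components is hypothetical. The component surviving until the end of the filtration has infinite persistence and is not reported.

```python
def filtration_components(pairs):
    """Insert (index, value) pairs by decreasing value; return the component
    count after each insertion and the finite persistence pairs (birth, death)."""
    parent = {}   # union-find forest over the indices inserted so far
    birth = {}    # root index -> value at which its component appeared

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    ncc, curve, diagram = 0, [], []
    for idx, val in sorted(pairs, key=lambda p: p[1], reverse=True):
        parent[idx], birth[idx] = idx, val
        ncc += 1
        for nb in (idx - 1, idx + 1):      # assumed adjacency: sequence neighbors
            if nb not in parent:
                continue
            ra, rb = find(idx), find(nb)
            if ra == rb:
                continue
            # Elder rule: the component that appeared first survives the merge.
            elder, younger = (ra, rb) if birth[ra] >= birth[rb] else (rb, ra)
            if birth[younger] > val:       # skip zero-persistence pairs
                diagram.append((birth[younger], val))
            parent[younger] = elder
            ncc -= 1
        curve.append((val, ncc))
    return curve, diagram


# Hypothetical pLDDT values for five consecutive residues.
curve, diagram = filtration_components(list(enumerate([90, 50, 80, 40, 70])))
print(curve)
print(diagram)
```

The sketch omits union by rank for brevity, since the root must remain the elder component for the birth bookkeeping to stay simple.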

The module SBL::AF_null_model provides the class SBL::AF_null_model::AF_null_model implementing the null model using a Union_find_wrapper, and the associated permutation p-value.
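
For reference, a generic one-sided permutation p-value can be estimated as sketched below. The null model used here (random shuffling of the values) and the choice of statistic are placeholders for illustration only; the actual null model and statistic are those specified in [44]. The function names null_statistics and permutation_pvalue are hypothetical.

```python
import random


def null_statistics(values, statistic, n_draws, seed=0):
    """Evaluate `statistic` on n_draws random shuffles of `values` (placeholder null model)."""
    rng = random.Random(seed)
    shuffled = list(values)
    draws = []
    for _ in range(n_draws):
        rng.shuffle(shuffled)
        draws.append(statistic(shuffled))
    return draws


def permutation_pvalue(observed, null_values):
    """One-sided p-value with the usual +1 correction, so that it is never exactly 0."""
    n_extreme = sum(1 for s in null_values if s >= observed)
    return (1 + n_extreme) / (1 + len(null_values))


print(permutation_pvalue(3, [1, 2, 3, 4]))
```

The number of draws plays the role of the sample size of the null distribution: the smallest p-value that can be reported is 1 / (1 + number of draws).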

The module SBL::AF_filtrations provides the following classes: SBL::AF_filtrations::AF_filtrations_finder uses Union-find to study the filtrations; SBL::AF_filtrations::AF_filtrations_directory runs the previous on a directory; SBL::AF_filtrations::AF_filtrations_parallel handles the parallel mode for files in a directory.

Visualization, Plugins, GUIs

The SBL provides VMD, PyMOL, and Web plugins for sbl-alphafold-dbrun.py. Launch the VMD plugin from the SBL catalog under the Extensions menu, start the PyMOL plugin by running sbl_alphafold_dbrun in the PyMOL command line, and open the Web plugin from the SBL web plugins launcher page (served locally via sbl-web-plugins-launcher). All three plugins expose the same GUI and capabilities.

sbl-alphafold-dbrun plugin user interface.
The user interface for the sbl-alphafold-dbrun plugin. The typical workflow is: Step 1: Provide the input file. Step 2: Specify the parameters for the analysis. Step 3: Choose an output directory (optional, a default is provided). Step 4: Launch the computation. Step 5: Review the results in the GUI.