Genetrank

Authors: A. Sales-de-Queiroz and G. Sales Santa Cruz and A. Jean-Marie and D. Mazauric and J. Roux and F. Cazals

Goals

Single cell RNA sequencing. Single-cell RNA sequencing (scRNAseq) consists of (i) dissociating the cells of a tissue, (ii) isolating individual cells, (iii) extracting and amplifying their mRNA, and (iv) counting the number of transcripts on a per-gene basis.

Ideally, scRNAseq allows one to bridge the gap between expression profiles and single-cell phenotypes. This endeavor is, however, especially challenging for two main reasons: first, the cells processed may occupy a wide variety of cell states; second, the low mRNA counts on a per-cell basis are such that a number of genes may be missed, the so-called drop-out phenomenon.

A classical analysis performed on scRNAseq data is the identification of differentially expressed (DE) genes. A number of techniques have been developed for this task, see [176].

$\text{\genetrank}$. A complementary type of analysis aims to understand which DE genes interact with a specific molecular pathway, e.g. that of apoptosis. More specifically, consider the following triple:

A set of genes $X$, possibly produced by a single-cell differential expression analysis.

A second set of genes $P$ involved in a specific pathway accounting for a phenotype.

A protein-protein interaction network (PPIN) containing the genes from $X$ and $P$ as nodes.

In this work, we address the problem of prioritizing the genes in $X$ given the genes in $P$, so as to single out those genes of $X$ having a higher likelihood of regulating the pathway of interest.

The tool developed to do so, called $\text{\genetrank}$, is based on the theory of random walks on graphs and Markov chains.

Prerequisites

In the following, we provide an intuitive presentation of $\text{\genetrank}$ , and refer the reader/user to [65] for the details.

Model and directions X to P and P to X

We consider a PPIN whose vertices are the individual molecules, and whose edges represent pairwise interactions. Such a network is modeled by a vertex-weighted, edge-weighted directed graph. The vertex set of the graph is denoted $V = \{ v_1, \dots, v_n\}$. (NB: the weights are used to define Markov models, see below.)

To present the formalism, the two sets of nodes are denoted $S$ (sources) and $T$ (targets), which allows us to consider two instantiations:

$\stdirxp$ : $(S=X, T=P)$: focus on paths starting at nodes in $X$ and ending in $P$.
$\stdirpx$ : $(S=P, T=X)$: focus on paths starting at nodes in $P$ and ending in $X$.

Random walks with restarts and Markov chains

Consider:

A set of source vertices $S\subset V$ ,
A subset $S'\subseteq S$ ,
A target set $T = \{t_{1}, \ldots, t_{k}\} \subset V$ , with $T\cap S = \emptyset$ .

We assume that the reader/user is acquainted with the following concepts, see e.g. [25] :

Random walk (RW) on a graph: a sequence of vertices generated by iteratively visiting a neighbor of the current vertex. The next vertex is chosen at random using a probability distribution on the neighbors of the current vertex.

RW with restart (RWR): a RW such that, once in a while, one restarts from a node of a specified set $S'$. When $\size{S'}=1$ , one restarts from a single vertex. The restart rate is denoted $r$.

Absorbing node: a node which is never exited once reached.

Random walks on graphs are best studied using the framework of Markov chains. Recall that a Markov chain is a stochastic process visiting states. In our case, the states are the vertices of the graph, and the transition probabilities are encoded in a transition matrix, whose non-null entries correspond to the edges of the graph. In the context of gene ranking, the rationale for using RWR is as follows:

RW: a strategy to exploit all paths joining any two nodes, rather than the shortest path.

RWR: a strategy to avoid getting lost too far away in the PPIN, by returning to nodes of interest once in a while. This strategy also allows setting a soft limit on the length of the walk.

Absorbing node/state: a state in $T$. The absorbing states are the endpoints of the RWR.
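The three ingredients above can be combined into a single transition matrix. The following is a minimal sketch in plain Python, under the assumption that a restart occurs at each step with probability $r$ (uniformly over the restart set) and that the walk otherwise moves to a neighbor with probability proportional to the arc weight. The function name `rwr_transition_matrix` and the dictionary-based encoding of the graph are illustrative choices and are not part of the package; the matrices actually used are described in [65].

```python
def rwr_transition_matrix(arcs, restart_set, targets, r):
    """Row-stochastic matrix of a random walk with restart (illustrative sketch).

    arcs        : dict {u: {v: weight}} of positive arc weights; every vertex is a key
    restart_set : vertices S' to which the walk may restart
    targets     : vertices T, made absorbing
    r           : restart rate, 0 <= r < 1
    Assumes every non-target vertex has at least one outgoing arc.
    """
    vertices = sorted(arcs)
    index = {v: i for i, v in enumerate(vertices)}
    n = len(vertices)
    matrix = [[0.0] * n for _ in range(n)]
    for u in vertices:
        row = matrix[index[u]]
        if u in targets:
            row[index[u]] = 1.0  # absorbing state: never exited once reached
            continue
        total = sum(arcs[u].values())
        for v, w in arcs[u].items():
            row[index[v]] += (1.0 - r) * w / total  # ordinary step to a neighbor
        for s in restart_set:
            row[index[s]] += r / len(restart_set)   # restart step
    return vertices, matrix


# Toy path a - b - c, with c absorbing and restarts to a.
arcs = {"a": {"b": 1.0}, "b": {"a": 1.0, "c": 1.0}, "c": {"b": 1.0}}
vertices, matrix = rwr_transition_matrix(arcs, restart_set=["a"], targets={"c"}, r=0.2)
for v, row in zip(vertices, matrix):
    print(v, [round(p, 3) for p in row])
```

Each row sums to one, and the row of the target `c` only contains the self-loop, which is what makes it absorbing.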

Hit scores and symmetry

Hit vectors and hit scores. After a number of steps of the random walk:

(State distribution) Consider an initial distribution uniform on the set $S'$. Given any $r \in [0,1)$ and any $S' \subseteq S$ , the state distribution at each step $i \geq 0$ is denoted

$\begin{equation} \grwsd[M^{r}_{S'}]^{i} = (\grwsd[M^{r}_{S'}]^{i}(v_{1}),\dots,\grwsd[M^{r}_{S'}]^{i}(v_{n}) ), \end{equation}$

with $\grwsd[M^{r}_{S'}]^{0}(u) = 0$ for every $u \notin S'$ and $\grwsd[M^{r}_{S'}]^{0}(u) = \frac{1}{|S'|}$ for every $u \in S'$ .

Under suitable hypotheses, the previous quantities admit a limit when $i \rightarrow \infty$ . Dropping the superscript in $\pi^i$ , and focusing on the target set $T = \{t_{1}, \ldots, t_{k}\}$ , we define:

(hit probability vector) Given any $r \in [0,1)$ and any $S' \subseteq S$ , the hit vector $\grwsd[M^{r}_{S'}] = (\grwsd[M^{r}_{S'}](t_{1}),\dots,\grwsd[M^{r}_{S'}](t_{k}))$ is composed of the hitting probabilities for the states of $T$.

Note that in the previous definition, $M^{r}_{S'}$ stands for the transition matrix of the Markov chain with restart rate $r$ and restart set $S'$. There are two cases of particular interest:

The construction can be used with $r = 0$, i.e. no restart.

Restart to a single vertex, i.e. $S' = \{s\}$. In that case, we plainly use the notation $M^{r}_{s}$ .

The construction can also be used without any absorbing state. In that case, the random walk proceeds until it reaches its so-called stationary distribution, which exists under suitable hypotheses. The stationary probability of a particular node $t\in V$ of the graph is denoted $\grwsd[M]{t}$ . This is a measure of the centrality of node $t$: intuitively, nodes which are easily accessible from many others will have a large stationary probability.
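Both limits can be illustrated by iterating the state distribution, i.e. by repeatedly multiplying a row vector by the transition matrix. The sketch below is plain Python on two hand-written toy matrices; the function name `limit_distribution` is hypothetical, and the package itself delegates these computations to the Marmote library rather than to such a loop.

```python
def limit_distribution(matrix, initial, tol=1e-12, max_iter=100000):
    """Iterate pi <- pi * matrix until two successive distributions differ by < tol."""
    pi = list(initial)
    n = len(pi)
    for _ in range(max_iter):
        nxt = [sum(pi[i] * matrix[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(pi, nxt)) < tol:
            return nxt
        pi = nxt
    raise RuntimeError("no convergence; the chain may be periodic")


# Chain t1 - a - b - t2 with absorbing ends t1 and t2, and no restart (r = 0).
# State order: t1, a, b, t2; the walk starts in a.
absorbing = [
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 0.0, 1.0],
]
hit = limit_distribution(absorbing, [0.0, 1.0, 0.0, 0.0])
print(hit[0], hit[3])  # hit probabilities of t1 and t2, close to 2/3 and 1/3

# Triangle without absorbing state: the limit is the stationary distribution.
triangle = [[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]]
stationary = limit_distribution(triangle, [1.0, 0.0, 0.0])
print(stationary)      # close to the uniform distribution
```

In the first example all the probability mass ends up on the two absorbing states, the closer one receiving the larger share; in the second, the three vertices play symmetric roles and are therefore equally central.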

Finally, we define the hit score, based on the hit probability and on the centrality of a node measured by $\grwsd[M]{t}$:

(hit score) Given $r \in [0,1]$ , define the hit score from $s \in S$ to $t \in T$ as

$\begin{equation} \scoredir{s}{t} = \grwsd[M^{r}_s]{t} / \grwsd[M]{t}. \end{equation}$

The hit score vector associated with each source $s \in S$ is $(\scoredir{s}{t_1}, \dots, \scoredir{s}{t_k})$ . The log score $\logscoredir{s}{t}$ is the natural logarithm of $\scoredir{s}{t}$ .
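As a small numerical illustration of this normalization, the sketch below divides hit probabilities by stationary probabilities and takes the natural logarithm. The values are made up for the example, and the helper names `hit_scores` and `log_scores` are hypothetical.

```python
import math


def hit_scores(hit_probability, stationary_probability):
    """Divide each hit probability by the stationary probability of the same target."""
    return {t: hit_probability[t] / stationary_probability[t] for t in hit_probability}


def log_scores(scores):
    """Natural logarithm of each (positive) hit score."""
    return {t: math.log(s) for t, s in scores.items()}


# Made-up values for one source s and two targets.
hit_probability = {"t1": 0.02, "t2": 0.001}
stationary_probability = {"t1": 0.01, "t2": 0.002}

scores = hit_scores(hit_probability, stationary_probability)
logs = log_scores(scores)
print(scores)
print(logs)
```

One way to read the result: a score above 1 (positive log score) indicates a target that is hit from the source more often than its centrality alone would suggest, while a score below 1 indicates the opposite.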

Instances (PPIN, $X$, $P$). Using a different PPIN, a different experimental gene set $X$ or a different pathway gene set $P$ will clearly affect the score $\scoredir{x}{p}$ for a given pair $(x, p)$. To make this explicit, we define an instance of execution as the triplet $I = (\text{PPIN}, X, P)$, and we refer to the score obtained under this instance as $\scoredirI{I}{x}{p}$ . However, for the sake of conciseness, we simply denote this score $\scoredir{x}{p}$ .

Symmetry. To analyse the (lack of) symmetry between paths joining $X$ to $P$ and vice versa, we apply the previous model to two settings:

$\stdirxp : (S=X, T=P)$
$\stdirpx : (S=P, T=X)$

Saturation indices and hits

Using the hit scores in the two directions $\stdir{X}{P}$ versus $\stdir{P}{X}$ , we now define a ranking on the genes of $X$:

(Average score) Let $1 \leq \tau \leq \size{P}$ be an integer. Consider a fixed value of the restart rate $r$. For a source $x \in X$, let the average score be the arithmetic mean over the top $\tau$ values $\max( \log \scoredir{x}{p}, \log \scoredir{p}{x})$ observed for $p\in P$ . The gene network ranking ( $\text{\genetrank}$ ) of genes in $X$ is the ranking associated with the aforementioned average values. The set of top $k$ genes of the ranking is denoted $\topkatr{r}$ .

Note that when $\tau=1$ , the ranking of a gene in $X$ is determined by its largest max score. Intuitively, averaging scores over $\tau$ targets makes sense since our analysis aims at identifying a pathway.
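The average score and the resulting ranking can be sketched as follows in plain Python. The data layout (a dictionary indexed by pairs $(x, p)$ holding the two directional scores), the function names and the alphabetical tie-breaking rule are assumptions made for this illustration only; all scores are assumed to be positive so that the logarithm is defined.

```python
import math


def average_scores(pair_scores, tau):
    """Mean of the top tau values max(log s_xp, log s_px) for each source x.

    pair_scores : dict {(x, p): (score_xp, score_px)} with positive scores
    """
    per_source = {}
    for (x, p), (s_xp, s_px) in pair_scores.items():
        best = max(math.log(s_xp), math.log(s_px))
        per_source.setdefault(x, []).append(best)
    averages = {}
    for x, values in per_source.items():
        top = sorted(values, reverse=True)[:tau]
        averages[x] = sum(top) / len(top)
    return averages


def top_k(averages, k):
    """The k sources with the largest average score (ties broken by name)."""
    ranking = sorted(averages, key=lambda x: (-averages[x], x))
    return ranking[:k]


e = math.exp
pair_scores = {
    ("x1", "p1"): (1.0, e(2.0)),       # max log score 2
    ("x1", "p2"): (e(1.0), 1.0),       # max log score 1
    ("x2", "p1"): (e(-1.0), e(-1.0)),  # max log score -1
    ("x2", "p2"): (e(3.0), 1.0),       # max log score 3
}
print(top_k(average_scores(pair_scores, tau=1), k=1))
print(top_k(average_scores(pair_scores, tau=2), k=1))
```

With $\tau=1$ the source `x2` comes first thanks to its single strong target, whereas with $\tau=2$ the source `x1`, whose two targets both score well, takes the lead.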

To assess the stability of this ranking, we proceed as follows. Consider a set of values $R= \{ r_1,\dots, r_N\}$ , sorted by increasing or decreasing value. We define the set of genes found in $\topkatr{r}$ up to a given value $r_l$, $1 \leq l \leq N$ , by

$\begin{equation} \topkler{r_l} = \cup_{j=1,\dots,l} \topkatr{r_j}. \end{equation}$

We now use this set to qualify the speed at which we discover the sources in $X$ when increasing the upper bound on the restart rate:

(Saturation indices for an increasing sequence of values of $r$.) The saturation index at threshold $r_l$ is the fraction of sources present in $\topkler{r_l}$ , that is:

$\begin{equation} \satindexler{r_l} = \frac{\size{\topkler{r_l} }}{\size{X}} (\leq 1). \end{equation}$

The relative saturation index is the latter normalized by the value of $k$ used:

$\begin{equation} \satindexkler{r_l} = \frac{\satindexler{r_l}} {k}. \end{equation}$

In the absence of overlap between consecutive $\topkatr{r_l}$ , one would have $\size{\topkler{r_l}} = k \times l$ . Thus, normalizing by $k$ provides a measure of the overlap between consecutive sets.
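The saturation indices can be computed by accumulating the top-$k$ sets along the sequence of restart rates. The following plain Python sketch assumes that these sets are already available, one per value $r_1, \dots, r_N$ and in that order; the function name `saturation_indices` and the gene names are hypothetical.

```python
def saturation_indices(top_sets, n_sources, k):
    """Saturation and relative saturation indices for successive restart rates.

    top_sets  : list of top-k gene sets, one per restart rate r_1, ..., r_N
    n_sources : number of sources |X|
    k         : number of top-ranked genes retained for each restart rate
    """
    seen = set()
    result = []
    for top in top_sets:
        seen |= set(top)                     # union of the top-k sets so far
        saturation = len(seen) / n_sources   # fraction of sources discovered
        result.append((saturation, saturation / k))
    return result


# Three restart rates, k = 2, ten sources; consecutive sets share one gene.
top_sets = [{"g1", "g2"}, {"g2", "g3"}, {"g3", "g4"}]
indices = saturation_indices(top_sets, n_sources=10, k=2)
for l, (saturation, relative) in enumerate(indices, start=1):
    print(l, saturation, relative)
```

Because each new set shares one gene with the previous one, the union grows by a single gene per step instead of two, which the indices reflect.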

We note in passing that the previous sets can be used to define how many hits in a given reference list of genes $\calL$ are obtained:

(Hits) Consider a reference list of genes $\calL$ . The number of hits for particular values of $k$ and $r$ is the size of the set $\topkatr{r} \cap \calL$ . For a fixed $k$, we similarly define the size of the set $\topkler{r}\cap \calL$ .

Graphical representations

Expression levels and fold-changes for genes in $X$, MA plots. Two important quantities in transcriptomics are the expression level (mRNA count), and the expression changes, respectively denoted $\logCount$ and $\logXChange$ on a logarithmic scale. The associated scatter plot is often called an MA plot.

Score radar plots. The difficulty in working with the values $\logCount(x), \logXChange(x)$ for $x\in X$ is that all pairs $(x, p)$ involving a given gene $x$ get mapped onto the same point. To get around this difficulty, we associate a radar plot with each point $x\in X$ , yielding an overall score radar scatter plot.
Each gene score radar plot is defined as follows:

the background of the gene radar plot is colored using a heat map indexed on the largest ( $\stdir{X}{P}$ or $\stdir{P}{X}$ ) log score observed for that gene. This background color makes it easy to spot the individual radar plots with high scores.

the gene radar plot has one spoke for each of the top scores retained (their number is user defined).

on each spoke, two values are found, namely the scores $\logscoredir{x}{p}$ and $\logscoredir{p}{x}$ .

finally, the radar plot title is set to the gene name accompanied by the interval of scores (log scale).

Score radar scatter plot. Displaying all individual score radar plots in the $\logCount(x), \logXChange(x)$ plane yields the so-called Score radar scatter plot.

Formats for genes and networks

NB: While the Markov chain and score calculations are identifier-type agnostic, the generation of gene radars and gene radar scatters assumes that the identifiers used are protein identifiers, and could fail if another type of identifier is used.

Genes versus proteins. In the sequel, we manipulate genes and proteins. The following formats are used:

genes: format is UNIPROT_GN, see gene_name
proteins: format is UNIPROTSWISSPROT, see protein_Names and Uniprot

Protein-Protein Interaction Network (PPIN). The PPIN file should be a tab-separated file of interactions, where each interaction consists of two protein identifiers and a weight. This weight is morally the interaction probability, so that a weight of 0 amounts to removing the edge.

NB: These interactions are bidirectional, i.e. they give rise to arcs in both directions, whence two non-null entries in the Markov chain transition matrix.

P04637  O60551  1.000000
P04637  P0CG48  1.000000
P04637  P46821  1.000000
P04637  O96017  1.000000
P04637  Q9BTK6  1.000000
P04637  Q8N488  1.000000
P04637  Q00987  1.000000
P04637  Q96PM5  1.000000
P04637  P18146  1.000000
P04637  P49841  1.000000
P04637  P32780  1.000000
P04637  Q92793  1.000000
P04637  P61289  1.000000
P04637  Q96GM8  1.000000
P04637  Q9NRR4  1.000000
P04637  P17844  1.000000
P04637  P35232  1.000000
P04637  Q8N726  1.000000
P04637  Q9UFV9  1.000000
P04637  Q9H1Z9  1.000000

In the previous file, each line also contains a weight, here set to one. The framework proposed in [65] indeed accommodates such weights.
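To make the file format concrete, here is a minimal sketch, in plain Python, of a reader turning such lines into a symmetric weighted adjacency structure. The function name `parse_ppin` is hypothetical and the package uses its own C++ reader; fields are split on any whitespace so that the sketch accepts both tab-separated files and the listing above, and interactions with weight 0 are skipped, in line with the remark that such a weight amounts to removing the edge.

```python
def parse_ppin(lines):
    """Build {protein: {neighbor: weight}} from lines 'idA idB weight' (illustrative)."""
    adjacency = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue                      # skip blank or malformed lines
        a, b, weight = fields[0], fields[1], float(fields[2])
        if weight == 0.0:
            continue                      # a null weight means no edge
        adjacency.setdefault(a, {})[b] = weight   # arc a -> b
        adjacency.setdefault(b, {})[a] = weight   # arc b -> a (bidirectional)
    return adjacency


lines = [
    "P04637\tO60551\t1.000000",
    "P04637\tP0CG48\t0.500000",   # weight modified for the example
    "P04637\tP46821\t0.000000",   # weight set to 0 for the example: edge dropped
]
adjacency = parse_ppin(lines)
print(adjacency["P04637"])
print(adjacency["O60551"])
```

Each interaction kept in the file yields two arcs, one in each direction.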

Gene set $X$ and target set $P$. The Sources and Targets files are lists of protein identifiers, one per line.

Example: 5 sources used [65]

Example: 51 targets used [65]

Q13618-1
Q07812
O00220-1
P25445
Q8WXG6-2
Q13158-1
O15392
Q13794
Q6FH21
Q13546
Q14790-1
O43521
Q07817
P55957
Q16611
Q8WXG6-3
Q8WZ73
O95831
P21580
O00198
O14727
P19438
P20333
Q8IX12
P55210
Q9UBN6
Q13323
O43464
P42574
P10415
P08574
Q13489
D3DV04
Q9NR28
P62877
Q15628
Q9UMX3
O14763-1
P98170
Q92843
P42575
Q92851-1
O15519-2
Q96LC9
Q13490
O14798
P55211
Q86W13
Q9BXH1
O15519-1
Q9NZS9

Markov models and hit scores.

We compute hit scores (Def. def-hit-score) using the C++ $\text{\marmote}$ library ([101] and Marmote).

For a given restart rate, we output a csv file containing the scores $\scoredir{x}{p}$ and $\scoredir{p}{x}$ for all pairs in $X \times P$ .

Example: results file for the whole MINT PPIN, obtained for – see [65]

Source (Protein)  Source (Gene)  Target (Protein)  Target (Gene)  Score_XP              Score_PX
Q969V5            MUL1           O43464            HTRA2          0.012238046596964237  2.2028917505855263
Q969V5            MUL1           Q9NR28            DIABLO         0.04255622863789793   6.505642343997667
Q02750            MAP2K1         O15519            CFLAR          0.02985460723965407   3.7143493702392227
P30626            SRI            Q92843            BCL2L2         0.07138915687956796   8.635773422413603
Q15631            TSN            O15519            CFLAR          0.04624289350994468   5.018111120967504
P35241            RDX            O43464            HTRA2          0.04477524544407649   4.803318296617468
Q53HI1            UNC50          Q92843            BCL2L2         0.024774205491692684  2.037103996551807
Q969V5            MUL1           Q15628            TRADD          0.06463971180069862   5.1916622637194525
P35241            RDX            Q9BXH1            BBC3           0.0377271121549757    2.599979561745427
Q9Y4W6            AFG3L2         P55211            CASP9          0.1258449394365558    7.929203013775988
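Such a results file can be loaded with the standard `csv` module. The sketch below assumes a header line with the six column names shown above and a tab as field delimiter; the delimiter actually used by the package is an assumption here and may need to be adapted, e.g. to a comma. The function name `read_pair_scores` is hypothetical, and the two data rows are copied from the listing above.

```python
import csv
import io


def read_pair_scores(handle, delimiter="\t"):
    """Map (source gene, target gene) to the pair (Score_XP, Score_PX)."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    scores = {}
    for row in reader:
        key = (row["Source (Gene)"], row["Target (Gene)"])
        scores[key] = (float(row["Score_XP"]), float(row["Score_PX"]))
    return scores


# In-memory stand-in for a results file (assumed tab-separated).
sample = (
    "Source (Protein)\tSource (Gene)\tTarget (Protein)\tTarget (Gene)\tScore_XP\tScore_PX\n"
    "Q969V5\tMUL1\tO43464\tHTRA2\t0.012238046596964237\t2.2028917505855263\n"
    "P30626\tSRI\tQ92843\tBCL2L2\t0.07138915687956796\t8.635773422413603\n"
)
scores = read_pair_scores(io.StringIO(sample))

# Pair with the largest score, taking the better of the two directions.
best_pair = max(scores, key=lambda pair: max(scores[pair]))
print(best_pair, scores[best_pair])
```

For a file on disk, the `io.StringIO` object would simply be replaced by an open file handle.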

Reranked gene lists. Our final product is the set of top genes defined by $\text{\genetrank}$ (Def. def-genetrank), possibly accumulated over several values of the restart rate (Def. eq-topkler).

These sets depend on three parameters:

the restart rate $r$,
the number of targets $\tau$ used to compute the average score of a gene in $X$,
the number $k$ of top-ranked genes retained.

We output such lists in plain txt files.

Radar plots.

We assume an MA plot file, containing triples (gene id, logFC, logCPM).

The radar plots are obtained using this file and the reranked gene list file.

Example radar plot.

The radar scatter plot combining the individual radar plots.

Using Genetrank

The package provides two executables compiled from C++ programs, and three Python scripts. We now briefly describe these programs, and refer the user to the Jupyter notebook for example calls.

C++ based executables.

$\text{\sblgenetrankmatrix}$ : program computing the transition matrices for the Markov chains representing the random walks (see [65] ).

$\text{\sblgenetrankproba}$ : program computing the hit probability vectors (Def. def-hit-proba-vector) from the transition matrices.

Python scripts.

$\text{\sblgenetrankgene}$ : script converting a list of gene ids into a list of protein ids, or vice versa.

$\text{\sblgenetrankrun}$ : script running the C++ based executables, and subsequently producing the file of hit scores, as well as the radar plots.

$\text{\sblgenetranksaturation}$ : script exploiting the csv files containing all hit scores to compute the $\text{\genetrank}$ and the reranked gene lists.

Dependencies and Installation

As noted above, Markov models rely on the C++ $\text{\marmote}$ library [101], see Marmote.

To use this package, proceed as follows:

Clone the SBL

Compile the executables $\text{\sblgenetrankmatrix}$ and $\text{\sblgenetrankproba}$

Use these executables and the python scripts as indicated in the Jupyter notebook

Jupyter demo

See the following jupyter notebook:

Jupyter notebook file
PPIN_analysis

PPIN_source_target analysis
In [1]:
import os
import subprocess
import shutil

from SBL import SBL_pytools
from SBL_pytools import SBL_pytools as sblpyt
from SBL import SBL_Genetrank_application
from SBL_Genetrank_application import Genetrank_application
from SBL import SBL_Genetrank_gene_to_protein
from SBL_Genetrank_gene_to_protein import *
Step 0 / Pre-processing:

Convert the genes into targets i.e. protein identifiers
In [2]:
# The input file lists one protein identifier per line
genes = [line.rstrip() for line in open("data/proteins-targets-apoptosis-49-entries.in").readlines()]
print(genes)

# Translate the protein identifiers into gene names
translator = Genetrank_GeneProteinTranslator('hsapiens', direction='protein2gene')
translations = translator.translate(genes)
print(translations)
translations.summary()
['O43464', 'P19438', 'Q07817', 'O43521', 'D3DV04', 'P20333', 'Q13489', 'Q6FH21', 'P55211', 'O95831', 'Q13794', 'Q15628', 'Q8IX12', 'P98170', 'P55957', 'Q8WZ73', 'Q86W13', 'O14763', 'Q13618', 'Q07812', 'P62877', 'Q13546', 'Q9UMX3', 'P55210', 'O14798', 'Q9BXH1', 'O15392', 'Q13158', 'Q9NZS9', 'P21580', 'Q92843', 'O15519', 'P08574', 'P42575', 'P10415', 'O00198', 'O00220', 'Q92851', 'Q8WXG6', 'Q16611', 'Q13323', 'Q96LC9', 'Q13490', 'Q9NR28', 'P42574', 'Q14790', 'Q9UBN6', 'P25445', 'O14727']
Succesful Translations:
O43464: HTRA2
P19438: TNFRSF1A
Q07817: BCL2L1
O43521: BCL2L11
P20333: TNFRSF1B
Q13489: BIRC3
P55211: CASP9
O95831: AIFM1
Q13794: PMAIP1
Q15628: TRADD
Q8IX12: CCAR1
P98170: XIAP
P55957: BID
Q8WZ73: RFFL
O14763: TNFRSF10B
Q13618: CUL3
Q07812: BAX
P62877: RBX1
Q13546: RIPK1
Q9UMX3: BOK
P55210: CASP7
O14798: TNFRSF10C
Q9BXH1: BBC3
O15392: BIRC5
Q13158: FADD
Q9NZS9: BFAR
P21580: TNFAIP3
Q92843: BCL2L2
O15519: CFLAR
P08574: CYC1
P42575: CASP2
P10415: BCL2
O00198: HRK
O00220: TNFRSF10A
Q92851: CASP10
Q8WXG6: MADD
Q16611: BAK1
Q13323: BIK
Q96LC9: BMF
Q13490: BIRC2
Q9NR28: DIABLO
P42574: CASP3
Q14790: CASP8
P25445: FAS
O14727: APAF1
Ambiguous Translations:
Unable to Translate:
D3DV04
Q6FH21
Q86W13
Q9UBN6
Of 49 identifiers:
45 successfully translated
0 ambiguously translated
4 unable to be translated
A Step-by-step analysis
In [3]:
from SBL import SBL_Genetrank_application
from SBL_Genetrank_application import Genetrank_application
from SBL import SBL_pytools
from SBL_pytools import *

# Output directory of the analysis (created if missing)
odir = "test5s"
if os.path.exists(odir):
    pass  # os.system(("rm -rf %s" % odir))
os.system(("mkdir -p %s" % odir))

# Input files: PPIN, sources, targets, and the pathways directory
ppin_file_path = 'data/MINT-human-august2020.in'
sources_file_path = 'data/proteins-sources-5-entries.in'
targets_file_path = 'data/proteins-targets-apoptosis-49-entries.in'
pathways_dir_path = '/tmp/pathways'
# os.mkdir raises an error if the directory already exists, e.g. when the cell is re-run
os.makedirs(pathways_dir_path, exist_ok=True)

app = Genetrank_application(odir)
app.add_ppin_file(ppin_file_path)
app.add_sources_file(sources_file_path)
app.add_targets_file(targets_file_path)
app.add_pathways_dir(pathways_dir_path)

# Two restart rates are explored
app.add_restart_probability(0.01)
app.add_restart_probability(0.3)

app.generate_inputs()
app.instantiate_simulations()
In [4]:
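def odir_tree(highlight_pattern='^'):
    # List the output directory with the `tree` command, highlighting the
    # lines matching highlight_pattern (requires `tree` and `grep`)
    cmd = "tree %s | grep --color=always -e '^' -e '%s'" % (odir, highlight_pattern)
    tree = ''.join(os.popen(cmd).readlines())
    return tree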
Step 1: generating the Markov chains
Uses the executable sbl-genetrank-Markov-models.exe to generate the Markov chain .mcl transition matrix files to be used by MARMOTE. There is one such file per source, named mcr_<internal_source_id>.mcl. The file markov_chain.mcl contains the transition matrix of the Markov chain used for normalization.

The files source_idx.txt, target_idx.txt, and map_protein_name_idx.txt are also produced; they are used to cross-reference the identifiers used internally by the executable with the protein identifiers.
In [5]:
app.generate_markov_chains()
print(odir_tree('*.mcl'))
Creating directory for r=0.01 (cmd=mkdir -p test5s/r0.01)
Creating directory for r=0.30 (cmd=mkdir -p test5s/r0.30)
Generating Markov Chains (one for each source)...
PPIN = MINT-human-august2020 SOURCES = proteins-sources-5-entries TARGETS = proteins-targets-apoptosis-49-entries r = 0.01
Running cmd=/home/asq/Dev/inria/projects/sbl/Applications/Genetrank/src/build/sbl-genetrank-Markov-models.exe -g data/MINT-human-august2020.in -p /tmp/pathways -s data/proteins-sources-5-entries.in -t data/proteins-targets-apoptosis-49-entries.in -r 0.01 -o test5s/r0.01
Generating Markov Chains (one for each source)...
PPIN = MINT-human-august2020 SOURCES = proteins-sources-5-entries TARGETS = proteins-targets-apoptosis-49-entries r = 0.30
Running cmd=/home/asq/Dev/inria/projects/sbl/Applications/Genetrank/src/build/sbl-genetrank-Markov-models.exe -g data/MINT-human-august2020.in -p /tmp/pathways -s data/proteins-sources-5-entries.in -t data/proteins-targets-apoptosis-49-entries.in -r 0.30 -o test5s/r0.30
...Markov Chains generated. (they can be found at test5s/r0.30/MINT-human-august2020_proteins-sources-5-entries_proteins-targets-apoptosis-49-entries_r0.30)
...Markov Chains generated. (they can be found at test5s/r0.01/MINT-human-august2020_proteins-sources-5-entries_proteins-targets-apoptosis-49-entries_r0.01)
Creating directory for r=0.01 (cmd=mkdir -p test5s/r0.01)
Creating directory for r=0.30 (cmd=mkdir -p test5s/r0.30)
Generating Markov Chains (one for each source)...
Generating Markov Chains (one for each source)...
PPIN = MINT-human-august2020 SOURCES = proteins-targets-apoptosis-49-entries TARGETS = proteins-sources-5-entries r = 0.01
PPIN = MINT-human-august2020 SOURCES = proteins-targets-apoptosis-49-entries TARGETS = proteins-sources-5-entries r = 0.30
Running cmd=/home/asq/Dev/inria/projects/sbl/Applications/Genetrank/src/build/sbl-genetrank-Markov-models.exe -g data/MINT-human-august2020.in -p /tmp/pathways -s data/proteins-targets-apoptosis-49-entries.in -t data/proteins-sources-5-entries.in -r 0.01 -o test5s/r0.01
Running cmd=/home/asq/Dev/inria/projects/sbl/Applications/Genetrank/src/build/sbl-genetrank-Markov-models.exe -g data/MINT-human-august2020.in -p /tmp/pathways -s data/proteins-targets-apoptosis-49-entries.in -t data/proteins-sources-5-entries.in -r 0.30 -o test5s/r0.30
...Markov Chains generated. (they can be found at test5s/r0.30/MINT-human-august2020_proteins-targets-apoptosis-49-entries_proteins-sources-5-entries_r0.30)
...Markov Chains generated. (they can be found at test5s/r0.01/MINT-human-august2020_proteins-targets-apoptosis-49-entries_proteins-sources-5-entries_r0.01)
test5s
├── figures
│   └── r0.30
├── r0.01
│   ├── MINT-human-august2020_proteins-sources-5-entries_proteins-targets-apoptosis-49-entries_r0.01
│   │   ├── map_protein_name_idx.txt
│   │   ├── markov_chain.mcl
│   │   ├── mcr_2437.mcl
│   │   ├── mcr_3187.mcl
│   │   ├── mcr_3480.mcl
│   │   ├── mcr_420.mcl
│   │   ├── mcr_679.mcl
│   │   ├── source_idx.txt
│   │   ├── source_names.txt
│   │   └── target_idx.txt
│   └── MINT-human-august2020_proteins-targets-apoptosis-49-entries_proteins-sources-5-entries_r0.01
│       ├── map_protein_name_idx.txt
│       ├── markov_chain.mcl
│       ├── mcr_10406.mcl
│       ├── mcr_10715.mcl
│       ├── mcr_1560.mcl
│       ├── mcr_2123.mcl
│       ├── mcr_2292.mcl
│       ├── mcr_2743.mcl
│       ├── mcr_2777.mcl
│       ├── mcr_2823.mcl
│       ├── mcr_2995.mcl
│       ├── mcr_3584.mcl
│       ├── mcr_3585.mcl
│       ├── mcr_385.mcl
│       ├── mcr_4131.mcl
│       ├── mcr_4132.mcl
│       ├── mcr_4158.mcl
│       ├── mcr_4407.mcl
│       ├── mcr_4686.mcl
│       ├── mcr_4930.mcl
│       ├── mcr_4932.mcl
│       ├── mcr_5157.mcl
│       ├── mcr_5201.mcl
│       ├── mcr_5251.mcl
│       ├── mcr_5252.mcl
│       ├── mcr_5266.mcl
│       ├── mcr_5296.mcl
│       ├── mcr_5326.mcl
│       ├── mcr_5499.mcl
│       ├── mcr_551.mcl
│       ├── mcr_560.mcl
│       ├── mcr_5668.mcl
│       ├── mcr_5770.mcl
│       ├── mcr_699.mcl
│       ├── mcr_722.mcl
│       ├── mcr_7333.mcl
│       ├── mcr_8055.mcl
│       ├── mcr_8223.mcl
│       ├── mcr_8226.mcl
│       ├── mcr_858.mcl
│       ├── mcr_868.mcl
│       ├── mcr_9210.mcl
│       ├── mcr_9945.mcl
│       ├── source_idx.txt
│       ├── source_names.txt
│       └── target_idx.txt
└── r0.30
    ├── MINT-human-august2020_proteins-sources-5-entries_proteins-targets-apoptosis-49-entries_r0.30
    │   ├── map_protein_name_idx.txt
    │   ├── markov_chain.mcl
    │   ├── mcr_2437.mcl
    │   ├── mcr_3187.mcl
    │   ├── mcr_3480.mcl
    │   ├── mcr_420.mcl
    │   ├── mcr_679.mcl
    │   ├── source_idx.txt
    │   ├── source_names.txt
    │   └── target_idx.txt
    └── MINT-human-august2020_proteins-targets-apoptosis-49-entries_proteins-sources-5-entries_r0.30
        ├── map_protein_name_idx.txt
        ├── markov_chain.mcl
        ├── mcr_10406.mcl
        ├── mcr_10715.mcl
        ├── mcr_1560.mcl
        ├── mcr_2123.mcl
        ├── mcr_2292.mcl
        ├── mcr_2743.mcl
        ├── mcr_2777.mcl
        ├── mcr_2823.mcl
        ├── mcr_2995.mcl
        ├── mcr_3584.mcl
        ├── mcr_3585.mcl
        ├── mcr_385.mcl
        ├── mcr_4131.mcl
        ├── mcr_4132.mcl
        ├── mcr_4158.mcl
        ├── mcr_4407.mcl
        ├── mcr_4686.mcl
        ├── mcr_4930.mcl
        ├── mcr_4932.mcl
        ├── mcr_5157.mcl
        ├── mcr_5201.mcl
        ├── mcr_5251.mcl
        ├── mcr_5252.mcl
        ├── mcr_5266.mcl
        ├── mcr_5296.mcl
        ├── mcr_5326.mcl
        ├── mcr_5499.mcl
        ├── mcr_551.mcl
        ├── mcr_560.mcl
        ├── mcr_5668.mcl
        ├── mcr_5770.mcl
        ├── mcr_699.mcl
        ├── mcr_722.mcl
        ├── mcr_7333.mcl
        ├── mcr_8055.mcl
        ├── mcr_8223.mcl
        ├── mcr_8226.mcl
        ├── mcr_858.mcl
        ├── mcr_868.mcl
        ├── mcr_9210.mcl
        ├── mcr_9945.mcl
        ├── source_idx.txt
        ├── source_names.txt
        └── target_idx.txt

8 directories, 112 files
Step 2: computing hit probabilities
Uses the executable sbl-genetrank-hit-probabilities.exe to calculate the pairwise hit probabilities, which are written to hit-vectors.txt, and the centrality used for normalization, which is written to normalization-distribution.txt.
In [6]:
app.calculate_hit_probabilities()
print(odir_tree())
Calculating Hit Probability Vectors...
Calculating Hit Probability Vectors...
PPIN = MINT-human-august2020 SOURCES = proteins-sources-5-entries TARGETS = proteins-targets-apoptosis-49-entries r = 0.30
PPIN = MINT-human-august2020 SOURCES = proteins-sources-5-entries TARGETS = proteins-targets-apoptosis-49-entries r = 0.01
Running cmd=/home/asq/Dev/inria/projects/sbl/Applications/Genetrank/src/build/sbl-genetrank-hit-probabilities.exe -i test5s/r0.30/MINT-human-august2020_proteins-sources-5-entries_proteins-targets-apoptosis-49-entries_r0.30 -o test5s/r0.30/MINT-human-august2020_proteins-sources-5-entries_proteins-targets-apoptosis-49-entries_r0.30 | grep Status
Running cmd=/home/asq/Dev/inria/projects/sbl/Applications/Genetrank/src/build/sbl-genetrank-hit-probabilities.exe -i test5s/r0.01/MINT-human-august2020_proteins-sources-5-entries_proteins-targets-apoptosis-49-entries_r0.01 -o test5s/r0.01/MINT-human-august2020_proteins-sources-5-entries_proteins-targets-apoptosis-49-entries_r0.01 | grep Status
...Hit Probabilities Calculated.
   Source  Target  Hit Probability
0  O00429  O00220  0.002966
1  O15304  O00220  0.003361
2  P13196  O00220  0.003144
3  P30050  O00220  0.003933
4  P38919  O00220  0.003893
...Hit Probabilities Calculated.
   Source  Target  Hit Probability
0  O00429  O00220  0.001218
1  O15304  O00220  0.003362
2  P13196  O00220  0.001755
3  P30050  O00220  0.011149
4  P38919  O00220  0.015557
Calculating Hit Probability Vectors...
PPIN = MINT-human-august2020 SOURCES = proteins-targets-apoptosis-49-entries TARGETS = proteins-sources-5-entries r = 0.01
Running cmd=/home/asq/Dev/inria/projects/sbl/Applications/Genetrank/src/build/sbl-genetrank-hit-probabilities.exe -i test5s/r0.01/MINT-human-august2020_proteins-targets-apoptosis-49-entries_proteins-sources-5-entries_r0.01 -o test5s/r0.01/MINT-human-august2020_proteins-targets-apoptosis-49-entries_proteins-sources-5-entries_r0.01 | grep Status
Calculating Hit Probability Vectors...
PPIN = MINT-human-august2020 SOURCES = proteins-targets-apoptosis-49-entries TARGETS = proteins-sources-5-entries r = 0.30
Running cmd=/home/asq/Dev/inria/projects/sbl/Applications/Genetrank/src/build/sbl-genetrank-hit-probabilities.exe -i test5s/r0.30/MINT-human-august2020_proteins-targets-apoptosis-49-entries_proteins-sources-5-entries_r0.30 -o test5s/r0.30/MINT-human-august2020_proteins-targets-apoptosis-49-entries_proteins-sources-5-entries_r0.30 | grep Status
...Hit Probabilities Calculated.
   Source  Target  Hit Probability
0  O00429  O00220  0.130106
1  O15304  O00220  0.020702
2  P13196  O00220  0.234095
3  P30050  O00220  0.293829
4  P38919  O00220  0.320440
...Hit Probabilities Calculated.
   Source  Target  Hit Probability
0  O00429  O00220  0.078315
1  O15304  O00220  0.005530
2  P13196  O00220  0.076769
3  P30050  O00220  0.419885
4  P38919  O00220  0.417971
test5s
├── figures
│   └── r0.30
├── r0.01
│   ├── MINT-human-august2020_proteins-sources-5-entries_proteins-targets-apoptosis-49-entries_r0.01
│   │   ├── hit-vectors.txt
│   │   ├── map_protein_name_idx.txt
│   │   ├── markov_chain.mcl
│   │   ├── mcr_2437.mcl
│   │   ├── mcr_3187.mcl
│   │   ├── mcr_3480.mcl
│   │   ├── mcr_420.mcl
│   │   ├── mcr_679.mcl
│   │   ├── normalization-distribution.txt
│   │   ├── source_idx.txt
│   │   ├── source_names.txt
│   │   └── target_idx.txt
│   └── MINT-human-august2020_proteins-targets-apoptosis-49-entries_proteins-sources-5-entries_r0.01
│       ├── hit-vectors.txt
│       ├── map_protein_name_idx.txt
│       ├── markov_chain.mcl
│       ├── mcr_10406.mcl
│       ├── mcr_10715.mcl
│       ├── mcr_1560.mcl
│       ├── mcr_2123.mcl
│       ├── mcr_2292.mcl
│       ├── mcr_2743.mcl
│       ├── mcr_2777.mcl
│       ├── mcr_2823.mcl
│       ├── mcr_2995.mcl
│       ├── mcr_3584.mcl
│       ├── mcr_3585.mcl
│       ├── mcr_385.mcl
│       ├── mcr_4131.mcl
│       ├── mcr_4132.mcl
│       ├── mcr_4158.mcl
│       ├── mcr_4407.mcl
│       ├── mcr_4686.mcl
│       ├── mcr_4930.mcl
│       ├── mcr_4932.mcl
│       ├── mcr_5157.mcl
│       ├── mcr_5201.mcl
│       ├── mcr_5251.mcl
│       ├── mcr_5252.mcl
│       ├── mcr_5266.mcl
│       ├── mcr_5296.mcl
│       ├── mcr_5326.mcl
│       ├── mcr_5499.mcl
│       ├── mcr_551.mcl
│       ├── mcr_560.mcl
│       ├── mcr_5668.mcl
│       ├── mcr_5770.mcl
│       ├── mcr_699.mcl
│       ├── mcr_722.mcl
│       ├── mcr_7333.mcl
│       ├── mcr_8055.mcl
│       ├── mcr_8223.mcl
│       ├── mcr_8226.mcl
│       ├── mcr_858.mcl
│       ├── mcr_868.mcl
│       ├── mcr_9210.mcl
│       ├── mcr_9945.mcl
│       ├── normalization-distribution.txt
│       ├── source_idx.txt
│       ├── source_names.txt
│       └── target_idx.txt
└── r0.30
    ├── MINT-human-august2020_proteins-sources-5-entries_proteins-targets-apoptosis-49-entries_r0.30
    │   ├── hit-vectors.txt
    │   ├── map_protein_name_idx.txt
    │   ├── markov_chain.mcl
    │   ├── mcr_2437.mcl
    │   ├── mcr_3187.mcl
    │   ├── mcr_3480.mcl
    │   ├── mcr_420.mcl
    │   ├── mcr_679.mcl
    │   ├── normalization-distribution.txt
    │   ├── source_idx.txt
    │   ├── source_names.txt
    │   └── target_idx.txt
    └── MINT-human-august2020_proteins-targets-apoptosis-49-entries_proteins-sources-5-entries_r0.30
        ├── hit-vectors.txt
        ├── map_protein_name_idx.txt
        ├── markov_chain.mcl
        ├── mcr_10406.mcl
        ├── mcr_10715.mcl
        ├── mcr_1560.mcl
        ├── mcr_2123.mcl
        ├── mcr_2292.mcl
        ├── mcr_2743.mcl
        ├── mcr_2777.mcl
        ├── mcr_2823.mcl
        ├── mcr_2995.mcl
        ├── mcr_3584.mcl
        ├── mcr_3585.mcl
        ├── mcr_385.mcl
        ├── mcr_4131.mcl
        ├── mcr_4132.mcl
        ├── mcr_4158.mcl
        ├── mcr_4407.mcl
        ├── mcr_4686.mcl
        ├── mcr_4930.mcl
        ├── mcr_4932.mcl
        ├── mcr_5157.mcl
        ├── mcr_5201.mcl
        ├── mcr_5251.mcl
        ├── mcr_5252.mcl
        ├── mcr_5266.mcl
        ├── mcr_5296.mcl
        ├── mcr_5326.mcl
        ├── mcr_5499.mcl
        ├── mcr_551.mcl
        ├── mcr_560.mcl
        ├── mcr_5668.mcl
        ├── mcr_5770.mcl
        ├── mcr_699.mcl
        ├── mcr_722.mcl
        ├── mcr_7333.mcl
        ├── mcr_8055.mcl
        ├── mcr_8223.mcl
        ├── mcr_8226.mcl
        ├── mcr_858.mcl
        ├── mcr_868.mcl
        ├── mcr_9210.mcl
        ├── mcr_9945.mcl
        ├── normalization-distribution.txt
        ├── source_idx.txt
        ├── source_names.txt
        └── target_idx.txt

8 directories, 120 files
Step 3: computing scores
Scores are calculated by the Genetrank_simulation_result class within SBL_Genetrank_simulation.py, using the results from the previous step. Pairwise scores are written to <sources_file_name>_<targets_file_name>-pairwise_scores.csv.
In [7]:
app.calculate_scores()
print(odir_tree())
test5s
├── figures
│   └── r0.30
├── r0.01
│   ├── MINT-human-august2020_proteins-sources-5-entries_proteins-targets-apoptosis-49-entries_r0.01
│   │   ├── hit-vectors.txt
│   │   ├── map_protein_name_idx.txt
│   │   ├── markov_chain.mcl
│   │   ├── mcr_2437.mcl
│   │   ├── mcr_3187.mcl
│   │   ├── mcr_3480.mcl
│   │   ├── mcr_420.mcl
│   │   ├── mcr_679.mcl
│   │   ├── normalization-distribution.txt
│   │   ├── source_idx.txt
│   │   ├── source_names.txt
│   │   └── target_idx.txt
│   ├── MINT-human-august2020_proteins-targets-apoptosis-49-entries_proteins-sources-5-entries_r0.01
│   │   ├── hit-vectors.txt
│   │   ├── map_protein_name_idx.txt
│   │   ├── markov_chain.mcl
│   │   ├── mcr_10406.mcl
│   │   ├── mcr_10715.mcl
│   │   ├── mcr_1560.mcl
│   │   ├── mcr_2123.mcl
│   │   ├── mcr_2292.mcl
│   │   ├── mcr_2743.mcl
│   │   ├── mcr_2777.mcl
│   │   ├── mcr_2823.mcl
│   │   ├── mcr_2995.mcl
│   │   ├── mcr_3584.mcl
│   │   ├── mcr_3585.mcl
│   │   ├── mcr_385.mcl
│   │   ├── mcr_4131.mcl
│   │   ├── mcr_4132.mcl
│   │   ├── mcr_4158.mcl
│   │   ├── mcr_4407.mcl
│   │   ├── mcr_4686.mcl
│   │   ├── mcr_4930.mcl
│   │   ├── mcr_4932.mcl
│   │   ├── mcr_5157.mcl
│   │   ├── mcr_5201.mcl
│   │   ├── mcr_5251.mcl
│   │   ├── mcr_5252.mcl
│   │   ├── mcr_5266.mcl
│   │   ├── mcr_5296.mcl
│   │   ├── mcr_5326.mcl
│   │   ├── mcr_5499.mcl
│   │   ├── mcr_551.mcl
│   │   ├── mcr_560.mcl
│   │   ├── mcr_5668.mcl
│   │   ├── mcr_5770.mcl
│   │   ├── mcr_699.mcl
│   │   ├── mcr_722.mcl
│   │   ├── mcr_7333.mcl
│   │   ├── mcr_8055.mcl
│   │   ├── mcr_8223.mcl
│   │   ├── mcr_8226.mcl
│   │   ├── mcr_858.mcl
│   │   ├── mcr_868.mcl
│   │   ├── mcr_9210.mcl
│   │   ├── mcr_9945.mcl
│   │   ├── normalization-distribution.txt
│   │   ├── source_idx.txt
│   │   ├── source_names.txt
│   │   └── target_idx.txt
│   └── proteins-sources-5-entries_proteins-targets-apoptosis-49-entries-pairwise_scores.csv
└── r0.30
    ├── MINT-human-august2020_proteins-sources-5-entries_proteins-targets-apoptosis-49-entries_r0.30
    │   ├── hit-vectors.txt
    │   ├── map_protein_name_idx.txt
    │   ├── markov_chain.mcl
    │   ├── mcr_2437.mcl
    │   ├── mcr_3187.mcl
    │   ├── mcr_3480.mcl
    │   ├── mcr_420.mcl
    │   ├── mcr_679.mcl
    │   ├── normalization-distribution.txt
    │   ├── source_idx.txt
    │   ├── source_names.txt
    │   └── target_idx.txt
    ├── MINT-human-august2020_proteins-targets-apoptosis-49-entries_proteins-sources-5-entries_r0.30
    │   ├── hit-vectors.txt
    │   ├── map_protein_name_idx.txt
    │   ├── markov_chain.mcl
    │   ├── mcr_10406.mcl
    │   ├── mcr_10715.mcl
    │   ├── mcr_1560.mcl
    │   ├── mcr_2123.mcl
    │   ├── mcr_2292.mcl
    │   ├── mcr_2743.mcl
    │   ├── mcr_2777.mcl
    │   ├── mcr_2823.mcl
    │   ├── mcr_2995.mcl
    │   ├── mcr_3584.mcl
    │   ├── mcr_3585.mcl
    │   ├── mcr_385.mcl
    │   ├── mcr_4131.mcl
    │   ├── mcr_4132.mcl
    │   ├── mcr_4158.mcl
    │   ├── mcr_4407.mcl
    │   ├── mcr_4686.mcl
    │   ├── mcr_4930.mcl
    │   ├── mcr_4932.mcl
    │   ├── mcr_5157.mcl
    │   ├── mcr_5201.mcl
    │   ├── mcr_5251.mcl
    │   ├── mcr_5252.mcl
    │   ├── mcr_5266.mcl
    │   ├── mcr_5296.mcl
    │   ├── mcr_5326.mcl
    │   ├── mcr_5499.mcl
    │   ├── mcr_551.mcl
    │   ├── mcr_560.mcl
    │   ├── mcr_5668.mcl
    │   ├── mcr_5770.mcl
    │   ├── mcr_699.mcl
    │   ├── mcr_722.mcl
    │   ├── mcr_7333.mcl
    │   ├── mcr_8055.mcl
    │   ├── mcr_8223.mcl
    │   ├── mcr_8226.mcl
    │   ├── mcr_858.mcl
    │   ├── mcr_868.mcl
    │   ├── mcr_9210.mcl
    │   ├── mcr_9945.mcl
    │   ├── normalization-distribution.txt
    │   ├── source_idx.txt
    │   ├── source_names.txt
    │   └── target_idx.txt
    └── proteins-sources-5-entries_proteins-targets-apoptosis-49-entries-pairwise_scores.csv

8 directories, 122 files
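The output layout shown in this listing can be navigated programmatically. The following is a minimal sketch, not part of the Genetrank API: it assumes only what the listing shows, namely that each run lives in a directory named r followed by a decimal value (r0.01, r0.30), and that each such directory holds one file whose name ends in -pairwise_scores.csv. The helper names find_score_files and peek_first_row are hypothetical, and no assumption is made about the columns of the CSV file.

```python
import csv
import re
from pathlib import Path

# Assumption: run directories are named "r<value>" (e.g. r0.01, r0.30),
# as in the directory listing of this demo.
RATE_DIR = re.compile(r"^r(\d+\.\d+)$")


def find_score_files(odir):
    """Map the value encoded in each r<value> directory to its score CSV."""
    scores = {}
    for sub in sorted(Path(odir).iterdir()):
        match = RATE_DIR.match(sub.name)
        if not (sub.is_dir() and match):
            continue  # skips e.g. the figures/ directory
        csv_files = sorted(sub.glob("*-pairwise_scores.csv"))
        if csv_files:
            scores[float(match.group(1))] = csv_files[0]
    return scores


def peek_first_row(csv_path):
    """Return the first row of a CSV file, or [] if the file is empty."""
    with open(csv_path, newline="") as handle:
        return next(csv.reader(handle), [])
```

Calling find_score_files('test5s') on the directory of this demo would be expected to return two entries, one for r0.01 and one for r0.30; peek_first_row can then be used to inspect the file before choosing how to parse it.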
Step 4: visualization with radar (scatter) plots

1. Individual Radar Plots
In [8]:
from SBL import SBL_Genetrank_simulation_statistics
from SBL_Genetrank_simulation_statistics import Genetrank_simulation_visualizations
# Retrieve the simulation run with the value 0.3 (cf. the r0.30 output directory)
r03_simulation = app.get_simulation(ppin_file_path, sources_file_path, targets_file_path, pathways_dir_path, 0.3)
r03_simulation_visualizations = Genetrank_simulation_visualizations(r03_simulation)
# Draw the individual radar plot of the scores for gene SIVA1
r03_simulation_visualizations.generate_gene_score_radar('SIVA1', show=True)
2. Radar Scatter Plots
In [12]:
app.logFC_logCPM_file = 'data/CellFate-LogCPM-LogFC-pval-FDR-mmc4.xls'
app.generate_gene_score_radar_scatter_plots()
print(odir_tree())
findfont: Font family ['normal'] not found. Falling back to DejaVu Sans.
test5s
├── figures
│   ├── r0.01
│   │   └── score_radar_scatter.pdf
│   └── r0.30
│       └── score_radar_scatter.pdf
├── r0.01
│   ├── MINT-human-august2020_proteins-sources-5-entries_proteins-targets-apoptosis-49-entries_r0.01
│   │   ├── hit-vectors.txt
│   │   ├── map_protein_name_idx.txt
│   │   ├── markov_chain.mcl
│   │   ├── mcr_2437.mcl
│   │   ├── mcr_3187.mcl
│   │   ├── mcr_3480.mcl
│   │   ├── mcr_420.mcl
│   │   ├── mcr_679.mcl
│   │   ├── normalization-distribution.txt
│   │   ├── source_idx.txt
│   │   ├── source_names.txt
│   │   └── target_idx.txt
│   ├── MINT-human-august2020_proteins-targets-apoptosis-49-entries_proteins-sources-5-entries_r0.01
│   │   ├── hit-vectors.txt
│   │   ├── map_protein_name_idx.txt
│   │   ├── markov_chain.mcl
│   │   ├── mcr_10406.mcl
│   │   ├── mcr_10715.mcl
│   │   ├── mcr_1560.mcl
│   │   ├── mcr_2123.mcl
│   │   ├── mcr_2292.mcl
│   │   ├── mcr_2743.mcl
│   │   ├── mcr_2777.mcl
│   │   ├── mcr_2823.mcl
│   │   ├── mcr_2995.mcl
│   │   ├── mcr_3584.mcl
│   │   ├── mcr_3585.mcl
│   │   ├── mcr_385.mcl
│   │   ├── mcr_4131.mcl
│   │   ├── mcr_4132.mcl
│   │   ├── mcr_4158.mcl
│   │   ├── mcr_4407.mcl
│   │   ├── mcr_4686.mcl
│   │   ├── mcr_4930.mcl
│   │   ├── mcr_4932.mcl
│   │   ├── mcr_5157.mcl
│   │   ├── mcr_5201.mcl
│   │   ├── mcr_5251.mcl
│   │   ├── mcr_5252.mcl
│   │   ├── mcr_5266.mcl
│   │   ├── mcr_5296.mcl
│   │   ├── mcr_5326.mcl
│   │   ├── mcr_5499.mcl
│   │   ├── mcr_551.mcl
│   │   ├── mcr_560.mcl
│   │   ├── mcr_5668.mcl
│   │   ├── mcr_5770.mcl
│   │   ├── mcr_699.mcl
│   │   ├── mcr_722.mcl
│   │   ├── mcr_7333.mcl
│   │   ├── mcr_8055.mcl
│   │   ├── mcr_8223.mcl
│   │   ├── mcr_8226.mcl
│   │   ├── mcr_858.mcl
│   │   ├── mcr_868.mcl
│   │   ├── mcr_9210.mcl
│   │   ├── mcr_9945.mcl
│   │   ├── normalization-distribution.txt
│   │   ├── source_idx.txt
│   │   ├── source_names.txt
│   │   └── target_idx.txt
│   └── proteins-sources-5-entries_proteins-targets-apoptosis-49-entries-pairwise_scores.csv
└── r0.30
    ├── MINT-human-august2020_proteins-sources-5-entries_proteins-targets-apoptosis-49-entries_r0.30
    │   ├── hit-vectors.txt
    │   ├── map_protein_name_idx.txt
    │   ├── markov_chain.mcl
    │   ├── mcr_2437.mcl
    │   ├── mcr_3187.mcl
    │   ├── mcr_3480.mcl
    │   ├── mcr_420.mcl
    │   ├── mcr_679.mcl
    │   ├── normalization-distribution.txt
    │   ├── source_idx.txt
    │   ├── source_names.txt
    │   └── target_idx.txt
    ├── MINT-human-august2020_proteins-targets-apoptosis-49-entries_proteins-sources-5-entries_r0.30
    │   ├── hit-vectors.txt
    │   ├── map_protein_name_idx.txt
    │   ├── markov_chain.mcl
    │   ├── mcr_10406.mcl
    │   ├── mcr_10715.mcl
    │   ├── mcr_1560.mcl
    │   ├── mcr_2123.mcl
    │   ├── mcr_2292.mcl
    │   ├── mcr_2743.mcl
    │   ├── mcr_2777.mcl
    │   ├── mcr_2823.mcl
    │   ├── mcr_2995.mcl
    │   ├── mcr_3584.mcl
    │   ├── mcr_3585.mcl
    │   ├── mcr_385.mcl
    │   ├── mcr_4131.mcl
    │   ├── mcr_4132.mcl
    │   ├── mcr_4158.mcl
    │   ├── mcr_4407.mcl
    │   ├── mcr_4686.mcl
    │   ├── mcr_4930.mcl
    │   ├── mcr_4932.mcl
    │   ├── mcr_5157.mcl
    │   ├── mcr_5201.mcl
    │   ├── mcr_5251.mcl
    │   ├── mcr_5252.mcl
    │   ├── mcr_5266.mcl
    │   ├── mcr_5296.mcl
    │   ├── mcr_5326.mcl
    │   ├── mcr_5499.mcl
    │   ├── mcr_551.mcl
    │   ├── mcr_560.mcl
    │   ├── mcr_5668.mcl
    │   ├── mcr_5770.mcl
    │   ├── mcr_699.mcl
    │   ├── mcr_722.mcl
    │   ├── mcr_7333.mcl
    │   ├── mcr_8055.mcl
    │   ├── mcr_8223.mcl
    │   ├── mcr_8226.mcl
    │   ├── mcr_858.mcl
    │   ├── mcr_868.mcl
    │   ├── mcr_9210.mcl
    │   ├── mcr_9945.mcl
    │   ├── normalization-distribution.txt
    │   ├── source_idx.txt
    │   ├── source_names.txt
    │   └── target_idx.txt
    └── proteins-sources-5-entries_proteins-targets-apoptosis-49-entries-pairwise_scores.csv

9 directories, 124 files
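After this step, each run directory is expected to have a matching plot under figures. The sketch below is one possible way to check this. It is not part of Genetrank, and it assumes only the layout visible in the listing: runs are stored in directories named r followed by a decimal value, and the corresponding plot is figures/r<value>/score_radar_scatter.pdf. The function name rates_missing_figure is hypothetical.

```python
import re
from pathlib import Path

# Assumptions taken from the directory listing of this demo: results live in
# <odir>/r<value>/ and the plot in <odir>/figures/r<value>/score_radar_scatter.pdf.
FIGURE_NAME = "score_radar_scatter.pdf"
RUN_DIR = re.compile(r"r\d+\.\d+")


def rates_missing_figure(odir):
    """Return the names of run directories that have no radar scatter plot."""
    odir = Path(odir)
    missing = []
    for sub in sorted(odir.iterdir()):
        if not (sub.is_dir() and RUN_DIR.fullmatch(sub.name)):
            continue  # ignore figures/ and any unrelated entry
        if not (odir / "figures" / sub.name / FIGURE_NAME).is_file():
            missing.append(sub.name)
    return missing
```

On the directory listed here, rates_missing_figure('test5s') would be expected to return an empty list, since both figures/r0.01 and figures/r0.30 contain score_radar_scatter.pdf.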

License for this package

Unless stated otherwise, packages of the SBL are distributed under the SBL license.

The Genetrank package, however, is distributed under the Apache 2.0 License.
