Structural Bioinformatics Library
Template C++ / Python API for developping structural bioinformatics applications.
User Manual

Genetrank

Authors: A. Sales-de-Queiroz and G. Sales Santa Cruz and A. Jean-Marie and D. Mazauric and J. Roux and F. Cazals

Goals

Single cell RNA sequencing. Single cell RNA sequencing (scRNAseq) consists of (i) dissociating cells in a tissue, (ii) performing cell isolation, (iii) extracting mRNA and amplifying them, and (iv) counting the number of transcripts on a per gene basis.

Ideally, scRNAseq allows one to bridge the gap between expression profiles a single cell phenotypes. This endeavor is however especially challenging for two main reasons; first, the cells processed may occupy a wide variety of cell states; second, the low mRNA counts on a per cell basis are such that a number of genes may be missed – the drop-out phenomenon.

A classical analysis performed on scRNAseq is the identification of deferentially expressed genes. A number of different techniques have been developed for it, see [176].

$\text{\genetrank}$.

A complementary type of analysis ambitions to understand which DE genes interact with a specific molecular pathway, e.g. that of apoptosis. More specifically, consider the following triple:

  • A set of genes $X$, possibly produced by a single-cell differential expression analysis.
  • A second set of genes $P$ involved in a specific pathway accounting for a phenotype
  • A protein interaction network (PPIN) containing the genes from $X$ and $P$ as nodes.

In this work, we address the problem of prioritizing the genes in $X$ given the genes in $P$, to single out those genes having a higher likelihood to regulate the pathway of interest.

The tool developed to do so is called $\text{\genetrank}$, is based on the theory of random walk in graphs, and Markov chains.

Prerequisites

In the following, we provide an intuitive presentation of $\text{\genetrank}$, and refer the reader/user to [65] for the details.

Model and directions X to P and P to X

We consider a PPIN whose vertices are the individual molecules, and whose edges represent pairwise interactions. Such a network is modeled by a vertex-weighted edge-weighted directed graph $G = (V,E)$. The vertices of the graph are denoted $ { v_1, \dots, v_n}$. (Nb: The weights are used to define Markov models, see below.)

To present the formalism, the two sets of nodes are denoted $S$ and $T$, which allows us to consider two instantiations:

  • $\stdirxp$: $(S=X, T=P)$: focus on paths starting at nodes in $X$ and ending in $P$
  • $\stdirpx$: $(S=P, T=X)$: focus on paths starting at nodes in $P$ and ending in $X$.

Random walks with restarts and Markov chains

Consider:

  • A set of source vertices $S\subset V$,
  • A subset $S'\subseteq S$,
  • A target set $T = {t_{1}, \ldots, t_{k}} \subset V$, with $T\cap S = \emptyset$.

We assume that the reader/user is acquainted with the following concepts, see e.g. [25] :

  • Random walk (RW) on a graph: a sequence of vertices generated by iteratively visiting a neighbor of the current vertex. The choice of the next vertex is carried at random using a probability distribution on the neighbors of the current vertex.
  • RW with restart (RWR): a RW such that once in a while, one restarts from a node amidst a specified set $S'$. When $\size{S'}=1$, one restarts to a single vertex. The restart rate is denoted $r$.
  • Absorbing node: a node which is never exited once reached.

Random walks on graphs are best studied using the framework of Markov chains. Recall that a Markov chain is a stochastic processes visiting states. In our case, the states are the vertices of the graph, and the transition probabilities are encoded in a transition matrix, whose non null entries correspond to the edges $E$ of the graph $G$. In the context of gene ranking, the rationale for using RWR is as follows:

  • RW: a strategy to exploit all paths joining any two nodes, rather than the shortest path.
  • RWR: a strategy to avoid getting lost too far away in the PPIN, by returning to nodes of interest once in a while. This strategy also allows setting a soft limit on the length of the walk.
  • Absorbing node/state: a state in $T$, which are the endpoints of the RWR.

Hit scores and symmetry

Hit vectors and hit scores.. After a number of steps of the random walk:

(State distribution) Consider an initial distribution uniform in the set $S'$. Given any $r \in [0,1)$ and any $S' \subseteq S$, the state distribution at each step $i \geq 0$ is denoted

\begin{equation} \grwsd[M^{r}_{S'}]^{i} = (\grwsd[M^{r}_{S'}]^{i}(v_{1}),\dots,\grwsd[M^{r}_{S'}]^{i}(v_{n}) ), \end{equation}

with $\grwsd[M^{r}_{S'}]^{0}(u) = 0$ for every $u \notin S'$ and $\grwsd[M^{r}_{S'}]^{0}(u) = \frac{1}{|S'|}$ for every $u \in S'$.


Under suitable hypothesis, the previous quantities admit a limit when $i \rightarrow \infty$. Dropping the superscript in $\pi^i$, and focusing on the target set $T = {t_{1}, \ldots, t_{k}} $, we define:

(hit probability vector) Given any $r \in [0,1)$ and any $S' \subseteq S$, the hit vector $\grwsd[M^{r}_{S'}] = (\grwsd[M^{r}_{S'}](t_{1}),\dots,\grwsd[M^{r}_{S'}](t_{k}))$ is composed of the hitting probability for states of $T$.


Note that in the previous definition, $M^{r}_{S'}$ stand for the transition matrix of the Markov chain with restart $S'$. There are two cases of particular interest:

  • The construction can be used with $r=0$, i.e. no restart.
  • Restart to a single vertex i.e. $S' = { s } $. In that case, we plainly use the notation $M^{r}_{s}$.
  • The construction can also be used without any absorbing state. In that case, the random walk proceeds until its co-called stationary distribution – which exists under suitable hypothesis. The stationary probability of a particular node of the graph $t\in V$ is denoted for $\grwsd[M]{t}$. This is a measure of the centrality of node $t$: intuitively, nodes which are easily accessible from many others will have a large stationary probability.

Finally, we define the hit score, based on the hit probability, and the centrality of a node measured by $\grwsd[M]{t}$

(hit score) Given $r \in [0,1]$, define the hit score from $s$ to $t$, as

\begin{equation} \scoredir{s}{t} = \grwsd[M^{r}_s]{t} / \grwsd[M]{t}, \end{equation}

The hit score vector associated with each source $s \in S$ is $(\scoredir{s}{t_1}, \dots, \scoredir{s}{t_k})$. The log score $\logscoredir{s}{t}$ is the natural logarithm of $\scoredir{s}{t}$.


Instances (PPIN, $X$, $P$). It is clear that using a different PPIN, a different experimental gene set $X$ or a different pathway gene set $P$ will affect the score $\scoredir{x}{p}$ for a given pair $(x, p)$. In order to make it clearer we define an instance of execution as the triplet $I = (PPIN, X, P)$, and we refer to the score obtained under this instance as $\scoredirI{I}{x}{p}$. However, for the sake of conciseness, we simply denote this score $\scoredir{x}{p}$.

Symmetry. To analyse the (lack of) symmetry between paths joining $X$ and $P$ and vice-versa, we apply the previous model to two settings:

  • $\stdirxp : (S=X, T=P)$
  • $\stdirpx : (S=P, T=X)$

Saturation indices and hits

Using the hit scores in the two directions $\stdir{X}{P}$ versus $\stdir{P}{X}$, we now define a ranking on the genes of $X$:

(Average score) Let $1 \leq \tau \leq \size{P}$ be an integer. Consider a fixed value of the restart rate $r$. For a source $x$, let the average score be the arithmetic mean over the top $\tau$ values $\max( \log \scoredir{x}{p}, \log \scoredir{p}{x})$ observed for $p\in P$. The gene network ranking ( $\text{\genetrank}$) of genes in $X$ is the ranking associated with the aforementioned average values. The set of top $k$ genes of the ranking is denoted $\topkatr{r}$.


Note that when $\tau=1$, the ranking of a gene in $X$ is determined by its largest max score. Intuitively, averaging scores over $\tau$ targets makes sense since our analysis aims at identifying a pathway.

To assess the stability of this ranking, we proceed as follows. Consider a set of values $R= { r_1,\dots, r_N}$, sorted by increasing or decreasing value. We define the set of genes found in $\topkatr{r}$ up to a given value $r_l$, $1 \leq r_l \leq N$, by

\begin{equation} \topkler{r_l} = \cup_{j=1,\dots,l} \topkatr{r_j}. \end{equation}

We now use this set to qualify the speed at which we discover the sources in $X$ when increasing the upper bound on the restart rate:

(Saturation indices for an increasing sequence of values of $r$.) The saturation index at threshold $r_l$ is the fraction of sources present in $\topkler{r_l}$, that is:

\begin{equation} \satindexler{r_l} = \frac{\size{\topkler{r_l} }}{\size{X}} (\leq 1). \end{equation}

The relative saturation index is the latter normalized by the value of $k$ used:

\begin{equation} \satindexkler{r_l} = \frac{\satindexler{r_l}} {k}. \end{equation}


In the absence of overlap between consecutive $\topkatr{r_l}$, one would have $\size{\topkler{r_l}} = k \times l$. Thus, normalizing by $k$ provides a measure of the overlap between consecutive sets.

We note in passing that the previous sets can be used to define how many hits in a given reference list of genes $\calL$ are obtained:

(Hits) Consider a reference list of genes $\calL$. The number of hits for particular values $(r,k)$ is the size of the set $\topkatr{r} \cap \calL$. For a fixed $k$, we similarly defined the size of the set $\topkler{r}\cap \calL$.


Graphical representations

Expression levels and fold-changes for genes in $X$, MA plots. Two important quantities in transcriptomics are the expression level (mRNA count), and the expression changes, respectively denoted $\logCount$ and $\logXChange$ on a logarithmic scale. The associated scatter plot is often called a MA plot.

Score radar plots. The difficulty in working with values $\logCount(x), \logXChange(x)$ for $x\in X$ is that all pairs $(x,p)$ get mapped onto the same point. To get around this difficulty, we associate a radar plot with each point $x\in X$, yielding an overall score radar scatter plot.
Each gene score radar plot is defined as follows:

  • the background of the gene radar plot is colored using a heat map indexed on the largest ( $\stdir{X}{P}$ or $\stdir{P}{X}$) log score observed for that gene. This background color makes it easy to spot the individual radar plots with high scores.
  • the gene radar plot has a number of spokes equal to the top $k$ (user defined) scores.
  • on each spoke, two values are found, namely the scores $\logscoredir{x}{p}$ and $\logscoredir{p}{x}$.
  • finally, the radar plot title is set set to the gene name accompanied by the interval of scores (log scale).

Score radar scatter plot. Displaying all individual score radar plots in the $\logCount(x), \logXChange(x)$ plane yields the so-called Score radar scatter plot.

Formats for genes and networks

NB: While the Markov chain and score calculations in identifier-type agnostic, the generation of gene radars, and gene radar scatters, assumes that the identifiers used are protein identifiers, and could fail if another type of identifier is used.

Genes versus proteins. In the sequel, we manipulate genes and proteins. The following formats are used:

Protein Protein Interaction Network – PPIN. The PPIN file should be a tab-separated file of interactions, where each interaction has the two protein identifiers, and a weight $(0, 1]$. This weights is morally the interaction probability – so that 0 amount to removing the edge.

NB: These interactions are bidirectional ie give rise to arcs in both directions, whence two non null entries in the Markov chain transition matrix.

P04637 O60551 1.000000
P04637 P0CG48 1.000000
P04637 P46821 1.000000
P04637 O96017 1.000000
P04637 Q9BTK6 1.000000
P04637 Q8N488 1.000000
P04637 Q00987 1.000000
P04637 Q96PM5 1.000000
P04637 P18146 1.000000
P04637 P49841 1.000000
P04637 P32780 1.000000
P04637 Q92793 1.000000
P04637 P61289 1.000000
P04637 Q96GM8 1.000000
P04637 Q9NRR4 1.000000
P04637 P17844 1.000000
P04637 P35232 1.000000
P04637 Q8N726 1.000000
P04637 Q9UFV9 1.000000
P04637 Q9H1Z9 1.000000
In the previous file, each line also contains a weight, here set to one. The framework proposed in [65] indeed accommodates such weights.


Gene set $X$ and target set $P$. The Sources and Targets files are lists of protein identifiers, one per line.

Example: 5 sources used [65]

O00429
P13196
O15304
P30050
P38919

Example: 51 sources used [65]

Q13618-1
Q07812
O00220-1
P25445
Q8WXG6-2
Q13158-1
O15392
Q13794
Q6FH21
Q13546
Q14790-1
O43521
Q07817
P55957
Q16611
Q8WXG6-3
Q8WZ73
O95831
P21580
O00198
O14727
P19438
P20333
Q8IX12
P55210
Q9UBN6
Q13323
O43464
P42574
P10415
P08574
Q13489
D3DV04
Q9NR28
P62877
Q15628
Q9UMX3
O14763-1
P98170
Q92843
P42575
Q92851-1
O15519-2
Q96LC9
Q13490
O14798
P55211
Q86W13
Q9BXH1
O15519-1
Q9NZS9

Markov models and hit scores..

We compute hit scores (Def. def-hit-score) using the C++ $\text{\marmote}$ library ([101] and Marmote).

For a given restart rate, we output a csv file containing the scores $\scoredir{x}{p}$ and $\scoredir{p}{x}$ for all pairs in $X \times P$.

Example: results file for the whole MINT PPIN, obtained for $r=0.3$ – see [65]

Source (Protein) Source (Gene) Target (Protein) Target (Gene) Score_XP Score_PX
Q969V5 MUL1 O43464 HTRA2 0.012238046596964237 2.2028917505855263
Q969V5 MUL1 Q9NR28 DIABLO 0.04255622863789793 6.505642343997667
Q02750 MAP2K1 O15519 CFLAR 0.02985460723965407 3.7143493702392227
P30626 SRI Q92843 BCL2L2 0.07138915687956796 8.635773422413603
Q15631 TSN O15519 CFLAR 0.04624289350994468 5.018111120967504
P35241 RDX O43464 HTRA2 0.04477524544407649 4.803318296617468
Q53HI1 UNC50 Q92843 BCL2L2 0.024774205491692684 2.037103996551807
Q969V5 MUL1 Q15628 TRADD 0.06463971180069862 5.1916622637194525
P35241 RDX Q9BXH1 BBC3 0.0377271121549757 2.599979561745427
Q9Y4W6 AFG3L2 P55211 CASP9 0.1258449394365558 7.929203013775988

Reranked gene lists. Our final product is the set of top genes defined by $\text{\genetrank}$ (Def. def-genetrank), possibly accumulated over several values of the restart rate (Def. eq-topkler).

These set depend on three parameters:

  • the restart rate $r$
  • the number of targets $\tau$ used to compute the average score of a gene in $X$
  • the number $k$ of genes top-ranked

We output such lists in plain txt files.

Radar plots.

We assume a MA plot file, containing triples (gene id, logFC, logCPM).

The radar plots are obtained using this file and the reranked gene list file.

Example radar plot.

The radar scatter plot combining the individual radar plots.

Using Genetrank

The package proves two executables compiled from C++ programs, and three python scripts. We now briefly describe these programs, and refer the user to the Jupyter notebook for example calls.

C++ based executables.

  • $\text{\sblgenetrankmatrix}$ : program computing the transition matrices for the Markov chains representing the random walks (see [65] ).
  • $\text{\sblgenetrankproba}$: program computing the hit probability vectors (Def. def-hit-proba-vector) computed from the transition matrices.

Python scripts.

  • $\text{\sblgenetrankgene}$: script converting a list of gene ids into a list of protein ids, or vice versa.
  • $\text{\sblgenetrankrun}$: script running the C++ based executables, and subsequently producing the file of hit scores, as well as the radar plots.
  • $\text{\sblgenetranksaturation}$: script exploiting the csv files containing all hit scores to compute the $\text{\genetrank}$ and the reranked gene lists.

Dependencies and Installation

As noticed above, Markov models rely on the C++ $\text{\marmote}$ library [101], see Marmote

To use this package, proceed as follows:

  • Compile the executable $\text{\sblgenetrankmatrix}$ and $\text{\sblgenetrankproba}$
  • Use these executables and the python scripts as indicated in the Jupyter notebook

Jupyter demo

See the following jupyter notebook:

  • Jupyter notebook file
  • PPIN_analysis

License for this package

Unless stated otherwise, packages of the SBL are distributed under the following specific license SBL license.

The $\text{\genetrank}$ package, however, is distributed under the Apache 2.0 License.