On the Mathematics of RNA Velocity II: Algorithmic Aspects

Tiejun Li; Yizhuo Wang; Guoguo Yang; Peijie Zhou

On the Mathematics of RNA Velocity II: Algorithmic Aspects

Tiejun Li, Yizhuo Wang, Guoguo Yang, Peijie Zhou

TL;DR

The paper tackles the algorithmic foundations of RNA velocity analysis by fixing a global gene-shared latent time, quantifying uncertainty in EM-inferred kinetic parameters, optimizing the velocity-kernel bandwidth for random-walk approximations, and deriving transition-time estimates between cell states via mean first hitting times. It introduces two time-scale fixation strategies (multiplicative and additive noise models) yielding closed-form solutions for the gene-shared time and gene-re-wide scaling, and provides synthetic validations demonstrating robust time alignment across genes. Uncertainty quantification is grounded in Fisher information and SEM-based EM analyses, delivering practical confidence intervals for kinetic parameters and velocity directions. Finally, the work analyzes the finite-sample behavior of velocity-induced random walks, identifies the optimal kernel bandwidth scaling with sample size and dimension, and presents a taboo-set technique to obtain meaningful transition times in bifurcating developmental trajectories. Collectively, these contributions offer rigorous, implementable tools to improve robustness and interpretability of RNA velocity workflows in scRNA-seq data.

Abstract

In a previous paper [CSIAM Trans. Appl. Math. 2 (2021), 1-55], the authors proposed a theoretical framework for the analysis of RNA velocity, which is a promising concept in scRNA-seq data analysis to reveal the cell state-transition dynamical processes underlying snapshot data. The current paper is devoted to the algorithmic study of some key components in RNA velocity workflow. Four important points are addressed in this paper: (1) We construct a rational time-scale fixation method which can determine the global gene-shared latent time for cells. (2) We present an uncertainty quantification strategy for the inferred parameters obtained through the EM algorithm. (3) We establish the optimal criterion for the choice of velocity kernel bandwidth with respect to the sample size in the downstream analysis and discuss its implications. (4) We propose a temporal distance estimation approach between two cell clusters along the cellular development path. Some illustrative numerical tests are also carried out to verify our analysis. These results are intended to provide tools and insights in further development of RNA velocity type methods in the future.

On the Mathematics of RNA Velocity II: Algorithmic Aspects

TL;DR

Abstract

Paper Structure (20 sections, 7 theorems, 113 equations, 5 figures)

This paper contains 20 sections, 7 theorems, 113 equations, 5 figures.

Introduction
Fixation of time re-scaling constants
Problem setup
Gene-shared time through optimization
Numerical validation
Uncertainty quantification of RNA velocity
Problem setup
Confidence interval construction through Fisher information
Numerical validation
Optimal choice of kernel bandwidth in random walk construction
Problem setup
Estimation of operator convergence and algorithmic insight
Numerical validation
Transition time estimation among cell states
Problem setup
...and 5 more sections

Key Result

Theorem 1

Assume that the inferred gene-specific cell time matrix $T$ satisfies the condition Then the optimization problem eq:timescale has the unique solution $x^*=d \lambda_1^{-1/2} W v_1$, where $v_1$ is the $\ell^2$-unit eigenvector corresponding to the maximal eigenvalue $\lambda_1$ of and it has positive components. The global gene-shared common time

Figures (5)

Figure 1: The computational workflow of RNA velocity analysis and under-addressed issues.
Figure 2: Resolving scaling parameters in simulation data. (A) The empirical distribution of simulated splicing rates $\beta_g$, in which the samples are generated from a log-normal distribution. (B) The dynamics of unspliced mRNA and the simulated data with different splicing rates $\beta_g$. (C) The comparison between gene-specific time and gene-shared time. The blue curve denotes the distribution of the Pearson correlation between each pair of gene-specific time sequences, showing that the correlation between direct inferred times is low. The orange and green curves are the distribution of correlation between $t^{*,1}$, $t^{*,2}$ and all the gene-specific times, respectively, revealing a more compatible pattern. (D) A further comparison between gene-specific time and gene-shared time. The blue boxes correspond to the distributions of five important statistics of each gene-specific time sequence's correlations. The distribution of the correlation between gene-shared time and gene-specific time is drawn as the orange and green boxes. The rescaled time is as consistent as the highest correlation of genes. (E) The correlation of genes-specific time with $t^{(r)}$ compared with the correlation between the gene-shared time and $t^{(r)}$ (the red dashed line for $t^{*,1}$ and the orange one for $t^{*,2}$). Rescaled time is the most consistent time sequence to $t^{(r)}$, while some of the gene-specific times highly differ from the ground truth. (F) Scatter plot of $t^{(r)}$ and the rescaled time $t^{*,1}$, the red line denotes the fitted line. The rescaled time shows an obvious linear pattern in regard to the real time. (G) The cosine correlation between $t^{*,1}$ and $t^{*,2}$. The gene-shared times rescaled by two proposals are very close to each other. (H) The cosine correlation between $\beta^{*,1}$ and $\beta^{*,2}$, showing very similar rescaled parameters. (I) Visualization of simulated data and the fitted streamlines based on UMAP mcinnes2018umap. The coloring is based on the rescaled time normalizing to the interval $[0,1]$, which is consistent with the embedded streamlines.
Figure 3: Uncertainty quantification of RNA velocity using simulation data. (A) Simulation of transcriptional process captures transcriptional induction and repression ("on" and "off" stages) of unspliced and spliced mRNA. The top panel shows the abundance of spliced mRNA and the bottom panel shows unspliced and spliced mRNA in phase space. To distinguish the results, we use blue plots for on-stage and orange for off-stage. (B) The $95 \%$-prediction intervals are presented together with the inferred parameters ("on" and "off" stages). It can be found that almost all of the parameters we infer are within the confidence interval. (C) The violin plots depict the distribution of inferred parameters of on-stage (left panel and middle panel) and off-stage (right panel). Fitted parameters mostly lie in a small range. (D) The norm of inferred RNA velocity is compared to the norm of true velocity at on-stage (left panel) and off-stage (middle panel) which show high concentration near $1$, also, the cosine of the angle between inferred velocity and true velocity is shown for off-stage (right panel) which also distributed near $1$. (E) The bias of inferred velocity under different degradation rates $\gamma$ at on-stage and off-stage, which is close to normal distribution.
Figure 4: Effect of kernel bandwidth $\epsilon$ on operator approximation. The logarithmic plots of the operator approximation error for the linear case $f_1(x)=x_1+x_2+x_3$ (left panel) and nonlinear case $f_2(x)=x_3^2$ (right panel). The errors of both cases are averaged over $n=2000$ samples. The minimal errors are obtained when $\ln(\epsilon)\approx -3.45$ (left panel) and $\ln(\epsilon)\approx -3.20$ (right panel). When $\epsilon\lesssim n^{-2/5}$, the dominant term of the error should be $O({1}/(\sqrt{n}\epsilon^{\frac{3}{4}}))$. This gives the slope $-3/4$ in theory, which is close to the estimated value $-0.74$ (left panel) or $-0.76$ (right panel) by linear regression.
Figure 5: Estimating the pseudo-temporal distance via the first hitting times. (A) Schematics of the naive and taboo set models. (B) The physical time of the synthetic model compared with the mean first hitting time to the target set in the end. There is a linear pattern for the computed mean first hitting time and the red line is obtained by linear regression. (C) The physical time compared with the mean first hitting time to a target set in the middle. For cells before this target set $A$, there is still a linear pattern which is shown in the inset, in which the red dashed line is obtained by linear regression. (D) In the bifurcation case, the physical time compared with the mean first hitting time to one branch computed by iterative method. Cells in the other branch and before bifurcation have very large mean first hitting times. (E) By setting the other branch as the taboo set, the mean first hitting time shows a nearly linear pattern. (F) Streamline embedded UMAP plot of the synthetic bifurcation data. The two expected termination sets are circled. (G) UMAP of the synthetic data. The left panel is colored according to the mean first hitting time to the lower branch termination cells, and the right panel is colored according to the absolute value of the difference between two computed mean first hitting times to two circled branches in Fig. \ref{['fig_transition']}F.

Theorems & Definitions (15)

Theorem 1
proof
Remark 1
Theorem 2
proof
Remark 2
Theorem 3: Finite sample approximation of the operator $\mathcal{L}_{\epsilon}$
Lemma 1
Theorem 4
proof
...and 5 more

On the Mathematics of RNA Velocity II: Algorithmic Aspects

TL;DR

Abstract

On the Mathematics of RNA Velocity II: Algorithmic Aspects

Authors

TL;DR

Abstract

Table of Contents

Key Result

Figures (5)

Theorems & Definitions (15)