Scaling Laws for Masked-Reconstruction Transformers on Single-Cell Transcriptomics

Ihor Kendiukhov

TL;DR

This paper establishes the first systematic neural scaling laws for masked-reconstruction transformers trained on single-cell RNA-seq data. By comparing a data-rich regime (Regime A: 512 HVGs, 200,000 cells) with a data-limited regime (Regime B: 1,024 HVGs, 10,000 cells), the authors show clear power-law scaling of validation MSE with model size in the data-rich regime. The fitted law $L(P)=aP^{-\alpha}+c$ has an exponent of $\alpha \approx 0.23$–$0.27$ and an irreducible floor of $c \approx 1.44$, corresponding to roughly $2.30$ bits of entropy per masked gene position. In the data-limited regime, scaling virtually vanishes ($\alpha \approx 0.009$), indicating that data scarcity is the binding constraint. The work also provides a preliminary entropy-based interpretation of the irreducible floor and discusses practical implications for designing single-cell foundation models, emphasizing the data-to-parameter ratio and noting diminishing returns beyond ~20M parameters under current data conditions. Overall, the results offer a quantitative framework for predicting performance, informing data collection, and guiding architectural choices in single-cell pretraining.

Abstract

Neural scaling laws -- power-law relationships between loss, model size, and data -- have been extensively documented for language and vision transformers, yet their existence in single-cell genomics remains largely unexplored. We present the first systematic study of scaling behaviour for masked-reconstruction transformers trained on single-cell RNA sequencing (scRNA-seq) data. Using expression profiles from the CELLxGENE Census, we construct two experimental regimes: a data-rich regime (512 highly variable genes, 200,000 cells) and a data-limited regime (1,024 genes, 10,000 cells). Across seven model sizes spanning three orders of magnitude in parameter count (533 to $3.4 \times 10^8$ parameters), we fit the parametric scaling law $L(P)=aP^{-\alpha}+c$ to validation mean squared error (MSE). The data-rich regime exhibits clear power-law scaling with an irreducible loss floor of $c \approx 1.44$, while the data-limited regime shows negligible scaling, indicating that model capacity is not the binding constraint when data are scarce. These results establish that scaling laws analogous to those observed in natural language processing do emerge in single-cell transcriptomics when sufficient data are available, and they identify the data-to-parameter ratio as a critical determinant of scaling behaviour. A preliminary conversion of the data-rich asymptotic floor to information-theoretic units yields an estimate of approximately 2.30 bits of entropy per masked gene position. We discuss implications for the design of single-cell foundation models and outline the additional measurements needed to refine this entropy estimate.
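
As an illustration of what fitting $L(P)=aP^{-\alpha}+c$ involves, the sketch below recovers $a$, $\alpha$, and $c$ from a handful of (parameter count, loss) pairs. It is not the paper's fitting procedure, which is not described in this summary. It assumes a simple grid search over the floor $c$ combined with least squares in log space, and it runs on synthetic, noise-free losses generated from hypothetical parameter values. The names `fit_scaling_law` and `n_grid` are illustrative.

```python
import math

def scaling_law(P, a, alpha, c):
    """Parametric scaling law L(P) = a * P**(-alpha) + c."""
    return a * P ** (-alpha) + c

def fit_scaling_law(params, losses, n_grid=2000):
    """Fit (a, alpha, c) by grid search over c (hypothetical helper).

    For each candidate floor c in [0, min(losses)), log(L - c) is linear
    in log(P): log(L - c) = log(a) - alpha * log(P). That line is fitted
    by ordinary least squares, and the candidate with the smallest squared
    error in the original loss space is kept.
    """
    xs = [math.log(p) for p in params]
    n = len(xs)
    x_mean = sum(xs) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    c_max = min(losses)
    best = None
    for i in range(n_grid):
        c = c_max * i / n_grid  # strictly below min(losses), so log is defined
        ys = [math.log(l - c) for l in losses]
        y_mean = sum(ys) / n
        slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sxx
        intercept = y_mean - slope * x_mean
        a, alpha = math.exp(intercept), -slope
        sse = sum((scaling_law(p, a, alpha, c) - l) ** 2
                  for p, l in zip(params, losses))
        if best is None or sse < best[0]:
            best = (sse, a, alpha, c)
    return best[1], best[2], best[3]

# Synthetic example: these are NOT the paper's measurements. The "true"
# values below are assumptions chosen to resemble the reported magnitudes.
true_a, true_alpha, true_c = 5.0, 0.25, 1.44
params = [1e3, 1e4, 1e5, 1e6, 1e7, 1e8]
losses = [scaling_law(p, true_a, true_alpha, true_c) for p in params]

a_hat, alpha_hat, c_hat = fit_scaling_law(params, losses)
print(f"a = {a_hat:.3f}, alpha = {alpha_hat:.3f}, c = {c_hat:.3f}")
```

With real, noisy runs a nonlinear least-squares routine with uncertainty estimates would likely be preferable. The grid search is used here only because it needs nothing beyond the standard library and makes the role of the floor $c$ explicit.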

Paper Structure

This paper contains 47 sections, 8 equations, 2 figures, and 4 tables.

Figures (2)

  • Figure 1: Validation MSE versus parameter count $P$ on log-log axes, with fitted scaling curves $L(P)=aP^{-\alpha}+c$. Each point represents one training run (best checkpoint). Regime A (data-rich, left) shows clear power-law decay with $R^2 = 0.82$; Regime B (data-limited, right) is effectively flat with $R^2 = 0.02$.
  • Figure 2: Canonical Regime A scaling fits (18 runs, 6 sizes $\times$ 3 seeds, all 60k steps). Left: MSE metric. Right: derived Gaussian NLL metric. Both converge on an entropy floor of $\sim$2.30 bits per masked position.
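
The link between the MSE floor and the entropy floor in Figure 2 can be sketched numerically. The exact conversion used in the paper is not given in this summary. The snippet below assumes the floor is read as the variance of a Gaussian residual, so that the entropy is the Gaussian differential entropy $\tfrac{1}{2}\log_2(2\pi e\,\sigma^2)$. Under that assumption, $c \approx 1.44$ gives about 2.31 bits, in line with the reported $\sim$2.30 bits. The function name `gaussian_entropy_bits` is illustrative.

```python
import math

def gaussian_entropy_bits(variance):
    """Differential entropy, in bits, of a Gaussian with the given variance."""
    return 0.5 * math.log2(2 * math.pi * math.e * variance)

# Reported irreducible MSE floor for Regime A. Treating it as a Gaussian
# residual variance is an assumption, not the paper's stated method.
mse_floor = 1.44
bits = gaussian_entropy_bits(mse_floor)
print(f"{bits:.2f} bits per masked position")
```

Differential entropy depends on the scale of the expression values, so the figure is only meaningful relative to the normalization used for the data.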