scDataset: Scalable Data Loading for Deep Learning on Large-Scale Single-Cell Omics

Davide D'Ascenzo, Sebastiano Cultrera di Montesano

TL;DR

The paper tackles the data-loading bottleneck in atlas-scale single-cell deep learning by introducing scDataset, a PyTorch IterableDataset that achieves quasi-random minibatch sampling directly from on-disk AnnData via block sampling and batched fetching. It provides explicit theoretical bounds on minibatch diversity and demonstrates substantial throughput gains (over two orders of magnitude in some cases) while preserving random-sampling efficacy in downstream tasks. Importantly, scDataset remains compatible with existing ecosystems and formats, supports multiprocessing and distributed training, and is backend-agnostic, enabling practical atlas-scale training on commodity hardware. The work offers a principled, tunable trade-off between I/O efficiency and minibatch diversity, with broad generalizability to other large, clustered datasets and a clear path toward future storage-format optimizations such as Zarr v3.

Abstract

Training deep learning models on single-cell datasets with hundreds of millions of cells requires loading data from disk, as these datasets exceed available memory. While random sampling provides the data diversity needed for effective training, it is prohibitively slow due to the random access pattern overhead, whereas sequential streaming achieves high throughput but introduces biases that degrade model performance. We present scDataset, a PyTorch data loader that enables efficient training from on-disk data with seamless integration across diverse storage formats. Our approach combines block sampling and batched fetching to achieve quasi-random sampling that balances I/O efficiency with minibatch diversity. On Tahoe-100M, a dataset of 100 million cells, scDataset achieves more than two orders of magnitude speedup compared to true random sampling while working directly with AnnData files. We provide theoretical bounds on minibatch diversity and empirically show that scDataset matches the performance of true random sampling across multiple classification tasks.
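The combination of block sampling and batched fetching described above can be illustrated with a minimal index-level sketch. It is not the scDataset implementation: the function name `block_shuffled_minibatches`, its signature, and the choice to sort indices within each fetch are assumptions made for illustration. Here `m` is read as the minibatch size, `b` as the block size, and `f` as the fetch factor.

```python
import random


def block_shuffled_minibatches(n, m, b, f, seed=0):
    """Yield minibatches of indices (hypothetical sketch, not the library API).

    n: number of samples, m: minibatch size, b: block size, f: fetch factor.
    """
    rng = random.Random(seed)
    # Block sampling: split [0, n) into contiguous blocks of b indices,
    # then shuffle the order of the blocks (not the indices inside them).
    blocks = [list(range(s, min(s + b, n))) for s in range(0, n, b)]
    rng.shuffle(blocks)
    order = [i for block in blocks for i in block]

    fetch_size = m * f
    for start in range(0, len(order), fetch_size):
        # Batched fetching: one fetch covers m * f indices. Sorting them is an
        # assumption here; it keeps the on-disk access pattern ordered.
        fetch = sorted(order[start:start + fetch_size])
        # A real loader would read the rows in `fetch` from disk in one call.
        # Reshuffle in memory so each minibatch mixes several blocks.
        rng.shuffle(fetch)
        for j in range(0, len(fetch), m):
            yield fetch[j:j + m]


batches = list(block_shuffled_minibatches(n=1000, m=10, b=5, f=4, seed=1))
```

With `b=1` and `f=1` the sketch reduces to ordinary per-sample shuffling; larger `b` makes reads more contiguous, and larger `f` lets each minibatch draw from more blocks.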

Paper Structure

This paper contains 31 sections, 3 theorems, 14 equations, 7 figures, 2 tables, 1 algorithm.

Key Result

Theorem 3.1

As $f \to \infty$ with fixed $m, b, K$, the expected entropy satisfies [equation not shown], where $H(p) = -\sum_{k=1}^K p_k \log_2 p_k$.
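The entropy $H(p) = -\sum_{k=1}^K p_k \log_2 p_k$ can be evaluated on the empirical label frequencies of a single minibatch. The helper below is a minimal sketch; the name `label_entropy_bits` is hypothetical, and it assumes that $p_k$ is estimated as the fraction of minibatch samples carrying label $k$.

```python
import math
from collections import Counter


def label_entropy_bits(labels):
    """Shannon entropy in bits of the empirical label distribution."""
    counts = Counter(labels)
    total = len(labels)
    # Only labels that actually occur contribute, so log2(0) never arises.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


balanced = label_entropy_bits(["plate1", "plate1", "plate2", "plate2"])
uniform4 = label_entropy_bits(["a", "b", "c", "d"])
single = label_entropy_bits(["plate1"] * 8)
```

A minibatch drawn from a single label has entropy 0, while one spread evenly over $K$ labels reaches the maximum of $\log_2 K$ bits.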

Figures (7)

  • Figure 1: scDataset bridges diverse data backends with PyTorch's DataLoader through a modular interface. Data retrieval is managed by a configurable fetch_callback, followed by preprocessing with fetch_transform (e.g., sparse-to-dense conversion). Batches are selected using batch_callback and further processed with batch_transform before being yielded to the training pipeline.
  • Figure 2: Data loading throughput on AnnData as a function of block size and fetch factor. Throughput (samples/sec) increases with both parameters, reaching a 204$\times$ speedup over AnnLoader at the largest values.
  • Figure 3: Effect of fetch factor on streaming throughput from AnnData. Batched fetching amortizes fixed I/O overhead, achieving a speedup of more than 15$\times$ at $f=1024$ compared to iterative minibatch fetching.
  • Figure 4: Plate label entropy within minibatches as a function of block size and fetch factor. Higher fetch factors compensate for the diversity loss from larger block sizes.
  • Figure 5: Classification performance (mean $\pm$ std over 2 runs) across four tasks. BlockShuffling with $b=16$, $f=256$ matches random sampling $(b=1)$ performance, while sequential and buffered streaming underperform due to plate-scale heterogeneity.
  • ...and 2 more figures
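
The four hooks named in the Figure 1 caption can be pictured as stages applied in sequence. The sketch below is only an illustration of that ordering: the function `run_pipeline`, the argument order of each callback, and the use of a plain list as the data collection are all assumptions, not the scDataset API.

```python
def run_pipeline(collection, fetches, m,
                 fetch_callback, fetch_transform,
                 batch_callback, batch_transform):
    """Apply the four hooks in the order described in the caption (hypothetical)."""
    for fetch_indices in fetches:
        # 1) Retrieve one chunk from the backend, 2) preprocess the whole chunk.
        chunk = fetch_transform(fetch_callback(collection, fetch_indices))
        for start in range(0, len(fetch_indices), m):
            stop = min(start + m, len(fetch_indices))
            positions = list(range(start, stop))
            # 3) Select one minibatch from the chunk, 4) post-process it.
            yield batch_transform(batch_callback(chunk, positions))


data = [x * 10 for x in range(8)]  # stand-in for an on-disk collection
batches = list(run_pipeline(
    data,
    fetches=[[0, 1, 2, 3], [4, 5, 6, 7]],
    m=2,
    fetch_callback=lambda coll, idx: [coll[i] for i in idx],
    fetch_transform=lambda chunk: [float(v) for v in chunk],
    batch_callback=lambda chunk, pos: [chunk[p] for p in pos],
    batch_transform=tuple,
))
```

Keeping retrieval and selection as separate callables is what would let the same loop serve different storage backends; only `fetch_callback` needs to know how the collection is indexed.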

Theorems & Definitions (3)

  • Theorem 3.1: Large fetch factor
  • Theorem 3.2: No batched fetching
  • Corollary 3.3: General case