scDataset: Scalable Data Loading for Deep Learning on Large-Scale Single-Cell Omics
Davide D'Ascenzo, Sebastiano Cultrera di Montesano
TL;DR
The paper addresses the data-loading bottleneck in atlas-scale single-cell deep learning with scDataset, a PyTorch IterableDataset that draws quasi-random minibatches directly from on-disk AnnData files using block sampling and batched fetching. It gives explicit theoretical bounds on minibatch diversity and reports throughput gains of more than two orders of magnitude in some cases, while matching the downstream task performance of true random sampling. scDataset is compatible with existing ecosystems and formats, supports multiprocessing and distributed training, and is backend-agnostic, which makes atlas-scale training practical on commodity hardware. The result is a principled, tunable trade-off between I/O efficiency and minibatch diversity that should generalize to other large, clustered datasets and leaves room for future storage-format optimizations such as Zarr v3.
Abstract
Training deep learning models on single-cell datasets with hundreds of millions of cells requires loading data from disk, as these datasets exceed available memory. Random sampling provides the data diversity needed for effective training, but it is prohibitively slow because of the overhead of random disk access. Sequential streaming achieves high throughput but introduces biases that degrade model performance. We present scDataset, a PyTorch data loader that enables efficient training from on-disk data and integrates with diverse storage formats. Our approach combines block sampling and batched fetching to achieve quasi-random sampling that balances I/O efficiency with minibatch diversity. On Tahoe-100M, a dataset of 100 million cells, scDataset achieves a speedup of more than two orders of magnitude over true random sampling while working directly with AnnData files. We provide theoretical bounds on minibatch diversity and empirically show that scDataset matches the performance of true random sampling across multiple classification tasks.
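To make the idea of combining block sampling with batched fetching concrete, the following is a minimal NumPy sketch of one plausible reading of it. It is not the scDataset implementation or API: the function name `block_shuffled_batches` and the parameters `block_size` and `fetch_factor` are illustrative assumptions, and the actual disk read is left as a comment. The sketch shuffles contiguous blocks of row indices, groups several minibatches' worth of indices into a single fetch, and then shuffles in memory before splitting into minibatches.

```python
import numpy as np


def block_shuffled_batches(n_cells, batch_size, block_size, fetch_factor, seed=0):
    """Yield minibatches of row indices (hypothetical helper, not the scDataset API).

    block_size   -- number of contiguous rows kept together (assumed parameter)
    fetch_factor -- number of minibatches read from disk per fetch (assumed parameter)
    """
    rng = np.random.default_rng(seed)

    # Block sampling: cut [0, n_cells) into contiguous blocks and shuffle the blocks.
    starts = np.arange(0, n_cells, block_size)
    rng.shuffle(starts)
    order = np.concatenate(
        [np.arange(s, min(s + block_size, n_cells)) for s in starts]
    )

    # Batched fetching: gather several minibatches' worth of indices per read.
    fetch_size = batch_size * fetch_factor
    for i in range(0, n_cells, fetch_size):
        # Sorted so that the read proceeds in ascending order on disk.
        fetched = np.sort(order[i:i + fetch_size])
        # A real loader would read the rows `fetched` from disk in one call here.

        # In-memory shuffle so each minibatch mixes rows from several blocks.
        rng.shuffle(fetched)
        for j in range(0, len(fetched), batch_size):
            yield fetched[j:j + batch_size]


batches = list(block_shuffled_batches(n_cells=1000, batch_size=16,
                                      block_size=8, fetch_factor=4))
```

In this sketch, `block_size=1` with a large `fetch_factor` approaches true random sampling, while a large `block_size` approaches sequential streaming, which is the kind of tunable trade-off between I/O efficiency and minibatch diversity the abstract describes.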
