scDD: Latent Codes Based scRNA-seq Dataset Distillation with Foundation Model Knowledge

Zhen Yu, Jianan Han, Yang Liu, Qingchao Chen

TL;DR

scDD tackles the scalability, sparsity, and imbalance challenges of scRNA-seq data by distilling foundation-model knowledge into compact latent codes and generating synthetic data via a novel single-step conditional diffusion generator (SCDG). By performing diffusion in latent space and updating latent codes rather than real gene-expression values, scDD preserves data characteristics while enabling strong cross-model generalization and downstream performance. The framework combines latent-code random initialization, condition-guided generation, and gradient/distribution matching with back-propagation through the latent space, yielding synthetic data that closely emulates the original dataset across tasks such as cell-type annotation and disease-status prediction. Extensive benchmarks across six public scRNA-seq datasets show consistent improvements over state-of-the-art distillation methods, with sizable absolute and relative gains, and strong robustness under cross-architecture evaluation, highlighting practical implications for privacy-preserving data sharing and scalable, multi-center analyses.

Abstract

Single-cell RNA sequencing (scRNA-seq) technology has profiled hundreds of millions of human cells across organs, diseases, development and perturbations to date. However, the high-dimensional sparsity, batch-effect noise, category imbalance, and ever-increasing scale of the original sequencing data pose significant challenges for multi-center knowledge transfer, data fusion, and cross-validation between scRNA-seq datasets. To address these barriers, (1) we first propose a latent-code-based scRNA-seq dataset distillation framework named scDD, which transfers and distills foundation model knowledge and original dataset information into a compact latent space and uses a generator to produce a synthetic scRNA-seq dataset that replaces the original dataset. Then, (2) we propose a single-step conditional diffusion generator named SCDG, which performs single-step gradient back-propagation to help scDD optimize distillation quality and avoid the gradient decay caused by multi-step back-propagation. Meanwhile, SCDG preserves the scRNA-seq data characteristics and inter-class discriminability of the synthetic dataset through flexible conditional control and generation quality assurance. Finally, we propose a comprehensive benchmark to evaluate the performance of scRNA-seq dataset distillation across different data analysis tasks. Our proposed method achieves a 7.61% absolute and 15.70% relative improvement over previous state-of-the-art methods, averaged across tasks.
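The core idea of updating latent codes rather than gene-expression values can be illustrated with a toy sketch. This is not the authors' implementation: a frozen linear map (the hypothetical `decoder_w`) stands in for the SCDG generator, a class-mean squared error stands in for the gradient/distribution matching objectives, and the function name `distill_latent_codes` and the toy Poisson counts are assumptions made purely for illustration.

```python
import numpy as np


def distill_latent_codes(real_x, labels, decoder_w, steps=300):
    """Fit one latent code per class so its decoded profile matches the class mean.

    decoder_w is a frozen (latent_dim, n_genes) matrix standing in for a
    pretrained generator (an assumption); only the latent codes z are updated.
    """
    rng = np.random.default_rng(0)
    classes = np.unique(labels)
    # Matching target: mean expression profile of each class.
    targets = np.stack([real_x[labels == c].mean(axis=0) for c in classes])
    # Random initialisation of the latent codes (one per class).
    z = rng.normal(size=(len(classes), decoder_w.shape[0]))
    # Step size 1/L for the quadratic loss, where L = 2 * ||W||_2^2.
    lr = 1.0 / (2.0 * np.linalg.norm(decoder_w, 2) ** 2)
    losses = []
    for _ in range(steps):
        residual = z @ decoder_w - targets        # one pass through the generator
        losses.append(float((residual ** 2).sum()))
        z -= lr * 2.0 * residual @ decoder_w.T    # gradient flows to z only
    return classes, z, z @ decoder_w, losses


# Toy usage: 60 cells, 20 genes, 3 cell types, 4-dimensional latent codes.
rng = np.random.default_rng(1)
real_x = rng.poisson(1.0, size=(60, 20)).astype(float)
labels = np.repeat([0, 1, 2], 20)
decoder_w = rng.normal(size=(4, 20))
classes, z, synthetic, losses = distill_latent_codes(real_x, labels, decoder_w)
print(synthetic.shape, losses[0], losses[-1])
```

Because the generator weights stay fixed and only `z` receives gradient updates, every synthetic profile remains in the generator's output space. In this linear toy that space is just the span of `decoder_w`; the abstract's stronger claim, that a pretrained generator keeps synthetic data faithful to scRNA-seq characteristics, is not demonstrated here.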

Paper Structure

This paper contains 36 sections, 5 equations, 7 figures, 7 tables, 1 algorithm.

Figures (7)

  • Figure 1: scRNA-seq dataset distillation challenges. (a) Directly updating gene expression values at the scRNA-seq data level causes them to lose their inherent characteristics. (b) Distillation under severe class imbalance leads to a loss of inter-class discriminability.
  • Figure 2: scDD overall framework. (a) The SCDG generator combines diffusion and foundation models to generate high-quality scRNA-seq data under flexibly controlled conditions, and performs single-step gradient back-propagation to help optimize distillation quality without gradient decay. (b) scDD delivers a small-scale, desensitized synthetic scRNA-seq dataset that can replace the large-scale, high-dimensional, sparse, and noisy original dataset and adapt to any foundation model for different scRNA-seq data analysis tasks.
  • Figure 3: Different distillation matching targets and optimization pipelines. (a) DC/DM distills a data-level synthetic dataset by matching gradient and distribution targets. (b) FeatDistill performs distillation optimization at the feature level and delivers synthetic features under an encoder. (c) SDXL-Turbo generates a synthetic dataset for data augmentation but does not involve distillation matching. (d) TextDistill distills at the feature level but delivers data-level synthetic scRNA-seq data through a decoder.
  • Figure 4: Ablation study of how the parameter size of the foundation model’s task head affects distillation performance, with SPC=1 on STIZ-Kidney.
  • Figure 5: Gene-level visualization of the expression values of the top three highly variable genes in the original dataset, the gradient-matching synthetic dataset, and the synthetic dataset generated by scDD.
  • ...and 2 more figures