Gene Incremental Learning for Single-Cell Transcriptomics

Jiaxin Qi, Yan Cui, Jianqiang Huang, Gaogang Xie

TL;DR

Gene Incremental Learning (GIL) treats genes in single-cell transcriptomics as tokens and addresses the growth of gene sets by defining base genes, which are available in every stage, and stage-specific gene partitions. The approach adapts Class Incremental Learning (CIL) ideas with a dedicated GIL objective $\mathcal{L}_{\text{GIL},s_k}$ and evaluation protocols, including gene-wise regression and gene-based classification, to quantify forgetting and knowledge transfer. Comparing a vanilla baseline with data replay and token distillation, the authors demonstrate that forgetting occurs in the vanilla setting and that both replay and distillation mitigate it, though with trade-offs on downstream classification. The work provides a scalable benchmark on CELLxGENE with six downstream tasks, validating the framework and suggesting future extensions to other token-learning domains in biology and beyond.

Abstract

Classes, as fundamental elements of Computer Vision, have been extensively studied within incremental learning frameworks. In contrast, tokens, which play essential roles in many research fields, exhibit similar characteristics of growth, yet investigations into their incremental learning remain scarce. This research gap primarily stems from the holistic nature of tokens in language, which imposes significant challenges on the design of incremental learning frameworks for them. To overcome this obstacle, in this work, we turn to a type of token, the gene, in a large-scale biological dataset (single-cell transcriptomics) to formulate a pipeline for gene incremental learning and establish corresponding evaluations. We found that the forgetting problem also exists in gene incremental learning, so we adapted existing class incremental learning methods to mitigate the forgetting of genes. Through extensive experiments, we demonstrated the soundness of our framework design and evaluations, as well as the effectiveness of our method adaptations. Finally, we provide a complete benchmark for gene incremental learning in single-cell transcriptomics.

Paper Structure

This paper contains 14 sections, 10 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Illustrations of (a) Class Incremental Learning (CIL) framework and (b) our proposed Gene Incremental Learning (GIL) framework. In CIL, the given classes are exclusive at each stage, and classification accuracy is tested across all previously seen classes. In GIL, $b_i, i=1,2,3,\ldots$ denote the base tokens given in every stage, while $T^{s_i}=\{t^{s_i}\}$ represents the set of specific tokens to be learned in stage $i$. For evaluation, regression refers to the token-wise regression loss, and $T^{s_i}$-based classification denotes performing the classification on the specific downstream dataset where the token set $T^{s_i}$ is crucial.
  • Figure 2: Illustrations of baseline methods for GIL in stage $k$, using $k=2$ as an example; samples from stage 2 have a yellow background. (a) The baseline shows the masked token prediction loss formulated in Eq. \ref{eq:tran_objective}. "init" denotes that the current model $\phi_{s_2}$ is initialized from the previous optimal model $\phi^*_{s_1}$. (b) Data Replay keeps some previous samples (green background) for training in the current stage. (c) Token Distillation shows how the previous optimal model distills knowledge through base token regression, as formulated in Eq. \ref{eq:distillation}.
  • Figure 3: Test accuracy (%) for the 3-stage GIL setting (ICol-Myel-Panc) on the corresponding downstream classification datasets. The crucial genes for the last dataset, Panc, are learned only in the last stage, so there is no forgetting problem and the results for Panc are omitted.
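
The captions for Figures 1 and 2 describe three ingredients: per-stage gene sets built from base tokens plus stage-specific tokens, a masked token regression loss, and two forgetting mitigations (Data Replay and Token Distillation). The following is a minimal NumPy sketch of how these pieces might fit together. The equations themselves are not reproduced in this excerpt, so the mean-squared-error form of both losses is an assumption, and all function and variable names here are hypothetical rather than taken from the authors' code.

```python
import numpy as np

def stage_gene_sets(base_genes, stage_specific, k):
    """Genes visible in stage k: the shared base genes plus that stage's specific set."""
    return list(base_genes) + list(stage_specific[k])

def masked_regression_loss(pred, target, mask):
    """Baseline objective (assumed MSE): regress expression values at masked positions only."""
    mask = np.asarray(mask, dtype=bool)
    return float(np.mean((pred[mask] - target[mask]) ** 2))

def replay_batch(current, memory, n_replay, rng):
    """Data Replay: extend the current-stage batch with samples kept from earlier stages."""
    n = min(n_replay, len(memory))
    idx = rng.choice(len(memory), size=n, replace=False)
    return list(current) + [memory[i] for i in idx]

def base_token_distillation_loss(student_pred, teacher_pred, is_base):
    """Token Distillation (assumed MSE): the current model matches the frozen
    previous-stage model's predictions, restricted to base-token positions."""
    is_base = np.asarray(is_base, dtype=bool)
    return float(np.mean((student_pred[is_base] - teacher_pred[is_base]) ** 2))

# Toy usage with made-up gene names and values.
genes_stage_2 = stage_gene_sets(["b1", "b2"], [["g1"], ["g2", "g3"]], k=1)
loss = masked_regression_loss(
    np.array([1.0, 2.0, 3.0]), np.array([1.0, 0.0, 1.0]), [0, 1, 1]
)
distill = base_token_distillation_loss(
    np.array([1.0, 2.0, 3.0]), np.array([0.0, 2.0, 0.0]), [1, 1, 0]
)
batch = replay_batch(["c1", "c2"], ["m1", "m2", "m3"], 2, np.random.default_rng(0))
```

Restricting distillation to base tokens mirrors the caption's "base token regression": base tokens appear in every stage, so they are the positions at which the previous model and the current model can be compared directly. How the distillation term is weighted against the regression term is not stated in this excerpt and is left out of the sketch.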