TEDDY: A Family Of Foundation Models For Understanding Single Cell Biology

Alexis Chevalier, Soumya Ghosh, Urvi Awasthi, James Watkins, Julia Bieniewska, Nichita Mitrea, Olga Kotova, Kirill Shkura, Andrew Noble, Michael Steinbaugh, Julien Delile, Christoph Meier, Leonid Zhukov, Iya Khalil, Srayanta Mukherjee, Judith Mueller

TL;DR

The TEDDY work investigates whether scaling pre-training data and injecting biological annotations into single-cell foundation models can improve the understanding of disease biology from scRNA-seq data. The authors train two variants, TEDDY-G and TEDDY-X, on 116M cells and evaluate them on held-out donor and held-out disease benchmarks. They show that larger models and annotation supervision yield notable gains in donor generalization and competitive performance on disease tasks. TEDDY-G 400M, in particular, outperforms several existing foundation models on held-out donors and approaches top performance on held-out diseases, while also improving over task-specific methods. These results underscore the value of data scale and biologically informed supervision for translating atlas-scale single-cell data into disease-relevant representations, and they point to limitations and to avenues for future multi-modal integration and perturbation-based evaluations.

Abstract

Understanding the biological mechanisms of disease is critical for medicine, and in particular for drug discovery. AI-powered analysis of genome-scale biological data holds great potential in this regard. The increasing availability of single-cell RNA sequencing data has enabled the development of large foundation models for disease biology. However, existing foundation models either do not improve or only modestly improve over task-specific models in downstream applications. Here, we explored two avenues for improving the state of the art. First, we scaled the pre-training dataset to 116 million cells, which is larger than those used by previous models. Second, we leveraged the availability of large-scale biological annotations as a form of supervision during pre-training. We trained the TEDDY family of models, comprising six transformer-based state-of-the-art single-cell foundation models with 70 million, 160 million, and 400 million parameters. We vetted our models on two downstream evaluation tasks: identifying the underlying disease state of held-out donors not seen during training, and distinguishing healthy cells from diseased ones for disease conditions and donors not seen during training. Scaling experiments showed that performance improved predictably with both data volume and parameter count. Our models showed substantial improvement over existing work on the first task and more muted improvements on the second.
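To make the "improved predictably" claim concrete, the sketch below evaluates the best-fit curve quoted in the Figure 2 caption, $14.95\times(\text{\# parameters})^{-0.10}$, at the three model sizes named in the abstract. The function name is hypothetical, and treating the fit as valid at exactly these sizes is an assumption. The caption only states that the curve fits validation loss after one epoch of training.

```python
def fitted_validation_loss(n_params: float,
                           coeff: float = 14.95,
                           exponent: float = -0.10) -> float:
    """Evaluate the power-law fit from the Figure 2 caption.

    Hypothetical helper: only the coefficient and the exponent come from
    the caption; everything else here is illustrative.
    """
    return coeff * n_params ** exponent


# Model sizes listed in the abstract (70M, 160M, 400M parameters).
for n_params in (70e6, 160e6, 400e6):
    loss = fitted_validation_loss(n_params)
    print(f"{n_params / 1e6:.0f}M parameters -> fitted loss {loss:.3f}")

# Under this fit, doubling the parameter count scales the loss by 2 ** -0.10.
doubling_factor = fitted_validation_loss(2e8) / fitted_validation_loss(1e8)
print(f"loss ratio per doubling: {doubling_factor:.4f}")
```

A power law is a straight line on log-log axes, which is one reading of the caption's "linear best-fit". With an exponent of -0.10, each doubling of the parameter count lowers the fitted loss by roughly 7%, so the curve describes steady but slowly diminishing gains.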

Paper Structure

This paper contains 27 sections, 4 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: An illustration of differences between Teddy variants. On the left we illustrate a cell with five non-zero expressed genes with non-zero median normalized expression values. Teddy-G represents a cell as a list of genes ordered by their expression levels and the pre-training task involves predicting the index of masked out genes. Teddy-X ranks expression values, scales them to the interval $[-1, +1]$, and then learns to predict a masked rank scaled to the interval $[-1, +1]$.
  • Figure 2: Scaling behavior. Left: Pre-training loss on held-out data for Teddy-G as a function of pre-training data and model size. Right: Validation loss at the end of one epoch of training against the number of parameters in the model. The black dashed line is the linear best fit: $14.95\times(\text{\# parameters})^{-0.10}$.
  • Figure 3: Performance on held-out donors as a function of model size and biological annotations. Left: $\text{F}_1$ scores improve for both Teddy-G and Teddy-X with increasing size. Teddy-G outperforms Teddy-X across model scales and improves on XGBoost trained exclusively for this task. Right: The x-axis plots the $\text{F}_1$ scores achieved by Teddy-G 70M and Teddy-X 70M when pre-trained with supervision from biological ontologies. The y-axis plots the $\text{F}_1$ scores achieved by the same models without biological ontologies.
  • Figure 4: Performance of existing foundation models compared with Teddy-G. Teddy-G consistently outperforms other foundation models. The x-axis plots the fine-tuning accuracy achieved by Teddy-G 400M. The y-axis plots the accuracy achieved by other foundation models. Points below the diagonal indicate that Teddy-G 400M achieves higher accuracy.
  • Figure 5: Performance of task-specific machine learning methods using handcrafted features and Teddy-G 400M embeddings. Accuracy increases with the use of Teddy-G embeddings. The x-axis plots the accuracy achieved by different methods using Teddy-G 400M embeddings as features. The y-axis plots the accuracy achieved when using hand-engineered features. Points below the diagonal indicate that embedding features achieve higher accuracy.
  • ...and 1 more figures
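The two cell representations described in the Figure 1 caption can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names and gene labels are hypothetical, the input values are assumed to be already median normalized, genes are ordered highest expression first, and ties are broken by input order. The caption does not specify any of these choices.

```python
from typing import Dict, List, Sequence


def teddy_g_style_sequence(gene_ids: Sequence[str],
                           expression: Sequence[float]) -> List[str]:
    """Represent a cell as its expressed genes ordered by expression.

    Assumption: highest expression first; ties keep input order.
    """
    expressed = [(g, x) for g, x in zip(gene_ids, expression) if x > 0]
    expressed.sort(key=lambda pair: pair[1], reverse=True)
    return [gene for gene, _ in expressed]


def teddy_x_style_values(gene_ids: Sequence[str],
                         expression: Sequence[float]) -> Dict[str, float]:
    """Rank the non-zero expression values and scale the ranks to [-1, +1].

    Assumption: the lowest expressed gene maps to -1, the highest to +1,
    and a cell with a single expressed gene maps to 0.
    """
    expressed = [(g, x) for g, x in zip(gene_ids, expression) if x > 0]
    n = len(expressed)
    # Positions of the expressed genes, lowest expression first.
    order = sorted(range(n), key=lambda i: expressed[i][1])
    scaled: Dict[str, float] = {}
    for rank, i in enumerate(order):
        gene = expressed[i][0]
        scaled[gene] = 0.0 if n == 1 else 2.0 * rank / (n - 1) - 1.0
    return scaled


# A toy cell with five non-zero genes, mirroring the Figure 1 setup.
genes = ["A", "B", "C", "D", "E", "F"]
values = [0.5, 3.0, 0.0, 1.2, 2.0, 0.8]

print(teddy_g_style_sequence(genes, values))
print(teddy_x_style_values(genes, values))
```

Following the caption, a Teddy-G style pre-training target would be the identity of the gene at a masked position in the ordered list, while a Teddy-X style target would be the scaled rank of a masked gene. Masking itself is left out of the sketch to keep it short.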