DinoBloom: A Foundation Model for Generalizable Cell Embeddings in Hematology

Valentin Koch, Sophia J. Wagner, Salome Kazeminia, Ece Sancar, Matthias Hehr, Julia Schnabel, Tingying Peng, Carsten Marr

TL;DR

DinoBloom introduces the first foundation-model approach for single-cell hematology images by training a family of vision-transformer models with a tailored DINOv2 pipeline on a large multi-cohort dataset of over 380,000 white blood cell (WBC) images from 13 datasets. The models learn generalizable visual features that transfer well to unseen data, enabling accurate cell-type classification, acute myeloid leukemia (AML) subtyping via weakly supervised multiple instance learning (MIL), and interpretable embeddings. Empirical results show that DinoBloom surpasses non-medical and medical baselines on external data and across bone-marrow cytology tasks, with large models offering further gains and enabling visualization of biologically meaningful patterns. The work provides open-source models and code, highlighting the potential to streamline hematology workflows and to support cross-dataset analyses with reduced batch effects.

Abstract

In hematology, computational models offer significant potential to improve diagnostic accuracy, streamline workflows, and reduce the tedious work of analyzing single cells in peripheral blood or bone marrow smears. However, clinical adoption of computational models has been hampered by the lack of generalization due to large batch effects, small dataset sizes, and poor performance in transfer learning from natural images. To address these challenges, we introduce DinoBloom, the first foundation model for single-cell images in hematology, utilizing a tailored DINOv2 pipeline. Our model is built upon an extensive collection of 13 diverse, publicly available datasets of peripheral blood and bone marrow smears, the most substantial open-source cohort in hematology so far, comprising over 380,000 white blood cell images. To assess its generalization capability, we evaluate it on an external dataset with a challenging domain shift. We show that our model outperforms existing medical and non-medical vision models in (i) linear probing and k-nearest-neighbor evaluations for cell-type classification on blood and bone marrow smears and (ii) weakly supervised multiple instance learning for acute myeloid leukemia subtyping by a large margin. A family of four DinoBloom models (small, base, large, and giant) can be adapted for a wide range of downstream applications, be a strong baseline for classification problems, and facilitate the assessment of batch effects in new datasets. All models are available at github.com/marrlab/DinoBloom.
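
The k-nearest-neighbor evaluation mentioned in the abstract can be illustrated with a minimal sketch. It assumes that embeddings have already been extracted from a frozen backbone; here they are replaced by synthetic vectors, and the cosine-similarity metric, `k=5`, and the helper names `knn_predict` and `make_split` are illustrative assumptions rather than the authors' evaluation protocol.

```python
import numpy as np

def knn_predict(train_emb, train_labels, test_emb, k=5):
    """Majority-vote k-NN on L2-normalized embeddings (cosine similarity)."""
    train = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    test = test_emb / np.linalg.norm(test_emb, axis=1, keepdims=True)
    sims = test @ train.T                        # (n_test, n_train) cosine similarities
    nn_idx = np.argsort(-sims, axis=1)[:, :k]    # indices of the k most similar training cells
    nn_labels = train_labels[nn_idx]             # (n_test, k) neighbor labels
    n_classes = int(train_labels.max()) + 1
    votes = np.zeros((len(test_emb), n_classes), dtype=int)
    for j in range(k):                           # accumulate one vote per neighbor
        np.add.at(votes, (np.arange(len(test_emb)), nn_labels[:, j]), 1)
    return votes.argmax(axis=1)

def make_split(rng, n_per_class, dim=8):
    """Synthetic stand-in for frozen-backbone embeddings of two cell types."""
    centers = 5.0 * np.eye(dim)[:2]              # one direction per hypothetical class
    emb = np.concatenate(
        [c + 0.5 * rng.normal(size=(n_per_class, dim)) for c in centers]
    )
    labels = np.repeat(np.arange(2), n_per_class)
    return emb, labels

rng = np.random.default_rng(0)
train_emb, train_labels = make_split(rng, 100)
test_emb, test_labels = make_split(rng, 20)
preds = knn_predict(train_emb, train_labels, test_emb, k=5)
accuracy = float((preds == test_labels).mean())
print(f"k-NN accuracy on synthetic embeddings: {accuracy:.2f}")
```

Because the backbone stays frozen, an evaluation of this kind measures the quality of the embeddings themselves rather than the capacity of a fine-tuned classifier. With real data, `train_emb` and `test_emb` would be replaced by features computed from cell images.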

Paper Structure

This paper contains 5 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Data and model overview of our pipeline. (a) All $13$ datasets used in this study: dashed lines indicate datasets that were split into training data for DinoBloom and test data for downstream evaluations; a continuous line indicates a dataset that was held out entirely for testing. (b) Modified DINOv2 pipeline without local crops for model training. We evaluate the performance on three downstream tasks: (c) WBC type classification on the external dataset Acevedo, (d) AML subtype classification via multiple instance learning, and (e) bone marrow WBC type classification.
  • Figure 2: Low dimensional representation (UMAP) of DinoBloom-B features of over 80,000 single cells from the training set of the dataset AML Hehr. Center: UMAP with original images. Five arcs: UMAP for healthy patients (blue) and patients with CBFB::MYH11 (orange), NPM1 (green), PML::RARA (red), and RUNX1::RUNX1T1 (purple), for every class: all patients in the test set (bright) and embedding of one random test patient (dark). The myeloblast cluster and doublet cluster are barely populated for healthy controls. Different AML entities present with distinct cell patterns within the embedding.
  • Figure 3: PCA visualization of the patch tokens on the test data of Acevedo (external) and BMC. Comparison between DinoBloom-B, the second best model Phikon (ViT-B), and the pretrained DINOv2 ViT-B. Colors represent the values of the first three PCA components. DinoBloom-B can differentiate between nuclei, cytoplasm, surrounding red blood cells, and background.
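
A patch-token PCA visualization like the one described for Figure 3 can be sketched as follows. This is a minimal illustration under stated assumptions: the tokens are random placeholders, the 16x16 grid and 384-dimensional tokens are hypothetical shapes, PCA is fit on a single image, and `patch_tokens_to_rgb` is an illustrative helper, not a function from the DinoBloom repository.

```python
import numpy as np

def patch_tokens_to_rgb(tokens, grid_hw):
    """Map (n_patches, dim) tokens to an (h, w, 3) image via their first 3 principal components."""
    h, w = grid_hw
    centered = tokens - tokens.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions of the centered token matrix.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    comps = centered @ vt[:3].T                       # (n_patches, 3) component scores
    lo, hi = comps.min(axis=0), comps.max(axis=0)
    rgb = (comps - lo) / np.maximum(hi - lo, 1e-12)   # rescale each component to [0, 1]
    return rgb.reshape(h, w, 3)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16 * 16, 384))   # placeholder for one image's patch tokens
rgb = patch_tokens_to_rgb(tokens, (16, 16))
print(rgb.shape)
```

Each of the three color channels then encodes one principal component, so patches with similar features (for example nucleus versus cytoplasm) receive similar colors when real tokens are used.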