
Are Vision Foundation Models Foundational for Electron Microscopy Image Segmentation?

Caterina Fuster-Barceló, Virginie Uhlmann

TL;DR

This study probes whether vision foundation models (VFMs) pretrained on natural images can serve as foundational representations for EM mitochondria segmentation across heterogeneous datasets. By comparing frozen-backbone head-only training and LoRA-based PEFT across three VFMs (DINOv2, DINOv3, OpenCLIP) on Lucchi++ and VNC, the authors show strong in-domain performance but substantial degradation when training on the union of datasets due to persistent domain mismatch in the latent space. Domain-mismatch diagnostics using PCA, the Fréchet distance in DINOv2 feature space (FD-DINOv2), and linear probes reveal that Lucchi++ and VNC occupy distinct regions despite visual similarity, indicating a need for explicit domain-alignment mechanisms beyond lightweight adaptation. The findings suggest VFMs are not yet foundational enough to generalize across heterogeneous EM data without additional domain-alignment strategies, though they offer practical utility when single-domain adaptation is sufficient and data/compute budgets are constrained.
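To make the two adaptation regimes concrete, here is a minimal NumPy sketch of a LoRA-style layer: the pretrained weight stays frozen and only a low-rank update is trainable. It is an illustration of the general LoRA idea, not the authors' implementation; the class name `LoRALinear`, the rank, the scaling `alpha`, and the 64-dimensional layer are all assumptions chosen for the example.

```python
import numpy as np


class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update (illustrative only)."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight  # frozen pretrained weight, never updated
        # Trainable down-projection, small random initialisation.
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))
        # Trainable up-projection, zero at initialisation so the update starts at 0.
        self.B = np.zeros((out_dim, rank))
        self.scale = alpha / rank

    def __call__(self, x):
        # y = x W^T + scale * x A^T B^T
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_params(self):
        # Only A and B would receive gradients in a LoRA setup.
        return self.A.size + self.B.size


rng = np.random.default_rng(1)
W = rng.normal(size=(64, 64))   # stand-in for one pretrained projection matrix
layer = LoRALinear(W, rank=4)
x = rng.normal(size=(2, 64))    # two hypothetical token embeddings

frozen_out = x @ W.T            # what the frozen backbone alone would compute
lora_out = layer(x)             # identical at initialisation because B is zero
```

In this toy setting the low-rank factors hold 512 trainable values against 4096 frozen ones, which is the sense in which the adaptation is parameter-efficient. In the frozen-backbone regime, `A` and `B` would simply be absent and only the segmentation head would be trained.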

Abstract

Although vision foundation models (VFMs) are increasingly reused for biomedical image analysis, it remains unclear whether the latent representations they provide are general enough to support effective transfer and reuse across heterogeneous microscopy image datasets. Here, we study this question for the problem of mitochondria segmentation in electron microscopy (EM) images, using two popular public EM datasets (Lucchi++ and VNC) and three recent representative VFMs (DINOv2, DINOv3, and OpenCLIP). We evaluate two practical model adaptation regimes: a frozen-backbone setting in which only a lightweight segmentation head is trained on top of the VFM, and parameter-efficient fine-tuning (PEFT) via Low-Rank Adaptation (LoRA) in which the VFM is fine-tuned in a targeted manner to a specific dataset. Across all backbones, we observe that training on a single EM dataset yields good segmentation performance (quantified as foreground Intersection-over-Union), and that LoRA consistently improves in-domain performance. In contrast, training on multiple EM datasets leads to severe performance degradation for all models considered, with only marginal gains from PEFT. Exploration of the latent representation space through various techniques (PCA, Fréchet DINOv2 distance, and linear probes) reveals a pronounced and persistent domain mismatch between the two considered EM datasets in spite of their visual similarity, which is consistent with the observed failure of paired training. These results suggest that, while VFMs can deliver competitive results for EM segmentation within a single domain under lightweight adaptation, current PEFT strategies are insufficient to obtain a single robust model across heterogeneous EM datasets without additional domain-alignment mechanisms.
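The segmentation metric named above, foreground Intersection-over-Union, can be sketched as follows for binary masks. This is a generic definition rather than the paper's evaluation code; the function name `foreground_iou` and the convention of returning 1.0 when both masks are empty are assumptions.

```python
import numpy as np


def foreground_iou(pred, target):
    """IoU of the foreground (mitochondria) class for two binary masks."""
    pred = np.asarray(pred).astype(bool)
    target = np.asarray(target).astype(bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        # Assumption: two empty masks count as perfect agreement.
        return 1.0
    intersection = np.logical_and(pred, target).sum()
    return float(intersection / union)


# Toy 2x3 masks: 2 pixels overlap, 4 pixels are foreground in at least one mask.
pred = np.array([[1, 1, 0],
                 [0, 1, 0]])
target = np.array([[1, 0, 0],
                   [0, 1, 1]])
iou = foreground_iou(pred, target)  # 2 / 4 = 0.5
```

Because background pixels are ignored, the score is not inflated by the large non-mitochondrial area typical of EM slices.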

Paper Structure

This paper contains 39 sections, 2 equations, 4 figures, 8 tables.

Figures (4)

  • Figure 1: Domain mismatch between Lucchi++ and VNC in DINOv2 feature space. (a) PCA projection of image-level embeddings from the frozen DINOv2-L/14 backbone exhibits no overlap between the two datasets. (b) With LoRA enabled, the feature-space gap is minimally reduced, and the two domains remain clearly separable. We also report the FD-DINOv2 (equation labeled eq:fddino in the paper) and the LR probe performance for predicting the dataset label from the embedding vectors (FD-DINOv2, frozen: $79.32$; LoRA: $45.02$; probe accuracy/AUROC $=1.0$ in both).
  • Figure 2: Comparison to prior EM mitochondria segmentation approaches. Foreground IoU (IoU$_\mathrm{fg}$) for supervised baselines (top, light blue), zero-shot and prompt-based methods (middle, blue), and our VFM-based models with frozen (bottom, dark blue) and LoRA-adapted backbones (bottom, green). While VFMs are competitive in a single-dataset training scenario, performance collapses in the paired training setting, highlighting the impact of intra-EM domain mismatch.
  • Figure 3: Illustration of the two EM datasets used in this work. Representative slices from (a) Lucchi++ (mouse hippocampus, FIB-SEM) and (b) VNC (Drosophila ventral nerve cord, ssTEM), shown with their corresponding binary mitochondria ground-truth masks. The red box and inset highlight a local region to emphasize differences in texture/contrast and membrane appearance across datasets, despite both being EM images.
  • Figure 4: Domain mismatch between Lucchi++ and VNC (Stack2) in DINOv2 feature space. (a) PCA projection of image-level embeddings from the frozen DINOv2-L/14 backbone exhibits no overlap between the two datasets. (b) With LoRA enabled, the feature-space gap is minimally reduced, and the two domains remain clearly separable.
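
The two diagnostics reported in these captions can be sketched on synthetic data. The block below computes the standard Fréchet distance between Gaussians fitted to two embedding sets, and trains a small logistic-regression probe to predict which set an embedding came from. It assumes the paper's FD-DINOv2 follows this standard form; the random 8-dimensional "embeddings", the mean shift of 6, and all function names are hypothetical stand-ins for real DINOv2 features.

```python
import numpy as np


def frechet_distance(x, y):
    """Frechet distance between Gaussians fitted to two embedding sets."""
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.cov(x, rowvar=False)
    cov_y = np.cov(y, rowvar=False)
    # Tr((cov_x cov_y)^{1/2}) equals the sum of square roots of the
    # eigenvalues of cov_x @ cov_y, which are real and non-negative in theory.
    eigvals = np.linalg.eigvals(cov_x @ cov_y)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_x - mu_y) ** 2)
    return float(mean_term + np.trace(cov_x) + np.trace(cov_y) - 2.0 * tr_sqrt)


def fit_linear_probe(x, y, lr=0.1, steps=300):
    """Logistic regression trained with plain gradient descent."""
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(steps):
        z = np.clip(x @ w + b, -30.0, 30.0)  # clip logits to avoid overflow
        p = 1.0 / (1.0 + np.exp(-z))
        grad = p - y
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


rng = np.random.default_rng(0)
emb_a = rng.normal(size=(200, 8))        # hypothetical "dataset A" embeddings
emb_b = rng.normal(size=(200, 8)) + 6.0  # hypothetical "dataset B", shifted domain

fd_same = frechet_distance(emb_a, emb_a)  # a set compared with itself
fd_diff = frechet_distance(emb_a, emb_b)  # the two shifted domains

# Dataset-label probe: shuffle, split 300/100, centre with the training mean.
features = np.vstack([emb_a, emb_b])
labels = np.concatenate([np.zeros(200), np.ones(200)])
order = rng.permutation(len(labels))
train, test = order[:300], order[300:]
mean = features[train].mean(axis=0)
w, b = fit_linear_probe(features[train] - mean, labels[train])
pred = ((features[test] - mean) @ w + b) > 0
probe_accuracy = float((pred == labels[test]).mean())
```

A large Fréchet distance together with a near-perfect probe is the pattern the captions describe: the two domains sit in separable regions of feature space, so a classifier can tell them apart from embeddings alone.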