Generalization of Self-Supervised Vision Transformers for Protein Localization Across Microscopy Domains

Ben Isselmann, Dilara Göksu, Andreas Weinmann

TL;DR

The paper studies how well self-supervised Vision Transformers (DINO) generalize for protein localization across microscopy domains with different channel configurations. It compares DINO backbones pretrained on ImageNet-1k, HPA, and OpenCell, evaluating two embedding strategies and a two-stage downstream classifier on OpenCell. The key finding is that the HPA-pretrained model with channel mapping achieves the best mean macro $F_1$ score of $0.8221 \pm 0.0062$, slightly outperforming a DINO model trained directly on OpenCell ($0.8057 \pm 0.0090$). These results indicate that domain-relevant SSL representations can generalize across related microscopy datasets, enabling strong downstream performance even with limited task-specific labels.

Abstract

Task-specific microscopy datasets are often too small to train deep learning models that learn robust feature representations. Self-supervised learning (SSL) can mitigate this by pretraining on large unlabeled datasets, but it remains unclear how well such representations transfer across microscopy domains with different staining protocols and channel configurations. We investigate the cross-domain transferability of DINO-pretrained Vision Transformers for protein localization on the OpenCell dataset. We generate image embeddings using three DINO backbones pretrained on ImageNet-1k, the Human Protein Atlas (HPA), and OpenCell, and evaluate them by training a supervised classification head on OpenCell labels. All pretrained models transfer well, with the microscopy-specific HPA-pretrained model achieving the best performance (mean macro $F_1$ score $= 0.8221 \pm 0.0062$), slightly outperforming a DINO model trained directly on OpenCell ($0.8057 \pm 0.0090$). These results highlight the value of large-scale pretraining and indicate that domain-relevant SSL representations can generalize effectively to related but distinct microscopy datasets, enabling strong downstream performance even when task-specific labeled data are limited.
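
The reported metric, a mean macro $F_1$ score with a spread across folds, can be illustrated with a minimal sketch. It is not the authors' evaluation code. The function names `macro_f1` and `summarize_folds` are hypothetical, labels are assumed to be binary indicator matrices, a class with no true and no predicted positives is assumed to score 0, and the population standard deviation (`ddof=0`) is an assumption, since the text does not say which estimator was used.

```python
import numpy as np


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores.

    Both inputs are binary indicator matrices of shape (n_samples, n_classes).
    """
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_true & y_pred, axis=0)   # true positives per class
    fp = np.sum(~y_true & y_pred, axis=0)  # false positives per class
    fn = np.sum(y_true & ~y_pred, axis=0)  # false negatives per class
    denom = 2 * tp + fp + fn
    # Assumption: a class with no true and no predicted positives gets F1 = 0.
    f1 = np.divide(2 * tp, denom,
                   out=np.zeros(denom.shape, dtype=float),
                   where=denom > 0)
    return float(f1.mean())


def summarize_folds(fold_scores, ddof=0):
    """Mean and standard deviation of per-fold scores (ddof=0 is an assumption)."""
    scores = np.asarray(fold_scores, dtype=float)
    return float(scores.mean()), float(scores.std(ddof=ddof))


# Toy example with 3 samples and 2 classes (illustrative data only).
y_true = [[1, 0], [1, 1], [0, 1]]
y_pred = [[1, 0], [0, 1], [0, 1]]
score = macro_f1(y_true, y_pred)
```

In a 5-fold setup, `macro_f1` would be evaluated once per held-out fold and the five values passed to `summarize_folds` to obtain a "mean $\pm$ standard deviation" figure of the kind quoted above.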

Paper Structure (14 sections, 6 equations, 2 figures, 3 tables)

Figures (2)

  • Figure 1: Comparison of mean macro $F_1$ scores across different model configurations, evaluated over 5-fold cross-validation. Bars represent the mean $F_1$ score, and error bars indicate $\pm$ one standard deviation across folds. Model variants are grouped by pretraining dataset (HPA FOV, ImageNet-1k, OpenCell) and embedding approach (channel replication vs. channel mapping).
  • Figure 2: Visualization of the downstream analysis for two OpenCell samples, illustrating the predicted protein localization. The composite image (left) shows merged channels for protein (red) and nucleus (green). Individual channels are displayed separately (middle panels) alongside a classification table (right) with ground truth and predictions obtained by the best-performing fold of the 5-fold cross-validation using DINO pretrained on HPA FOV data with the channel mapping strategy. For the protein's subcellular localization, the model correctly predicted both the cytoplasmic and nucleoplasmic compartments.
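
The captions name two embedding approaches, channel replication and channel mapping, without describing them here. The sketch below shows one plausible reading and should not be taken as the authors' implementation. The function names, the four-slot input layout, the slot order in `HYPOTHETICAL_SLOTS`, and the choice to fill unused slots with zeros are all assumptions made for illustration.

```python
import numpy as np


def replicate_channel(channel, n_channels=3):
    """Channel replication: copy one grayscale channel into every input slot."""
    return np.repeat(channel[np.newaxis, ...], n_channels, axis=0)


def map_channels(channels, slot_of, n_slots):
    """Channel mapping: put each named source channel into the slot that held
    the same stain during pretraining.

    Assumption: slots without a matching source channel are filled with zeros.
    """
    first = next(iter(channels.values()))
    out = np.zeros((n_slots,) + first.shape, dtype=first.dtype)
    for name, image in channels.items():
        out[slot_of[name]] = image
    return out


# Hypothetical slot layout for a backbone pretrained on four-channel images.
HYPOTHETICAL_SLOTS = {"protein": 0, "nucleus": 1}

# Toy 2x2 images standing in for the two OpenCell channels.
protein = np.full((2, 2), 0.5, dtype=np.float32)
nucleus = np.full((2, 2), 0.25, dtype=np.float32)

replicated = replicate_channel(protein)            # shape (3, 2, 2)
mapped = map_channels({"protein": protein, "nucleus": nucleus},
                      HYPOTHETICAL_SLOTS, n_slots=4)  # shape (4, 2, 2)
```

Under this reading, replication lets a backbone process each channel independently of what it saw during pretraining, whereas mapping preserves the correspondence between a stain and its input slot.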