Generalization of Self-Supervised Vision Transformers for Protein Localization Across Microscopy Domains
Ben Isselmann, Dilara Göksu, Andreas Weinmann
TL;DR
The paper addresses cross-domain generalization of self-supervised Vision Transformers (DINO) for protein localization across microscopy domains with differing channels. It compares DINO backbones pretrained on ImageNet-1k, HPA, and OpenCell, evaluating two embedding strategies and a two-stage downstream classifier on OpenCell. The key finding is that the HPA-pretrained model with channel mapping achieves the best mean macro $F_1$ score of $0.8221 \pm 0.0062$, slightly outperforming a DINO model trained directly on OpenCell ($0.8057 \pm 0.0090$). These results indicate that domain-relevant SSL representations can generalize across related microscopy datasets, enabling strong downstream performance even with limited task-specific labels.
Abstract
Task-specific microscopy datasets are often too small to train deep learning models that learn robust feature representations. Self-supervised learning (SSL) can mitigate this by pretraining on large unlabeled datasets, but it remains unclear how well such representations transfer across microscopy domains with different staining protocols and channel configurations. We investigate the cross-domain transferability of DINO-pretrained Vision Transformers for protein localization on the OpenCell dataset. We generate image embeddings using three DINO backbones pretrained on ImageNet-1k, the Human Protein Atlas (HPA), and OpenCell, and evaluate them by training a supervised classification head on OpenCell labels. All pretrained models transfer well, with the microscopy-specific HPA-pretrained model achieving the best performance (mean macro $F_1$-score $= 0.8221 \pm 0.0062$), slightly outperforming a DINO model trained directly on OpenCell ($0.8057 \pm 0.0090$). These results highlight the value of large-scale pretraining and indicate that domain-relevant SSL representations can generalize effectively to related but distinct microscopy datasets, enabling strong downstream performance even when task-specific labeled data are limited.
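For readers unfamiliar with the reported metric, the following is a minimal sketch of how a mean macro $F_1$-score with a spread could be computed from several evaluation runs. The function names `macro_f1` and `summarize` are hypothetical, the toy labels are illustrative only, and the use of the sample standard deviation is an assumption; the abstract does not state which spread estimate or how many runs were used.

```python
from statistics import mean, stdev


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in labels or predictions."""
    classes = sorted(set(y_true) | set(y_pred))
    per_class = []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        per_class.append(2 * tp / denom if denom else 0.0)
    return mean(per_class)


def summarize(run_scores):
    """Return (mean, sample standard deviation) over repeated runs.

    Assumption: sample standard deviation (n - 1 in the denominator).
    """
    return mean(run_scores), stdev(run_scores)


if __name__ == "__main__":
    # Toy example with three localization classes (illustrative labels only).
    y_true = [0, 0, 1, 1, 2, 2]
    y_pred = [0, 1, 1, 1, 2, 0]
    print(f"macro F1: {macro_f1(y_true, y_pred):.4f}")

    # Hypothetical per-run scores, not values from the paper.
    m, s = summarize([0.80, 0.82, 0.84])
    print(f"mean macro F1: {m:.4f} +/- {s:.4f}")
```

Because macro averaging weights every class equally, rare localization classes influence the score as much as frequent ones, which is one reason it is a common choice for imbalanced label sets.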
