Distill-SODA: Distilling Self-Supervised Vision Transformer for Source-Free Open-Set Domain Adaptation in Computational Pathology

Guillaume Vray; Devavrat Tomar; Jean-Philippe Thiran; Behzad Bozorgtabar

TL;DR

Distill-SODA addresses source-free open-set domain adaptation in computational pathology: it distills knowledge from a self-supervised vision transformer (ViT) to adapt a source classifier without access to source data. It builds contextualized target embeddings with the ViT, partitions them with $K$-means, and computes refined closed-set prototypes from cluster-level attributes such as the Closed-Set Affinity Score (CSAS) and Class Prior, producing robust target-domain prototypes. The prototypes are used to generate pseudo-labels and to guide a distillation objective with a mean-squared-error loss, while an adversarial style augmentation (AdvStyle) strengthens ViT self-training under covariate shift. Across three public colorectal datasets, Distill-SODA achieves state-of-the-art results in both closed-set accuracy and open-set detection, shows strong resilience to covariate shift and good data efficiency, and works with various pre-trained or target-specific ViTs. The approach has practical value for clinical deployment: it reduces labeling needs, respects privacy constraints by not requiring source data, and improves robustness to unseen tissue types.
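The pseudo-labeling and distillation step summarized above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. The summary only states that prototypes yield pseudo-labels and that the distillation loss is a mean-squared error; the use of cosine similarity, the temperature-scaled softmax, and all function names (`prototype_pseudo_label`, `mse_distillation_loss`) are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def softmax(scores, temperature=1.0):
    """Numerically stable softmax of scores / temperature."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def prototype_pseudo_label(embedding, prototypes, temperature=0.1):
    """Soft pseudo-label over closed-set classes (assumed form):
    softmax of cosine similarities to one prototype per class."""
    sims = [cosine(embedding, p) for p in prototypes]
    return softmax(sims, temperature)


def mse_distillation_loss(student_probs, pseudo_label):
    """Mean-squared error between the adapted model's class
    probabilities and the prototype-based pseudo-label."""
    diffs = [(s - q) ** 2 for s, q in zip(student_probs, pseudo_label)]
    return sum(diffs) / len(diffs)


# Toy usage: two closed-set prototypes in a 2-D embedding space.
toy_prototypes = [[1.0, 0.0], [0.0, 1.0]]
toy_label = prototype_pseudo_label([0.9, 0.1], toy_prototypes)
toy_loss = mse_distillation_loss([0.5, 0.5], toy_label)
```

In practice the embeddings would come from the ViT and the loss would be minimized with a deep learning framework. An open-set sample that lies far from every prototype would need separate handling, which this sketch does not cover.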

Abstract

Developing computational pathology models is essential for reducing manual tissue typing from whole slide images, transferring knowledge from the source domain to an unlabeled, shifted target domain, and identifying unseen categories. We propose a practical setting by addressing the above-mentioned challenges in one fell swoop, i.e., source-free open-set domain adaptation. Our methodology focuses on adapting a pre-trained source model to an unlabeled target dataset and encompasses both closed-set and open-set classes. Beyond addressing the semantic shift of unknown classes, our framework also deals with a covariate shift, which manifests as variations in color appearance between source and target tissue samples. Our method hinges on distilling knowledge from a self-supervised vision transformer (ViT), drawing guidance from either robustly pre-trained transformer models or histopathology datasets, including those from the target domain. In pursuit of this, we introduce a novel style-based adversarial data augmentation, serving as hard positives for self-training a ViT, resulting in highly contextualized embeddings. Following this, we cluster semantically akin target images, with the source model offering weak pseudo-labels, albeit with uncertain confidence. To enhance this process, we present the closed-set affinity score (CSAS), aiming to correct the confidence levels of these pseudo-labels and to calculate weighted class prototypes within the contextualized embedding space. Our approach establishes itself as state-of-the-art across three public histopathological datasets for colorectal cancer assessment. Notably, our self-training method seamlessly integrates with open-set detection methods, resulting in enhanced performance in both closed-set and open-set recognition tasks.
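The clustering and weighted-prototype idea described in the abstract can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's method: the exact CSAS definition is not given in this summary, so `affinity` below is a hypothetical stand-in (the mean of the source model's maximum class probability within a cluster), and the per-cluster class prior is assumed to be the mean source prediction in that cluster. All function names are illustrative.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sq_dist(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm; centroids start at the first k points."""
    centroids = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                  for p in points]
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = mean_vector(members)
    return assign


def weighted_prototypes(embeddings, source_probs, assign, k):
    """One prototype per closed-set class as a weighted mean of embeddings.

    Sample weight for class c = affinity[cluster] * prior[cluster][c],
    where both cluster attributes are ASSUMED stand-ins (see lead-in).
    """
    n_classes = len(source_probs[0])
    dim = len(embeddings[0])
    affinity, prior = [], []
    for j in range(k):
        probs = [p for p, a in zip(source_probs, assign) if a == j]
        if probs:
            affinity.append(sum(max(p) for p in probs) / len(probs))
            prior.append(mean_vector(probs))
        else:  # empty cluster contributes nothing
            affinity.append(0.0)
            prior.append([0.0] * n_classes)
    prototypes = []
    for c in range(n_classes):
        weights = [affinity[a] * prior[a][c] for a in assign]
        total = sum(weights) + 1e-12
        prototypes.append([
            sum(w * z[d] for w, z in zip(weights, embeddings)) / total
            for d in range(dim)
        ])
    return prototypes


# Toy usage: four 2-D "ViT embeddings" with weak source-model predictions.
toy_embeddings = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
toy_source_probs = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.3, 0.7]]
toy_assign = kmeans(toy_embeddings, k=2)
toy_protos = weighted_prototypes(toy_embeddings, toy_source_probs, toy_assign, k=2)
```

The point of weighting by cluster-level attributes rather than per-sample confidence is that a single overconfident prediction has less influence when its neighbors in the embedding space disagree.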

Paper Structure (23 sections, 15 equations, 10 figures, 5 tables)

Introduction
Related Work
Open-Set Detection
Source-Free Open-Set Domain Adaptation
Materials and Methods
Closed-Set Class Prototypes in ViT Embedding Space
Enhancing Semantic Consistency in ViT Embedding Space via K-means Clustering
Defining Cluster Attributes: Closed-Set Affinity Score (CSAS), Class Prior, and Class Conditional Mean
Closed-Set Affinity Score (CSAS)
Class Prior
Class Conditional Mean
Computing closed-set class prototypes using cluster attributes
Source-Free Adaptation via ViT Guided Closed-Set Class Prototypes
Self-Supervised ViT Training via Automatic Adversarial Style Augmentation
Experiments and Results
...and 8 more sections

Figures (10)

Figure 1: Unveiling the self-supervised vision transformer (ViT) for source-free open-set domain adaptation (SF-OSDA). The source model $f_s$ undergoes adaptation, resulting in the adapted model $f_t$ acclimating to an unlabeled target domain, accommodating both closed-set (known) and open-set (unknown) classes, all while maintaining a strict boundary of not accessing the source dataset. Our methodology capitalizes on distilling knowledge from a self-supervised ViT, leveraging its potent capability to generate contextually enriched target embeddings. Guidance for knowledge distillation can originate from two principal sources: self-supervised pre-trained transformer models without adaptation and models that have undergone extensive self-supervised pre-training on publicly available histopathology datasets or target domain data, showcasing our approach's adaptability.
Figure 2: ViT-guided closed-set class prototypes. Utilizing the self-supervised ViT feature extractor $\mathcal{F}$ in conjunction with the source model $f_s$, we first obtain the contextualized embeddings for target images via $\mathcal{F}$, which are weakly labeled by $f_s$ and subsequently grouped into $K$ clusters using $K$-means clustering. Leveraging the closed-set affinity score (CSAS), detailed in the section of the same name, we refine the confidence of weak pseudo-labels and compute weighted class prototypes of closed-set (known) classes within the embedding space of $\mathcal{F}$. Note that the final closed-set class prototypes exhibit improved decision boundaries, with samples from open-set classes being assigned low confidence scores. In this illustration, the three shapes represent closed-set classes A and B and the open-set class, with the color intensity signifying the source model's confidence level in categorizing the target images into known classes A and B.
Figure 3: Vision transformer self-training with adversarial style augmentations. The style augmentation module $f_\text{style}$ is trained to learn magnitudes $\hat{m}$ for adversarial augmentations by maximizing $\mathcal{L}_\text{DINO}$. Concurrently, the ViT encoder $\mathcal{F}$ is updated to minimize $\mathcal{L}_\text{DINO}$ on the target domain, establishing a dynamic adversarial training setup.
Figure 4: Breakdown of dataset splits. This figure shows how the tissue-type classes of the three CRC datasets were divided into closed-set classes (green) and open-set classes (red).
Figure 5: t-SNE visualization comparison of feature embeddings obtained from the self-supervised transformer encoder $\mathcal{F}$. This figure showcases the distinctions among tissue types within the target domain images (Kather-19, Split 1) using t-SNE (van der Maaten and Hinton, 2008). Each tissue type is uniquely color-coded for clarity: (a) reflects the actual labels; (b) illustrates the feature embedding from the BatchNorm re-calibrated source model $f_s$ trained on Kather-16; and (c) depicts the feature embedding from our proposed method, $\text{Distill-SODA}$. The weighted average class prototypes are obtained in the transformer’s embedding space, with the intensity of the color signifying the confidence level of the labels.
...and 5 more figures
