TMT: Cross-domain Semantic Segmentation with Region-adaptive Transferability Estimation

Enming Zhang, Zhengyu Li, Yanru Wu, Jingge Wang, Yang Tan, Guan Wang, Yang Li, Xiaoping Zhang

TL;DR

This paper tackles cross-domain semantic segmentation with Vision Transformers by addressing region-wise transferability. It introduces ACTE, an Adaptive Cluster-based Transferability Estimator, which partitions images into coherent regions and estimates each region's transferability, and Transferable Masked Attention (TMA), which gates the self-attention mechanism with these region-transferability cues. Together, the proposed ACTE and TMA yield measurable improvements across 20 source-target pairs on five benchmarks, demonstrating robust handling of domain shifts and better region-level segmentation boundaries. The results suggest that region-adaptive transferability guidance in Transformer-based segmentation offers substantial practical benefits for real-world cross-domain applications.

Abstract

Recent advances in Vision Transformers (ViTs) have significantly improved semantic segmentation performance. However, their adaptation to new target domains remains challenged by distribution shifts, which often disrupt global attention mechanisms. While existing global and patch-level adaptation methods offer some improvements, they overlook the spatially varying transferability inherent in different image regions. To address this, we propose the Transferable Mask Transformer (TMT), a region-adaptive framework designed to enhance cross-domain representation learning through transferability guidance. First, we dynamically partition the image into coherent regions, grouped by structural and semantic similarity, and estimate their domain transferability at a localized level. Then, we incorporate region-level transferability maps directly into the self-attention mechanism of ViTs, allowing the model to adaptively focus attention on areas with lower transferability and higher semantic uncertainty. Extensive experiments across 20 diverse cross-domain settings demonstrate that TMT not only mitigates the performance degradation typically associated with domain shift but also consistently outperforms existing approaches.
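
The second step described above, feeding a region-level transferability map into self-attention, can be illustrated with a minimal NumPy sketch. It is not the paper's TMA formulation: the additive logit bias `beta * (1 - t)`, the function names, and the toy region scores are all assumptions chosen here for illustration. The sketch only shows one plausible way that per-region scores could steer attention toward less transferable patches.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def region_to_patch_map(region_ids, region_scores):
    # Broadcast one transferability score per region to every patch in it.
    return region_scores[region_ids]

def transferability_biased_attention(q, k, v, patch_transfer, beta=1.0):
    # Single-head attention over N patches.
    # ASSUMPTION: keys with lower transferability receive a larger additive
    # bias on the logits; the source does not specify the gating form.
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)              # (N, N)
    bias = beta * (1.0 - patch_transfer)       # (N,), larger = less transferable
    weights = softmax(logits + bias[None, :], axis=-1)
    return weights @ v, weights

# Toy example: 6 patches grouped into 3 hypothetical regions.
rng = np.random.default_rng(0)
n_patches, dim = 6, 8
x = rng.normal(size=(n_patches, dim))
region_ids = np.array([0, 0, 1, 1, 2, 2])
region_scores = np.array([0.9, 0.5, 0.1])      # region 2 is least transferable

patch_transfer = region_to_patch_map(region_ids, region_scores)
out, weights = transferability_biased_attention(x, x, x, patch_transfer, beta=2.0)
_, plain = transferability_biased_attention(x, x, x, patch_transfer, beta=0.0)

# Attention mass on the least transferable region, with and without the bias.
print(weights[:, 4:].sum(axis=1))
print(plain[:, 4:].sum(axis=1))
```

With `beta=0.0` the function reduces to ordinary scaled dot-product attention, so comparing the two printed vectors shows how much attention the bias shifts toward the least transferable region in this toy setup.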

Paper Structure

This paper contains 14 sections, 7 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Panel (a) illustrates domain gaps across datasets, reflecting real-world challenges. Panels (b) and (c) show that vanilla fine-tuning produces misleading attention masks (red) that focus on irrelevant areas, while our method yields accurate masks (blue) by emphasizing task-critical regions such as the car. Panel (d) compares region partition methods, with our approach adaptively segmenting images for better transferability estimation.
  • Figure 2: Overview of the framework. Training begins with ACTE, which is first trained using both source and target data. The lower section illustrates how ACTE evaluates regions and assigns them different transferability scores, represented by different colors in the transferability map. After ACTE has been trained, its output (region-level transferability maps) is used to guide the training of the TMA within the attention mechanism (top-right).
  • Figure 3: Visualization of segmentation results where models pretrained on Cityscapes are transferred to the BDD dataset.
  • Figure 4: Visualization of transferability maps between the Cityscapes (target domain) and Mapillary (source domain) datasets. The maps indicate regions with high transferability in deep colors and low transferability in lighter colors.