Masked Representation Modeling for Domain-Adaptive Segmentation

Wenlve Zhou, Zhiheng Zhou, Tiantao Xian, Yikui Zhai, Weibin Wu, Biyun Ma

Abstract

Unsupervised domain adaptation (UDA) for semantic segmentation seeks to transfer models from a labeled source domain to an unlabeled target domain. While auxiliary self-supervised tasks such as contrastive learning have enhanced feature discriminability, masked modeling remains underexplored due to architectural constraints and misaligned objectives. We propose Masked Representation Modeling (MRM), an auxiliary task that performs representation masking and reconstruction directly in the latent space. Unlike prior masked modeling methods that reconstruct low-level signals (e.g., pixels or visual tokens), MRM targets high-level semantic features, aligning its objective with segmentation and integrating seamlessly into standard architectures like DeepLab and DAFormer. To support efficient reconstruction, we design a lightweight auxiliary module, Rebuilder, which is jointly trained with the segmentation network but removed during inference, introducing zero test-time overhead. Extensive experiments demonstrate that MRM consistently improves segmentation performance across diverse architectures and UDA benchmarks. When integrated with four representative baselines, MRM achieves an average gain of +2.3 mIoU on GTA $\rightarrow$ Cityscapes and +2.8 mIoU on Cityscapes $\rightarrow$ Synthia, establishing it as a simple, effective, and generalizable strategy for unsupervised domain-adaptive semantic segmentation.

Paper Structure

This paper contains 19 sections, 9 equations, 6 figures, 6 tables, 1 algorithm.

Figures (6)

  • Figure 1: Comparison of three auxiliary tasks for UDA segmentation. (a) Contrastive Learning (CL) uses contrastive loss for feature alignment but does not train the decoder, limiting end-to-end optimization. (b) Masked Image Modeling (MIM) reconstructs masked components but disrupts the segmentation pipeline, reducing compatibility with certain architectures. (c) Masked Representation Modeling (MRM) performs masking and reconstruction in latent space, aligns with the segmentation task, remains compatible with diverse architectures, and improves performance without inference overhead, as the Rebuilder is used only during training.
  • Figure 2: The pipeline of the Rebuilder. The Rebuilder randomly masks out part of the encoder representations and reconstructs the masked components. It first scales the encoder representations along both the spatial and channel dimensions, and then applies random masking to remove a subset of them. The masked representations are then passed through several Transformer blocks and a projector to generate reconstructed representations, which are fed to the decoder for model training. [M] is a learnable token.
  • Figure 3: An overview of the projector. The representations from the Transformer are reshaped and fed into the projector, which uses several transposed convolutions to generate features at different scales. (a) The projector details and (b) the multi-scale projector.
  • Figure 4: Masking ratio. The largest performance gain is achieved at a masking ratio of 40%.
  • Figure 5: Qualitative comparison of MRM with previous methods on GTA $\rightarrow$ Cityscapes. To ensure a fair comparison and to demonstrate MRM's capability in contextual semantic consistency and long-range dependency modeling, we uniformly adopt the DeepLabv2 [chen2017deeplab] with ResNet-101 [he2016deep] architecture.
  • ...and 1 more figures
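
The random masking step described in the Figure 2 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `random_mask_tokens`, the per-token masking granularity, the zero-initialized stand-in for the learnable [M] token, and the toy feature shape are all assumptions made for illustration. Only the idea of replacing a random subset of latent representations with [M], and the 40% ratio from Figure 4, come from the text. The scaling step, the Transformer blocks, and the projector are omitted.

```python
import numpy as np


def random_mask_tokens(tokens, mask_token, mask_ratio=0.4, rng=None):
    """Replace a random subset of latent tokens with a shared mask token.

    tokens:     (N, C) array of encoder representations, flattened spatially.
    mask_token: (C,) array standing in for the learnable [M] token.
    Returns the masked tokens and a boolean vector (True = masked position).
    """
    rng = np.random.default_rng() if rng is None else rng
    n_tokens = tokens.shape[0]
    n_masked = int(round(n_tokens * mask_ratio))

    # Choose which positions to mask, without repetition.
    masked_idx = rng.choice(n_tokens, size=n_masked, replace=False)
    is_masked = np.zeros(n_tokens, dtype=bool)
    is_masked[masked_idx] = True

    # Masked positions receive the mask token; the rest are left unchanged.
    masked = np.where(is_masked[:, None], mask_token[None, :], tokens)
    return masked, is_masked


# Toy example (hypothetical sizes): an 8x8 feature map with 16 channels.
features = np.random.default_rng(0).normal(size=(8 * 8, 16))
mask_token = np.zeros(16)  # a trainable parameter in a real model
masked, is_masked = random_mask_tokens(
    features, mask_token, mask_ratio=0.4, rng=np.random.default_rng(1)
)
```

In a full training setup, the masked tokens would be passed to the reconstruction network, and `is_masked` could be used to inspect or weight the masked positions; how the paper uses that information is not stated in this excerpt.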