MICDrop: Masking Image and Depth Features via Complementary Dropout for Domain-Adaptive Semantic Segmentation

Linyan Yang, Lukas Hoyer, Mark Weber, Tobias Fischer, Dengxin Dai, Laura Leal-Taixé, Marc Pollefeys, Daniel Cremers, Luc Van Gool

TL;DR

The paper addresses domain shift in semantic segmentation, where appearance-based UDA methods handle fine structures poorly. It proposes MICDrop, which uses depth as a complementary modality and combines complementary feature masking with a cross-modality fusion module to enforce joint RGB-depth representations. The method acts as a plugin for existing UDA approaches. Evaluations on GTA→Cityscapes and SYNTHIA→Cityscapes show consistent mIoU gains across DAFormer, HRDA, and MIC, setting a new state of the art on GTA→Cityscapes and improving results on SYNTHIA→Cityscapes. The results indicate that depth-guided fusion and complementary dropout reduce the domain gap, especially for thin structures and boundary delineation, while remaining a lightweight plugin for existing pipelines.

Abstract

Unsupervised Domain Adaptation (UDA) is the task of bridging the domain gap between a labeled source domain, e.g., synthetic data, and an unlabeled target domain. We observe that current UDA methods show inferior results on fine structures and tend to oversegment objects with ambiguous appearance. To address these shortcomings, we propose to leverage geometric information, i.e., depth predictions, as depth discontinuities often coincide with segmentation boundaries. We show that naively incorporating depth into current UDA methods does not fully exploit the potential of this complementary information. To this end, we present MICDrop, which learns a joint feature representation by masking image encoder features while inversely masking depth encoder features. With this simple yet effective complementary masking strategy, we enforce the use of both modalities when learning the joint feature representation. To aid this process, we propose a feature fusion module to improve both global as well as local information sharing while being robust to errors in the depth predictions. We show that our method can be plugged into various recent UDA methods and consistently improve results across standard UDA benchmarks, obtaining new state-of-the-art performances.
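The complementary masking idea from the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `complementary_dropout`, the patch-wise mask layout, and the `mask_ratio` and `patch` parameters are assumptions made for illustration. The only property taken from the text is that positions masked in the image features are kept in the depth features, and vice versa.

```python
import numpy as np


def complementary_dropout(rgb_feat, depth_feat, mask_ratio=0.5, patch=2, rng=None):
    """Mask RGB features with a random patch mask and depth features with its inverse.

    rgb_feat, depth_feat: arrays of shape (C, H, W) sharing the same H and W.
    mask_ratio and patch are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    _, h, w = rgb_feat.shape

    # Draw a coarse keep/drop grid (ceil division covers the borders).
    gh, gw = -(-h // patch), -(-w // patch)
    coarse_keep = rng.random((gh, gw)) >= mask_ratio  # True = keep in RGB

    # Upsample the coarse grid to feature resolution and crop to (H, W).
    keep_rgb = np.repeat(np.repeat(coarse_keep, patch, axis=0), patch, axis=1)[:h, :w]

    # Depth keeps exactly the positions that RGB drops.
    keep_depth = ~keep_rgb

    # The (H, W) masks broadcast over the channel dimension.
    return rgb_feat * keep_rgb, depth_feat * keep_depth, keep_rgb


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rgb = rng.normal(size=(8, 6, 6))
    depth = rng.normal(size=(8, 6, 6))
    rgb_m, depth_m, keep = complementary_dropout(rgb, depth, rng=rng)
    print("fraction of positions kept in RGB:", keep.mean())
```

Because every spatial position survives in exactly one modality, a fusion module that receives both masked feature maps cannot rely on the image features alone, which is the behavior the abstract describes.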

Paper Structure (17 sections, 5 equations, 10 figures, 6 tables)

Figures (10)

  • Figure 1: Previous UDA methods such as MIC struggle with the segmentation of fine structures (top row) and oversegmentation of difficult objects (bottom row). Therefore, we propose MICDrop to improve semantic segmentation UDA with depth estimates, which can capture fine structures and are consistent within object boundaries. We apply MICDrop to four different methods on the GTA$\to$Cityscapes benchmark and show consistent improvements.
  • Figure 2: Method overview. Our proposed architecture is visualized on the left side. We use a light-weight hierarchical depth encoder and process the features in our proposed cross-modal feature fusion module. On the right side, we illustrate our training pipeline, in which source and target images are fed through the student encoders. Then, our proposed cross-modality complementary dropout is applied to the corresponding features on each feature resolution. Finally, we feed them through our fusion block, followed by the decoder, to make a final prediction.
  • Figure 3: Feature fusion of RGB and depth. The presented method comprises two key components: a global and a local attention module. The local attention module refines information coming from depth within a local window by using sigmoid gates. In contrast, the global attention module aggregates image features based on the similarity of their corresponding depth features, thus providing more global context. Finally, the residual feature fusion block fuses all features.
  • Figure 4: Qualitative results. These results show the improvements of MICDrop in comparison to MIC (HRDA). We highlight improvements on thin structures, such as poles and traffic signs, as well as on larger objects like trucks, buses, and fences. In rows 1, 3, and 4, we can see that thin structures have a distinct depth profile, which helps in predicting accurate boundaries. In rows 2, 4, and 5, we observe that the depth regions for the fence, bus, and truck are smooth, improving the consistency of the predicted segmentation.
  • Figure S1: Classwise performance. This figure highlights not only improved average performance but also a reduction of strong deviations in classwise performances when using a frozen backbone. The dotted checkpoint line indicates the model's performance at its initialization with pretrained weights.
  • ...and 5 more figures
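
The Figure 3 caption describes two ideas that a small NumPy sketch can make concrete: attention weights computed from depth-feature similarity and applied to image features, and sigmoid gating of the depth features before a residual sum. The sketch is a simplified illustration under stated assumptions, not the paper's module. The names `depth_guided_global_attention`, `gated_depth`, and `fuse`, and the weight matrix `w_gate`, are hypothetical. Learned projections and normalization are omitted, and a per-position gate stands in for the local window mentioned in the caption.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def depth_guided_global_attention(rgb, depth):
    """Aggregate image tokens using attention weights from depth similarity.

    rgb: (N, C) image tokens, depth: (N, D) depth tokens for the same positions.
    """
    scores = depth @ depth.T / np.sqrt(depth.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ rgb


def gated_depth(depth, w_gate):
    """Scale depth tokens by a sigmoid gate (assumption: per-position gate,
    used here in place of the local window from the caption)."""
    return sigmoid(depth @ w_gate) * depth


def fuse(rgb, depth, w_gate):
    """Residual fusion: image tokens plus gated depth plus global context.

    Assumes rgb and depth share the same shape (N, C).
    """
    return rgb + gated_depth(depth, w_gate) + depth_guided_global_attention(rgb, depth)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rgb = rng.normal(size=(16, 8))
    depth = rng.normal(size=(16, 8))
    w_gate = rng.normal(size=(8, 8)) * 0.1
    print(fuse(rgb, depth, w_gate).shape)
```

Since the gate lies in (0, 1), unreliable depth tokens can be down-weighted rather than added at full strength, which is one plausible way to read the robustness to depth errors claimed in the abstract.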