Language-Guided Instance-Aware Domain-Adaptive Panoptic Segmentation

Elham Amin Mansour, Ozan Unal, Suman Saha, Benjamin Bejar, Luc Van Gool

TL;DR

This work tackles panoptic unsupervised domain adaptation by addressing the distinct challenges of semantic and instance alignment across domains. It introduces LIDAPS, an end-to-end framework that combines IMix, an instance-aware cross-domain mixing strategy that pastes high-confidence target instances onto source images, with CDA, a CLIP-based semantic regularizer that aligns both domains in a language-vision space. IMix improves instance segmentation by preserving exhaustive pseudo-labels, while CDA mitigates catastrophic forgetting of semantics, yielding balanced gains. Across synthetic-to-real and real-to-real benchmarks, LIDAPS achieves state-of-the-art mean Panoptic Quality (mPQ), demonstrating robust improvements in instance recognition while maintaining or enhancing semantic accuracy.

Abstract

The increasing relevance of panoptic segmentation is tied to the advancements in autonomous driving and AR/VR applications. However, the deployment of such models has been limited due to the expensive nature of dense data annotation, giving rise to unsupervised domain adaptation (UDA). A key challenge in panoptic UDA is reducing the domain gap between a labeled source and an unlabeled target domain while harmonizing the subtasks of semantic and instance segmentation to limit catastrophic interference. While considerable progress has been achieved, existing approaches mainly focus on the adaptation of semantic segmentation. In this work, we focus on incorporating instance-level adaptation via a novel instance-aware cross-domain mixing strategy IMix. IMix significantly enhances the panoptic quality by improving instance segmentation performance. Specifically, we propose inserting high-confidence predicted instances from the target domain onto source images, retaining the exhaustiveness of the resulting pseudo-labels while reducing the injected confirmation bias. Nevertheless, such an enhancement comes at the cost of degraded semantic performance, attributed to catastrophic forgetting. To mitigate this issue, we regularize our semantic branch by employing CLIP-based domain alignment (CDA), exploiting the domain-robustness of natural language prompts. Finally, we present an end-to-end model incorporating these two mechanisms called LIDAPS, achieving state-of-the-art results on all popular panoptic UDA benchmarks.
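The instance-level mixing described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `imix`, the threshold value of 0.9, the array layouts, and the way new instance ids are assigned are all assumptions made for illustration. The sketch only shows the core idea of pasting predicted target instances whose confidence exceeds a threshold onto a source image, together with their pseudo-labels.

```python
import numpy as np

def imix(src_img, src_sem, src_inst, tgt_img, tgt_masks, tgt_scores, tgt_classes,
         threshold=0.9):
    """Paste high-confidence target pseudo-instances onto a source sample.

    src_img, tgt_img: (H, W, 3) images.
    src_sem, src_inst: (H, W) integer semantic and instance label maps of the source.
    tgt_masks: iterable of (H, W) boolean masks predicted on the target image.
    tgt_scores, tgt_classes: per-mask confidence and predicted class id.
    threshold: hypothetical confidence cut-off (value chosen for illustration).
    """
    mixed_img = src_img.copy()
    mixed_sem = src_sem.copy()
    mixed_inst = src_inst.copy()
    # Assumption: pasted instances receive fresh ids above the existing ones.
    next_id = int(mixed_inst.max()) + 1

    for mask, score, cls in zip(tgt_masks, tgt_scores, tgt_classes):
        if score < threshold:
            continue  # low-confidence predictions are not pasted
        mixed_img[mask] = tgt_img[mask]   # copy target pixels
        mixed_sem[mask] = cls             # pseudo semantic label
        mixed_inst[mask] = next_id        # pseudo instance label
        next_id += 1

    return mixed_img, mixed_sem, mixed_inst
```

Because the background of the mixed sample comes from the labeled source image, every pixel of the result carries either a ground-truth label or a high-confidence pseudo-label, which is one way to read the abstract's remark about retaining exhaustive labels.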

Paper Structure (20 sections, 29 equations, 9 figures, 9 tables)

This paper contains 20 sections, 29 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: While previous SOTA methods for panoptic UDA such as EDAPS achieve good semantic segmentation performance, they struggle to predict correct object boundaries and thus instance segmentation masks.
  • Figure 1: Comparison of LIDAPS with SOTA on different aspects such as self-training (ST) type; ST feature space: semantic (Sem) vs. instance (Inst); shared (SR) vs. task-specific (TR) representations; sampling strategies: ClassMix vs. the proposed IMix; and the proposed CLIP-based domain alignment (CDA).
  • Figure 2: Illustration of the LIDAPS pipeline. (Green) The baseline panoptic UDA model is built on a mean-teacher framework and consists of a common transformer encoder and individual task decoders. The student model is supervised directly from source domain labels as well as semantically mixed inputs whose labels are generated by the teacher model. (Blue) We apply IMix to further adapt the instance segmentation branch of LIDAPS, mixing high-confidence predicted target instances with source images. Blue paths are only active when self-training with IMix is enabled. (Orange) We regularize the semantic branch via CLIP-based domain alignment that utilizes similarity maps to reduce catastrophic forgetting.
  • Figure 3: Our proposed IMix cuts and pastes pseudo-instances with a confidence score above a certain threshold from target to source, while in DACS half of the semantic classes are pasted from source to target without preserving instance-level information.
  • Figure 4: Pipeline used to compute the pixel-text similarity map for CLIP-based domain alignment. We generate class-wise CLIP mean features from a series of fixed text prompts (a-c). C: number of semantic classes; D: dimension of the text encodings; H, W: height and width of the image; P: number of prompts (e.g. "A painting of a person"). The similarity maps are given by the inner product with the semantic decoder features.
  • ...and 4 more figures
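
The shapes named in the Figure 4 caption (C, D, H, W, P) suggest how the pixel-text similarity map can be computed. The following is a minimal sketch under stated assumptions, not the paper's code: the function name `pixel_text_similarity` is hypothetical, the text encodings are taken as a precomputed array rather than produced by an actual CLIP call, and the optional L2 normalization (which turns the inner product into a cosine similarity) is an assumption that the caption does not confirm.

```python
import numpy as np

def pixel_text_similarity(text_feats, pixel_feats, normalize=True, eps=1e-8):
    """Compute a class-wise similarity map between text and pixel features.

    text_feats: (C, P, D) text encodings, one per class and prompt
                (assumed to be precomputed, e.g. by a frozen text encoder).
    pixel_feats: (D, H, W) features from the semantic decoder.
    normalize: assumption; if True, both sides are L2-normalized first.
    Returns a (C, H, W) similarity map.
    """
    # Average over the P prompts to get one mean feature per class: (C, D).
    class_feats = text_feats.mean(axis=1)

    if normalize:
        class_feats = class_feats / (
            np.linalg.norm(class_feats, axis=1, keepdims=True) + eps)
        pixel_feats = pixel_feats / (
            np.linalg.norm(pixel_feats, axis=0, keepdims=True) + eps)

    # Inner product over the feature dimension D for every class and pixel.
    return np.einsum("cd,dhw->chw", class_feats, pixel_feats)
```

Taking the argmax of the returned map over its first axis gives, per pixel, the class whose mean text feature is most similar to the decoder feature at that location.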