Language-Guided Instance-Aware Domain-Adaptive Panoptic Segmentation
Elham Amin Mansour, Ozan Unal, Suman Saha, Benjamin Bejar, Luc Van Gool
TL;DR
This work tackles panoptic unsupervised domain adaptation by addressing the distinct challenges of semantic and instance alignment across domains. It introduces LIDAPS, an end-to-end framework that combines IMix, an instance-aware cross-domain mixing strategy that pastes high-confidence target instances onto source images, with CDA, a CLIP-based semantic regularizer that aligns both domains in a language-vision space. IMix improves instance segmentation by preserving exhaustive pseudo-labels, while CDA mitigates catastrophic forgetting of semantics, yielding balanced gains. Across synthetic-to-real and real-to-real benchmarks, LIDAPS achieves state-of-the-art mean Panoptic Quality (mPQ), demonstrating robust improvements in instance recognition while maintaining or enhancing semantic accuracy.
Abstract
The increasing relevance of panoptic segmentation is tied to the advancements in autonomous driving and AR/VR applications. However, the deployment of such models has been limited by the expensive nature of dense data annotation, giving rise to unsupervised domain adaptation (UDA). A key challenge in panoptic UDA is reducing the domain gap between a labeled source and an unlabeled target domain while harmonizing the subtasks of semantic and instance segmentation to limit catastrophic interference. While considerable progress has been achieved, existing approaches mainly focus on the adaptation of semantic segmentation. In this work, we focus on incorporating instance-level adaptation via a novel instance-aware cross-domain mixing strategy, IMix. IMix significantly enhances the panoptic quality by improving instance segmentation performance. Specifically, we propose inserting high-confidence predicted instances from the target domain onto source images, retaining the exhaustiveness of the resulting pseudo-labels while reducing the injected confirmation bias. Nevertheless, such an enhancement comes at the cost of degraded semantic performance, attributed to catastrophic forgetting. To mitigate this issue, we regularize our semantic branch by employing CLIP-based domain alignment (CDA), exploiting the domain-robustness of natural language prompts. Finally, we present LIDAPS, an end-to-end model incorporating these two mechanisms, which achieves state-of-the-art results on all popular panoptic UDA benchmarks.
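To make the IMix idea concrete, the following is a minimal sketch of pasting high-confidence target instances onto a source image. It is an illustration under stated assumptions, not the authors' implementation: the function name imix_paste, the array layout (boolean instance masks, separate semantic and instance-ID maps), and the confidence threshold tau are all hypothetical choices made for this example.

```python
import numpy as np

def imix_paste(src_img, src_sem, src_inst,
               tgt_img, tgt_masks, tgt_classes, tgt_scores, tau=0.9):
    """Paste confident predicted target instances onto a labeled source image.

    Hypothetical layout (an assumption of this sketch):
      src_img, tgt_img : (H, W, 3) images of equal size
      src_sem          : (H, W) ground-truth semantic class map of the source
      src_inst         : (H, W) ground-truth instance-ID map (0 = no instance)
      tgt_masks        : list of (H, W) boolean masks predicted on the target
      tgt_classes      : predicted class ID per mask
      tgt_scores       : predicted confidence per mask
      tau              : assumed confidence threshold for keeping an instance
    """
    mixed_img = src_img.copy()
    mixed_sem = src_sem.copy()
    mixed_inst = src_inst.copy()
    next_id = int(src_inst.max()) + 1  # fresh IDs for pasted instances

    for mask, cls, score in zip(tgt_masks, tgt_classes, tgt_scores):
        if score < tau:
            continue  # low-confidence predictions are never pasted
        mixed_img[mask] = tgt_img[mask]   # copy target pixels
        mixed_sem[mask] = cls             # overwrite semantic label
        mixed_inst[mask] = next_id        # assign a new instance ID
        next_id += 1

    return mixed_img, mixed_sem, mixed_inst
```

Because the background of the mixed image keeps its source ground-truth labels and only confident target instances are added, every object in the mixed image has a label, which is one way to read the abstract's point about exhaustive pseudo-labels. Discarding low-confidence predictions is the part of this sketch that corresponds to limiting confirmation bias.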
