
MC-PanDA: Mask Confidence for Panoptic Domain Adaptation

Ivan Martinović, Josip Šarić, Siniša Šegvić

TL;DR

MC-PanDA tackles domain-adaptive panoptic segmentation by exploiting the uncertainty estimates of mask transformers through two mechanisms: Mask-wide Loss Scaling (MLS) and Confidence-based Point Filtering (CBPF). These components downweight and selectively sample learning signals from target-domain pseudo-labels, mitigating noise amplification in Mean-Teacher self-training. The method sets a new state of the art on Synthia→Cityscapes (47.4 $PQ_{16}$, +6.2 pp) and shows strong gains on other synthetic-to-real benchmarks, with ablations confirming that MLS and CBPF contribute complementarily. This region-aware uncertainty approach advances practical panoptic domain adaptation, improving robustness to domain shift and opening promising avenues for autonomous scene understanding in unlabeled target domains.
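
Mean-Teacher self-training keeps a teacher network whose weights are an exponential moving average (EMA) of the student's weights, and the teacher produces the target-domain pseudo-labels. The sketch below illustrates only that generic EMA update, not MC-PanDA's code: parameters are modelled as plain Python lists keyed by name (a real implementation would operate on framework tensors), and the default momentum `alpha=0.999` is an assumed value, not one taken from the paper.

```python
from typing import Dict, List

Params = Dict[str, List[float]]  # parameter name -> flattened values (stand-in for tensors)


def ema_update(teacher: Params, student: Params, alpha: float = 0.999) -> None:
    """Mean Teacher update, in place: teacher <- alpha * teacher + (1 - alpha) * student.

    `alpha` is an assumed momentum, not a value reported for MC-PanDA.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    for name, t_vals in teacher.items():
        s_vals = student[name]
        if len(s_vals) != len(t_vals):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        for k, (t, s) in enumerate(zip(t_vals, s_vals)):
            t_vals[k] = alpha * t + (1.0 - alpha) * s


# Usage: one step with a small momentum so that the change is visible.
teacher = {"w": [0.0, 1.0]}
student = {"w": [1.0, 1.0]}
ema_update(teacher, student, alpha=0.9)
```

Only the student receives gradients, so the teacher drifts slowly toward it; this smoothing is what makes the teacher's pseudo-labels comparatively stable during self-training.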

Abstract

Domain-adaptive panoptic segmentation promises to resolve the long tail of corner cases in natural scene understanding. The previous state of the art addresses this problem with cross-task consistency, careful system-level optimization, and heuristic improvement of teacher predictions. In contrast, we propose to build upon the remarkable capability of mask transformers to estimate their own prediction uncertainty. Our method avoids noise amplification by leveraging the fine-grained confidence of panoptic teacher predictions. In particular, we modulate the loss with mask-wide confidence and discourage back-propagation in pixels where the teacher is uncertain or the student is already confident. Experimental evaluation on standard benchmarks reveals a substantial contribution of the proposed selection techniques. We report 47.4 PQ on Synthia to Cityscapes, which corresponds to an improvement of 6.2 percentage points over the state of the art. The source code is available at https://github.com/helen1c/MC-PanDA.
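
The loss modulation described in the abstract can be illustrated for a single mask. This is a minimal sketch under stated assumptions, not the authors' implementation: the pixel affinity (teacher confidence times a student-uncertainty term `1 - |2p - 1|`), the deterministic top-k point selection used in place of random sampling, and the binary cross-entropy mask loss are all hypothetical choices made for illustration. The mask-wide confidence `lam` scales the whole mask loss (the MLS idea), while pixels where the teacher is not confident or the student is already confident receive low or zero affinity and are skipped (the CBPF idea).

```python
import math
from typing import List


def pixel_affinity(teacher_conf: List[float], student_prob: List[float]) -> List[float]:
    """Hypothetical affinity: large where the teacher is confident AND the student is uncertain.

    Student uncertainty is modelled as 1 - |2p - 1|: 1 at p = 0.5, 0 at p in {0, 1}.
    """
    return [phi * (1.0 - abs(2.0 * p - 1.0)) for phi, p in zip(teacher_conf, student_prob)]


def confidence_weighted_mask_loss(
    student_prob: List[float],  # student mask probabilities per pixel
    teacher_mask: List[int],    # teacher pseudo-label for this mask (0/1 per pixel)
    teacher_conf: List[float],  # teacher pixel confidence in [0, 1]
    lam: float,                 # mask-wide teacher confidence (MLS factor)
    num_points: int,            # how many pixels may contribute to the loss
) -> float:
    aff = pixel_affinity(teacher_conf, student_prob)
    # Keep the num_points highest-affinity pixels (a deterministic stand-in for sampling)
    # and drop zero-affinity pixels so that they never contribute a gradient.
    ranked = sorted(range(len(aff)), key=lambda k: aff[k], reverse=True)[:num_points]
    chosen = [k for k in ranked if aff[k] > 0.0]
    if not chosen:
        return 0.0
    eps = 1e-7
    total = 0.0
    for k in chosen:
        p = min(max(student_prob[k], eps), 1.0 - eps)  # clamp to avoid log(0)
        y = teacher_mask[k]
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    # Mask-wide loss scaling: a low-confidence teacher mask contributes less.
    return lam * total / len(chosen)


# Usage: pixel 2 has an unconfident teacher (zero affinity) and pixel 1 an already
# confident student (low affinity), so only pixels 0 and 3 enter the loss.
loss = confidence_weighted_mask_loss(
    student_prob=[0.5, 0.95, 0.5, 0.6],
    teacher_mask=[1, 1, 0, 0],
    teacher_conf=[1.0, 1.0, 0.0, 1.0],
    lam=0.5,
    num_points=2,
)
```

Scaling by `lam` and filtering points act on different granularities: the former dampens an entire doubtful pseudo-mask, the latter removes individual pixels from an otherwise trusted one.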

Paper Structure

This paper contains 23 sections, 7 equations, 13 figures, and 10 tables.

Figures (13)

  • Figure 1: MC-PanDA complements the Mean Teacher for domain-adaptive panoptic segmentation with fine-grained uncertainty quantification. We modulate the dense localization loss of the $i$-th mask, $\mathcal{L}_{m_{i}}$, with the mask-wide confidence $\lambda_i$, and sample it according to the pixel-level affinity $A_i$, which blends teacher confidence $\Phi$ and student uncertainty (cf. Eq. \ref{eq:unc}). Our contributions encourage cautious self-learning from uncertain pseudo-labels. Note that we show detailed loss sampling and weighting only for mask $N$. This procedure is carried out for each student mask that maps to a non-empty teacher mask. For simplicity, the figure omits the source-domain branch and SegMix augmentation.
  • Figure 2: The teacher multiplies the dense assignment masks $\sigma$ with the mask-wide max-softmax in order to recover the dense per-mask confidence $\rho$. Taking the max of $\rho$ along the mask axis and thresholding it with respect to $\tau_2$ reveals a binary map of uncertain pixels (Eq. \ref{eq:unc}). Furthermore, the arg-max of $\rho$ along the mask axis delivers the map of panoptic indices, which we convert to per-mask binary maps through one-hot encoding. Finally, we determine the per-mask teacher confidence $\boldsymbol{\lambda} = \left\{ \lambda_i \right\}$ (Eq. \ref{eq:lambda}) as the relative count of dense per-mask confidences that are greater than $\tau_1$.
  • Figure 3: Qualitative comparison with the state-of-the-art method of Saha et al. (ICCV 2023) on Synthia$\rightarrow$Cityscapes and Synthia$\rightarrow$Vistas. Dashed polygons indicate regions where our method prevails.
  • Figure 4: Panoptic predictions in different training iterations. Top row presents our consistency baseline, while the bottom row presents our approach. Best viewed zoomed in.
  • Figure 5: We visualize the masks corresponding to road and traffic signs, and their MLS factors at three checkpoints. There is a strong correlation between the MLS factor $\lambda$ and visual quality.
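
The teacher-side confidence computation described in the Figure 2 caption can be sketched as follows. This is an illustrative reading of the caption, not the released code: masks are plain Python lists over flattened pixels, and we assume that $\lambda_i$ is the fraction of pixels assigned to mask $i$ (after the arg-max and one-hot step) whose confidence $\rho_i$ exceeds $\tau_1$. The function name `teacher_confidence` and the threshold values in the usage example are hypothetical.

```python
from typing import List, Tuple


def teacher_confidence(
    sigma: List[List[float]],        # [masks][pixels] dense assignment probabilities
    class_probs: List[List[float]],  # [masks][classes] softmax over classes per mask
    tau1: float,                     # threshold used for the mask-wide confidence lambda
    tau2: float,                     # threshold below which a pixel is marked uncertain
) -> Tuple[List[bool], List[List[int]], List[float]]:
    n_masks, n_pix = len(sigma), len(sigma[0])
    # Dense per-mask confidence: rho_i(p) = sigma_i(p) * max_c P_i(c).
    max_softmax = [max(cp) for cp in class_probs]
    rho = [[sigma[i][p] * max_softmax[i] for p in range(n_pix)] for i in range(n_masks)]

    uncertain: List[bool] = []
    panoptic_index: List[int] = []
    for p in range(n_pix):
        best = max(range(n_masks), key=lambda i: rho[i][p])  # arg-max along the mask axis
        panoptic_index.append(best)
        uncertain.append(rho[best][p] < tau2)                 # max along the mask axis vs tau2

    # One-hot encoding of the panoptic indices gives per-mask binary maps.
    one_hot = [[int(panoptic_index[p] == i) for p in range(n_pix)] for i in range(n_masks)]

    lam: List[float] = []
    for i in range(n_masks):
        area = sum(one_hot[i])
        if area == 0:
            lam.append(0.0)  # empty teacher mask: no pixels to be confident about
            continue
        confident = sum(1 for p in range(n_pix) if one_hot[i][p] and rho[i][p] > tau1)
        lam.append(confident / area)
    return uncertain, one_hot, lam


# Usage: two masks over four pixels.
uncertain, one_hot, lam = teacher_confidence(
    sigma=[[0.9, 0.8, 0.1, 0.2], [0.1, 0.3, 0.9, 0.4]],
    class_probs=[[0.9, 0.1], [0.6, 0.4]],
    tau1=0.5,
    tau2=0.3,
)
```

In this toy case the second mask receives $\lambda = 0.5$ because only one of its two pixels exceeds $\tau_1$, and that remaining pixel is also flagged as uncertain, since its best confidence falls below $\tau_2$.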