Masked Gamma-SSL: Learning Uncertainty Estimation via Masked Image Modeling

David S. W. Williams, Matthew Gadd, Paul Newman, Daniele De Martini

TL;DR

The paper addresses the need for reliable run-time uncertainty estimates in semantic segmentation under distributional shift, motivated by safety-critical robotics. It introduces a three-stage framework: pretraining to obtain general representations, task-specific supervised learning on a source domain, and uncertainty training on unlabelled target-domain data. The uncertainty training builds on Masked Image Modeling with a masking-based consistency objective, implemented via the masked consistency loss $L_c$ and the binary uncertainty mask $M_\gamma^{\phi}$. A core contribution is the Mask-d2 model, which uses unlabelled data to produce high-quality uncertainty estimates in a single forward pass; it outperforms OoD and uncertainty estimation baselines on the SAX target domains while generalising to unseen domains such as WildDash. Single-pass estimation keeps run-time latency low for safety-critical perception while providing calibrated uncertainty, enabling safer actuation and interaction with potentially unseen scenes.

Abstract

This work proposes a semantic segmentation network that produces high-quality uncertainty estimates in a single forward pass. We exploit general representations from foundation models and unlabelled datasets through a Masked Image Modeling (MIM) approach, which is robust to augmentation hyper-parameters and simpler than previous techniques. For neural networks used in safety-critical applications, bias in the training data can lead to errors; therefore it is crucial to understand a network's limitations at run time and act accordingly. To this end, we test our proposed method on a number of test domains including the SAX Segmentation benchmark, which includes labelled test data from dense urban, rural and off-road driving domains. The proposed method consistently outperforms uncertainty estimation and Out-of-Distribution (OoD) techniques on this difficult benchmark.
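
As a rough illustration of the masked consistency objective named in the TL;DR and detailed in the Figure 2 caption, the following NumPy sketch computes a per-pixel soft consistency term and averages it only over pixels the trained network considers certain. Several details are assumptions made for illustration and are not confirmed by the text: certainty is taken to be the maximum softmax probability, the threshold $\gamma$ is chosen so that the fraction of certain pixels matches the fraction of hard-consistent pixels, and the function names `softmax` and `masked_consistency_loss` are hypothetical. It is a sketch of the idea, not the authors' implementation.

```python
import numpy as np


def softmax(logits, axis=-1):
    """Numerically stable softmax over the class axis."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def masked_consistency_loss(logits_frozen, logits_masked):
    """Sketch of a masked consistency loss.

    logits_frozen: (num_pixels, num_classes) output of the frozen network
                   on the unmasked image (s_theta in Figure 2).
    logits_masked: (num_pixels, num_classes) output of the trained network
                   on the masked image (s_phi^m in Figure 2).
    Returns (loss, certain_mask, gamma).
    """
    p_theta = softmax(logits_frozen)  # target distribution, treated as constant
    p_phi = softmax(logits_masked)

    # (1) Soft consistency: per-pixel cross-entropy between the two outputs.
    soft = -(p_theta * np.log(p_phi + 1e-12)).sum(axis=-1)

    # (3) Hard consistency: do the two networks predict the same class?
    consistent = p_theta.argmax(axis=-1) == p_phi.argmax(axis=-1)

    # ASSUMPTION: certainty is the max softmax probability, and gamma is set
    # so that the share of certain pixels matches the share of consistent ones.
    certainty = p_phi.max(axis=-1)
    gamma = np.quantile(certainty, 1.0 - consistent.mean())

    # (2) Binary uncertainty mask: keep only pixels deemed certain.
    certain = certainty >= gamma

    # (4) Final loss: mean soft consistency over the certain pixels.
    loss = soft[certain].mean() if certain.any() else 0.0
    return float(loss), certain, float(gamma)


# Usage with random logits standing in for real network outputs.
rng = np.random.default_rng(0)
frozen = rng.normal(size=(100, 5))
masked = frozen + 0.5 * rng.normal(size=(100, 5))
loss, certain, gamma = masked_consistency_loss(frozen, masked)
```

Restricting the average to certain pixels means the consistency term is only enforced where the trained network claims confidence, which leaves it free to remain uncertain where the masked and unmasked predictions disagree.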

Paper Structure (29 sections, 4 equations, 4 figures, 2 tables)

This paper contains 29 sections, 4 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Our proposed method jointly performs high-quality semantic segmentation (top) and pixel-wise uncertainty estimation (bottom) on an image (top, top left) from the SAX London test dataset. Note that a dumpster (an undefined semantic class) is segmented inaccurately, but our network is correspondingly uncertain there (bottom, yellow and red), while the rest of the segmentation is accurate and certain (bottom, blue). See the appendix for more qualitative results.
  • Figure 2: Overview of our uncertainty training framework. It involves a frozen network $\mathtt{f}_{\theta}$ (blue) and the network being trained, $\mathtt{f}_{\phi}$ (purple). $\mathtt{f}_{\theta}$ has been trained to perform semantic segmentation with images from a labelled domain. Using unlabelled images from a different domain, regions of likely segmentation error are found by comparing the masked segmentation $s_{\phi}^{m}$ (masking denoted by $\boxtimes$) with $s_\theta$. Loss $L_c$ refines the uncertainty estimates of $\mathtt{f}_{\phi}$ by minimising the soft consistency $\mathcal{H}$ ①, but only for regions where $\mathtt{f}_{\phi}$ is certain; this is implemented by an uncertainty masking procedure using the binary mask $M_\gamma^{\phi}$ ② (denoted by $\boxplus$). The threshold for $M_\gamma^{\phi}$ is calculated using a hard consistency mask ③, giving the final loss ④.
  • Figure 3: In these plots, we measure misclassification detection performance using $\mathrm{F_{0.5}}$ scores plotted against the proportion of pixels that are both certain and accurate, $\mathrm{p(a,c)}$. The baselines are trained only with labelled Cityscapes data, while our proposed model, $\texttt{Mask-d2}$, leverages unlabelled images from the domain in which testing is occurring. All models perform uncertainty estimation similarly well on Cityscapes; however, when tested on the distributionally-shifted target domains, $\texttt{Mask-d2}$ outperforms the baselines. The gap in $\mathrm{MaxF_{0.5}}$ score between $\texttt{Mask-d2}$ and both $\texttt{MaxS-d2}$ and $\texttt{MaxS-d2-E}^{*}$ illustrates the benefit of our proposed uncertainty training.
  • Figure 4: Qualitative results for the proposed Mask-d2 model, presenting (left) an RGB image, (middle) the semantic segmentation and (right) the estimated uncertainty in the $\mathtt{jet}$ colour map, where red is uncertain and blue is certain. For each distributionally-shifted image, the incorrect segmentations are effectively detected by the model's estimated uncertainty. For each image, the Mask-d2 variant used is the one trained on the same domain as that test image (see caption).
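
The misclassification detection metric referred to in the Figure 3 caption can be illustrated with a minimal sketch. It assumes that pixels which are both certain and accurate count as true positives (suggested by the $\mathrm{p(a,c)}$ axis, but not stated in the caption), that a pixel is "certain" when its uncertainty falls below a threshold, and that $\mathrm{MaxF_{0.5}}$ is the best $\mathrm{F_{0.5}}$ over a sweep of thresholds. The function names `f_beta_and_pac` and `max_f_beta` are hypothetical.

```python
import numpy as np


def f_beta_and_pac(uncertainty, correct, threshold, beta=0.5):
    """Return (F_beta, p(a,c)) for one uncertainty threshold.

    uncertainty: per-pixel uncertainty scores (higher = less certain).
    correct:     per-pixel flags, True where the segmentation is accurate.
    ASSUMPTION:  certain-and-accurate pixels are the true positives.
    """
    uncertainty = np.asarray(uncertainty, dtype=float)
    correct = np.asarray(correct).astype(bool)
    certain = uncertainty < threshold

    tp = np.sum(certain & correct)    # certain and accurate
    fp = np.sum(certain & ~correct)   # certain but inaccurate
    fn = np.sum(~certain & correct)   # uncertain but accurate

    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0

    # F_beta = (1 + beta^2) * P * R / (beta^2 * P + R); beta < 1 favours precision.
    denom = beta**2 * precision + recall
    f_beta = (1 + beta**2) * precision * recall / denom if denom > 0 else 0.0

    p_ac = tp / correct.size          # proportion of certain-and-accurate pixels
    return float(f_beta), float(p_ac)


def max_f_beta(uncertainty, correct, thresholds, beta=0.5):
    """Best F_beta over a sweep of uncertainty thresholds."""
    return max(f_beta_and_pac(uncertainty, correct, t, beta)[0] for t in thresholds)


# Toy example: six pixels with their uncertainties and correctness flags.
unc = [0.1, 0.2, 0.8, 0.9, 0.3, 0.7]
acc = [1, 1, 0, 0, 0, 1]
f_half, p_ac = f_beta_and_pac(unc, acc, threshold=0.5)
best = max_f_beta(unc, acc, thresholds=[0.25, 0.5, 0.75])
```

With $\beta = 0.5$ the score weights precision more heavily than recall, so under the assumption above it penalises pixels that are certain but wrong more than pixels that are accurate but flagged as uncertain.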