Table of Contents
Fetching ...

UMAD: Unsupervised Mask-Level Anomaly Detection for Autonomous Driving

Daniel Bogdoll, Noël Ollick, Tim Joseph, Svetlana Pavlitska, J. Marius Zöllner

TL;DR

This work presents UMAD, the first fully unsupervised mask-level anomaly detection method for autonomous driving, combining a multimodal world model (MUVO) with unsupervised image segmentation (U2Seg) to detect anomalies without exposure to outliers. The approach computes pixel- and mask-level anomaly scores using diverse difference maps, including visual, perceptual, and temporal cues, and refines these scores through segmentation masks. On the AnoVox benchmark, UMAD demonstrates substantial improvements over a state-of-the-art unsupervised baseline, achieving a notable reduction in false positives at high true-positive rates and establishing a new baseline for unsupervised anomaly detection in driving scenarios. The work also conducts extensive ablations, highlighting the value of mask-based refinement and the impact of segmentation choices, while acknowledging limitations related to reconstruction quality and domain shifts in unsupervised segmentation.

Abstract

Dealing with atypical traffic scenarios remains a challenging task in autonomous driving. However, most anomaly detection approaches cannot be trained on raw sensor data but require exposure to outlier data and powerful semantic segmentation models trained in a supervised fashion. This limits the representation of normality to labeled data, which does not scale well. In this work, we revisit unsupervised anomaly detection and present UMAD, leveraging generative world models and unsupervised image segmentation. Our method outperforms state-of-the-art unsupervised anomaly detection.

UMAD: Unsupervised Mask-Level Anomaly Detection for Autonomous Driving

TL;DR

This work presents UMAD, the first fully unsupervised mask-level anomaly detection method for autonomous driving, combining a multimodal world model (MUVO) with unsupervised image segmentation (U2Seg) to detect anomalies without exposure to outliers. The approach computes pixel- and mask-level anomaly scores using diverse difference maps, including visual, perceptual, and temporal cues, and refines these scores through segmentation masks. On the AnoVox benchmark, UMAD demonstrates substantial improvements over a state-of-the-art unsupervised baseline, achieving a notable reduction in false positives at high true-positive rates and establishing a new baseline for unsupervised anomaly detection in driving scenarios. The work also conducts extensive ablations, highlighting the value of mask-based refinement and the impact of segmentation choices, while acknowledging limitations related to reconstruction quality and domain shifts in unsupervised segmentation.

Abstract

Dealing with atypical traffic scenarios remains a challenging task in autonomous driving. However, most anomaly detection approaches cannot be trained on raw sensor data but require exposure to outlier data and powerful semantic segmentation models trained in a supervised fashion. This limits the representation of normality to labeled data, which does not scale well. In this work, we revisit unsupervised anomaly detection and present UMAD, leveraging generative world models and unsupervised image segmentation. Our method outperforms state-of-the-art unsupervised anomaly detection.
Paper Structure (6 sections, 5 equations, 2 figures, 2 tables)

This paper contains 6 sections, 5 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Overview of UMAD. First, multimodal sensor data is fed into a world model to reconstruct and predict frames, and semantic masks are derived from camera data. Visual Differences are used to compare a reconstruction of the current observation to the accompanying sensor data frame. The Temporal Difference, on the contrary, solely compares multiple future predictions from the world model. The pixel-wise scores are then fused and the resulting anomaly map is refined based on the generated masks.
  • Figure 2: Exemplary Detections. The first columns show the input image and the corresponding ground truth. MUVO reconstructions are utilized to generate difference maps, which are finally refined to mask-level maps. Masks are generated by the unsupervised segmentation model U2Seg. The first two rows show positive cases, while the third row shows a failure case.