Language-Guided Open-World Anomaly Segmentation

Klara Reichard, Nikolas Brasch, Nassir Navab, Federico Tombari

TL;DR

Clipomaly introduces a zero-shot, CLIP-based framework for open-world anomaly segmentation in autonomous driving. It segments known objects and assigns semantically meaningful labels to unknown regions at the same time, without anomaly-specific training. The method predicts unknown regions from dense CLIP embeddings, generates candidate labels through RAM or dictionary preselection, and matches regions to labels with region-aware CLIP scoring. It then performs open-vocabulary segmentation with the extended vocabulary. Clipomaly achieves state-of-the-art anomaly segmentation on benchmarks such as RoadAnomaly and SMIYC AnomalyTrack while preserving accuracy on known classes and providing an interpretable vocabulary that can be extended dynamically at inference. For deployment, this avoids continual retraining and yields human-readable anomaly names that can aid downstream planning and control.
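The four steps in the summary above (flag unknown pixels, pool them into a region, match the region to a candidate label, re-segment with the extended vocabulary) can be illustrated with a toy sketch. This is not the authors' implementation. The hand-made 3-D vectors stand in for CLIP image and text embeddings, and the function names, the 0.8 similarity threshold, and the mean pooling of the region are all assumptions made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def segment(pixel_embs, vocab):
    """Give each pixel the vocabulary entry with the highest similarity."""
    result = []
    for emb in pixel_embs:
        scored = [(cosine(emb, text_emb), name) for name, text_emb in vocab.items()]
        best_score, best_name = max(scored)
        result.append((best_name, best_score))
    return result

def name_unknown_region(pixel_embs, known, candidates, threshold=0.8):
    """Hypothetical helper: name one unknown region and re-segment."""
    # Step 1: a pixel is "unknown" if no known class matches it well.
    first_pass = segment(pixel_embs, known)
    unknown = [i for i, (_, score) in enumerate(first_pass) if score < threshold]
    if not unknown:
        return None, [name for name, _ in first_pass]
    # Step 2: mean-pool the unknown pixels into one region embedding (assumed).
    dim = len(pixel_embs[0])
    region = [sum(pixel_embs[i][d] for i in unknown) / len(unknown) for d in range(dim)]
    # Step 3: pick the candidate label closest to the region embedding.
    best = max(candidates, key=lambda name: cosine(region, candidates[name]))
    # Step 4: extend the vocabulary and segment again.
    extended = {**known, best: candidates[best]}
    return best, [name for name, _ in segment(pixel_embs, extended)]

# Toy stand-ins for text embeddings of known classes and candidate labels.
KNOWN = {"road": [1.0, 0.0, 0.0], "car": [0.0, 1.0, 0.0]}
CANDIDATES = {"deer": [0.0, 0.0, 1.0], "traffic cone": [1.0, 0.0, 1.0]}
# Toy stand-ins for dense per-pixel image embeddings.
PIXELS = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.1, 0.1, 0.9], [0.0, 0.2, 0.9]]

anomaly_name, labels = name_unknown_region(PIXELS, KNOWN, CANDIDATES)
print(anomaly_name, labels)
```

In this toy setup the last two pixels match neither "road" nor "car", so they are pooled and matched to "deer", which is then added to the vocabulary for the second segmentation pass. The known-class pixels keep their labels because the new entry scores lower for them than their original class.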

Abstract

Open-world and anomaly segmentation methods seek to enable autonomous driving systems to detect and segment both known and unknown objects in real-world scenes. However, existing methods do not assign semantically meaningful labels to unknown regions, and distinguishing and learning representations for unknown classes remains difficult. While open-vocabulary segmentation methods show promise in generalizing to novel classes, they require a fixed inference vocabulary and thus cannot be directly applied to anomaly segmentation where unknown classes are unconstrained. We propose Clipomaly, the first CLIP-based open-world and anomaly segmentation method for autonomous driving. Our zero-shot approach requires no anomaly-specific training data and leverages CLIP's shared image-text embedding space to both segment unknown objects and assign human-interpretable names to them. Unlike open-vocabulary methods, our model dynamically extends its vocabulary at inference time without retraining, enabling robust detection and naming of anomalies beyond common class definitions such as those in Cityscapes. Clipomaly achieves state-of-the-art performance on established anomaly segmentation benchmarks while providing interpretability and flexibility essential for practical deployment.

Paper Structure

This paper contains 36 sections, 16 equations, 8 figures, 12 tables.

Figures (8)

  • Figure 1: Teaser: Our method produces accurate known-class and detailed unknown-class segmentations with meaningful semantic labels.
  • Figure 2: Overview of our method, Clipomaly. For clarity, this figure shows only our dictionary-based method together with CLIP-Best Region Matching.
  • Figure 3: Qualitative segmentation results of Ours-RAM on the SMIYC AnomalyTrack dataset using our extended vocabulary method. The top row shows the predicted segmentations, while the bottom row shows the corresponding input images.
  • Figure 4: Qualitative comparison of anomaly segmentation results for our method, Clipomaly. For the columns that show our method, the generated anomaly names are shown below the image.
  • Figure 5: Teaser: Our method produces accurate known-class and detailed unknown-class segmentations with meaningful semantic labels.
  • ...and 3 more figures