Leveraging Text-Driven Semantic Variation for Robust OOD Segmentation

Seungheon Song, Jaekoo Lee

TL;DR

This work addresses the challenge of robust OOD segmentation in autonomous driving by leveraging vision-language knowledge. It introduces a Text-Driven OOD Segmentation framework that fuses a CLIP-based vision-language encoder with Mask2Former, guided by textual queries of both ID and OOD concepts. The method comprises Distance-Based OOD Prompts—grouped by semantic distance from ID classes and paired with learnable prompts—and OOD Semantic Augmentation via Semantically Augmented Attention to diversify OOD representations. Backbone regularization losses preserve vision-language alignment during training. Empirical results on Fishyscapes, SMIYC, and Road Anomaly show state-of-the-art performance at both pixel- and object-level evaluations, illustrating strong generalization to unseen anomalies and the practical potential for safer autonomous driving systems.

Abstract

In autonomous driving and robotics, ensuring road safety and reliable decision-making critically depends on out-of-distribution (OOD) segmentation. While numerous methods have been proposed to detect anomalous objects on the road, leveraging the vision-language space, which provides rich linguistic knowledge, remains underexplored. We hypothesize that incorporating these linguistic cues can be especially beneficial in the complex contexts found in real-world autonomous driving scenarios. To this end, we present a novel approach that trains a Text-Driven OOD Segmentation model to learn a semantically diverse set of objects in the vision-language space. Concretely, our approach combines a vision-language model's encoder with a transformer decoder, employs Distance-Based OOD prompts located at varying semantic distances from in-distribution (ID) classes, and utilizes OOD Semantic Augmentation for OOD representations. By aligning visual and textual information, our approach effectively generalizes to unseen objects and provides robust OOD segmentation in diverse driving environments. We conduct extensive experiments on publicly available OOD segmentation datasets, including Fishyscapes, Segment-Me-If-You-Can, and Road Anomaly, demonstrating that our approach achieves state-of-the-art performance across both pixel-level and object-level evaluations. This result underscores the potential of vision-language-based OOD segmentation to bolster the safety and reliability of future autonomous driving systems.
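The idea of OOD prompts "located at varying semantic distances from ID classes" can be illustrated with a small sketch. This is not the authors' implementation: the three-dimensional vectors stand in for text embeddings, and the function names, the cosine-distance measure, and the thresholds are all assumptions chosen for illustration. The sketch bins each candidate concept by its distance to the nearest ID class.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def group_by_distance(id_embeddings, candidate_embeddings, thresholds):
    """Bin each candidate by its distance to the nearest ID class.

    thresholds must be sorted ascending; bin 0 holds the candidates
    closest to some ID class, the last bin holds the farthest ones.
    """
    groups = [[] for _ in range(len(thresholds) + 1)]
    for name, vec in candidate_embeddings.items():
        nearest = min(
            cosine_distance(vec, id_vec) for id_vec in id_embeddings.values()
        )
        # Count how many thresholds the distance reaches or exceeds.
        bin_index = sum(nearest >= t for t in thresholds)
        groups[bin_index].append(name)
    return groups


# Toy vectors (hypothetical; a real system would use a text encoder).
ID_CLASSES = {"car": [1.0, 0.0, 0.0], "road": [0.0, 1.0, 0.0]}
CANDIDATES = {
    "truck": [0.9, 0.1, 0.0],
    "deer": [0.4, 0.4, 0.8],
    "sofa": [0.0, 0.0, 1.0],
}

groups = group_by_distance(ID_CLASSES, CANDIDATES, thresholds=[0.2, 0.8])
print(groups)
```

In this toy setup, "truck" lands in the nearest bin, "deer" in the middle bin, and "sofa" in the farthest bin. Each bin could then be paired with its own learnable prompt, as the TL;DR describes.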

Paper Structure

This paper contains 17 sections, 11 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Overview of limitations in existing OOD segmentation approaches and advantages of the proposed approach. (a) Existing methods often rely solely on visual information (vision-only), whereas our vision-language approach incorporates textual cues in addition to images. (b) By leveraging semantic information from text, our method learns clearer decision boundaries in the joint ID and OOD feature space. (c) Unlike uncertainty- or generation-based methods that use only visual cues, our approach leverages textual knowledge to achieve more reliable OOD scoring.
  • Figure 2: Architectural overview of the proposed method.
  • Figure 3: Overview of strategies for improving Text-Driven OOD Segmentation: (a) Distance-Based OOD Prompts: Learns OOD prompts placed at various semantic distances from each ID label, thereby enhancing the model’s ability to handle diverse unknown categories. (b) Vision Regularization: Preserves the pretrained knowledge of the image encoder by minimizing deviations from its original vision-language alignment. (c) Vision-Language Regularization: Extends pixel-level vision-language knowledge in the VLM, enabling more comprehensive semantic understanding for improved OOD detection.
  • Figure 4: An overview of our proposed Semantically Augmented Attention $A_{\text{SAA}}$ mechanism.
  • Figure 5: Comparison of OOD segmentation visualization results. The input image (with the OOD object highlighted in a red box) and its corresponding segmentation outputs illustrate that our method not only provides more refined OOD predictions but also exhibits fewer false positives and false negatives than recent alternative methods.
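
The Vision Regularization idea in Figure 3(b), keeping the fine-tuned image encoder close to its pretrained behaviour, can be sketched as a penalty on feature deviation. The exact loss used in the paper is not given in this summary, so the mean-squared form below, the function name, and the toy feature values are assumptions made purely for illustration.

```python
def feature_deviation_penalty(tuned, frozen):
    """Mean squared difference between tuned and frozen feature vectors.

    tuned and frozen are equally shaped lists of feature vectors, e.g.
    one vector per image patch. A value of 0.0 means the fine-tuned
    encoder reproduces the pretrained features exactly.
    """
    total = 0.0
    count = 0
    for tuned_vec, frozen_vec in zip(tuned, frozen):
        for t, f in zip(tuned_vec, frozen_vec):
            total += (t - f) ** 2
            count += 1
    return total / count


# Hypothetical features from a frozen copy and a fine-tuned copy
# of the same encoder, evaluated on the same input.
frozen_features = [[1.0, 0.0], [0.0, 1.0]]
tuned_features = [[1.0, 0.0], [0.0, 3.0]]

penalty = feature_deviation_penalty(tuned_features, frozen_features)
print(penalty)  # 1.0
```

Added to the segmentation objective with a weighting coefficient, a term of this kind discourages the encoder from drifting away from the pretrained vision-language alignment during training.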