Leveraging Text-Driven Semantic Variation for Robust OOD Segmentation
Seungheon Song, Jaekoo Lee
TL;DR
This work addresses robust out-of-distribution (OOD) segmentation in autonomous driving by leveraging vision-language knowledge. It introduces a Text-Driven OOD Segmentation framework that fuses a CLIP-based vision-language encoder with Mask2Former, guided by textual queries for both in-distribution (ID) and OOD concepts. The method has two main components: Distance-Based OOD Prompts, which are grouped by semantic distance from the ID classes and paired with learnable prompts, and OOD Semantic Augmentation, which uses Semantically Augmented Attention to diversify OOD representations. Backbone regularization losses preserve vision-language alignment during training. Empirical results on Fishyscapes, SMIYC, and Road Anomaly show state-of-the-art performance in both pixel-level and object-level evaluations, indicating strong generalization to unseen anomalies and practical potential for safer autonomous driving systems.
Abstract
In autonomous driving and robotics, road safety and reliable decision-making depend critically on out-of-distribution (OOD) segmentation. While numerous methods have been proposed to detect anomalous objects on the road, leveraging the vision-language space, which provides rich linguistic knowledge, remains underexplored. We hypothesize that incorporating these linguistic cues can be especially beneficial in the complex contexts found in real-world autonomous driving scenarios. To this end, we present a novel approach that trains a Text-Driven OOD Segmentation model to learn a semantically diverse set of objects in the vision-language space. Concretely, our approach combines a vision-language model's encoder with a transformer decoder, employs Distance-Based OOD prompts located at varying semantic distances from in-distribution (ID) classes, and utilizes OOD Semantic Augmentation to diversify OOD representations. By aligning visual and textual information, our approach effectively generalizes to unseen objects and provides robust OOD segmentation in diverse driving environments. We conduct extensive experiments on publicly available OOD segmentation datasets, including Fishyscapes, Segment-Me-If-You-Can, and Road Anomaly, demonstrating that our approach achieves state-of-the-art performance across both pixel-level and object-level evaluations. This result underscores the potential of vision-language-based OOD segmentation to bolster the safety and reliability of future autonomous driving systems.
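The idea of placing OOD prompts at varying semantic distances from the ID classes can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the distance measure or the grouping rule, so the sketch assumes cosine distance to the nearest ID class and fixed thresholds. The function names, the thresholds, the concept names, and the three-dimensional toy vectors (standing in for text embeddings from a vision-language encoder) are all hypothetical.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def group_by_distance(id_embeddings, candidate_embeddings, thresholds):
    """Bin candidate OOD concepts by distance to their nearest ID class.

    thresholds must be sorted in ascending order. Group 0 holds the
    candidates closest to some ID class; the last group holds the farthest.
    (Assumed grouping rule, chosen for illustration only.)
    """
    groups = [[] for _ in range(len(thresholds) + 1)]
    for name, vec in candidate_embeddings.items():
        # Distance to the nearest ID class embedding.
        nearest = min(
            cosine_distance(vec, id_vec) for id_vec in id_embeddings.values()
        )
        # Count how many thresholds the distance meets or exceeds.
        index = sum(nearest >= t for t in thresholds)
        groups[index].append(name)
    return groups


# Toy vectors standing in for text embeddings (hypothetical values).
id_embeddings = {
    "road": [1.0, 0.0, 0.0],
    "car": [0.0, 1.0, 0.0],
}
candidate_embeddings = {
    "scooter": [0.1, 0.9, 0.0],
    "stroller": [0.3, 0.6, 0.5],
    "giraffe": [0.0, 0.1, 1.0],
}
thresholds = [0.2, 0.6]  # hypothetical cut-offs between near, mid, and far

groups = group_by_distance(id_embeddings, candidate_embeddings, thresholds)
for label, members in zip(["near", "mid", "far"], groups):
    print(label, members)
```

In a real setting, the toy vectors would be replaced by text embeddings of the class names and candidate concepts, and each resulting group could then be paired with its own learnable prompt, as the abstract describes at a high level.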
