Vision-Language Feature Alignment for Road Anomaly Segmentation

Zhuolin He; Jiacheng Tang; Jian Pu; Xiangyang Xue

TL;DR

VL-Anomaly is a vision-language anomaly segmentation framework that incorporates semantic priors from pre-trained Vision-Language Models (VLMs). Its prompt learning-driven alignment module adapts Mask2Former's visual features to CLIP text embeddings of known categories, suppressing spurious anomaly responses in background regions.

Abstract

Safe autonomous systems in complex environments require robust road anomaly segmentation to identify unknown obstacles. However, existing approaches often rely on pixel-level statistics to determine whether a region appears anomalous. This reliance leads to high false-positive rates on semantically normal background regions such as sky or vegetation, and to poor recall of true out-of-distribution (OOD) instances, posing safety risks for robotic perception and decision-making. To address these challenges, we propose VL-Anomaly, a vision-language anomaly segmentation framework that incorporates semantic priors from pre-trained Vision-Language Models (VLMs). Specifically, we design a prompt learning-driven alignment module that adapts Mask2Former's visual features to CLIP text embeddings of known categories, effectively suppressing spurious anomaly responses in background regions. At inference time, we further introduce a multi-source inference strategy that integrates text-guided similarity, CLIP-based image-text similarity, and detector confidence, enabling more reliable anomaly prediction by leveraging complementary information sources. Extensive experiments demonstrate that VL-Anomaly achieves state-of-the-art performance on benchmark datasets including RoadAnomaly, SMIYC, and Fishyscapes. Code is available at https://github.com/NickHezhuolin/VL-aligner-Road-anomaly-segment.
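The abstract names three inference-time sources (detector confidence, text-guided similarity, and CLIP-based image-text similarity) but does not state how they are combined. The sketch below illustrates one plausible fusion, purely as an assumption: each source is min-max normalized, inverted so that low support for known classes means high anomaly, and averaged with weights. The function names, the weights, and the averaging rule are hypothetical and are not taken from the paper or its released code.

```python
from typing import List, Sequence, Tuple


def min_max_normalize(scores: Sequence[float]) -> List[float]:
    """Rescale scores to [0, 1]; a constant input maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse_anomaly_scores(
    detector_conf: Sequence[float],
    text_sim: Sequence[float],
    clip_sim: Sequence[float],
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),
) -> List[float]:
    """Hypothetical fusion rule (an assumption, not the paper's formula).

    Each input holds, per pixel, how strongly one source supports a known
    class. Every source is normalized and inverted into an anomaly cue,
    and the cues are combined by a weighted mean.
    """
    sources = (detector_conf, text_sim, clip_sim)
    n = len(detector_conf)
    if any(len(src) != n for src in sources):
        raise ValueError("all score lists must have the same length")
    # Low known-class support -> high anomaly cue.
    cues = [[1.0 - v for v in min_max_normalize(src)] for src in sources]
    total = sum(weights)
    return [
        sum(w * cue[i] for w, cue in zip(weights, cues)) / total
        for i in range(n)
    ]


# Three pixels: the last has low support from every source.
fused = fuse_anomaly_scores(
    detector_conf=[0.9, 0.8, 0.1],
    text_sim=[0.7, 0.6, 0.2],
    clip_sim=[0.5, 0.5, 0.1],
)
```

In this toy input the third pixel receives the highest fused score because all three sources agree that it matches no known class. A real implementation would operate on image-sized tensors rather than Python lists.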

Paper Structure (27 sections, 10 equations, 5 figures, 5 tables)

Introduction
Related Work
Road Anomaly Segmentation
Vision-Language Models for Semantic Segmentation
Method
Preliminaries
Text Prompt Construction
Prompt Learning-Driven Aligner
Pixel-level alignment
Mask-level alignment
Alignment loss
Multi-source Inference Strategy
Detector confidence
Text-guided similarity
CLIP-based image-text similarity
...and 12 more sections

Figures (5)

Figure 1: Comparison of Anomaly Score Maps. The first row shows the original images, the second row presents anomaly score maps generated by Mask2Anomaly, and the third row illustrates the results of our method. Our approach yields cleaner maps by suppressing false positives on semantically normal background regions such as road surface and vegetation, while more precisely highlighting true anomalies like animals.
Figure 2: Overall architecture of VL-Anomaly. The framework integrates a segmentation backbone with CLIP-based vision–language modules. During training, the Prompt Learning-Driven Aligner (PL-Aligner) first performs pixel-level alignment between the backbone’s visual features and CLIP text embeddings of known categories, and then further establishes mask-level alignment with the decoder’s mask queries. During inference, multi-source scores from the segmentation model outputs, text-guided similarity and CLIP-based image-text similarity are fused to produce robust anomaly segmentation results.
Figure 3: Architecture of PL-Aligner. The first layer aligns pixel-level visual features from the backbone with text embeddings, while the second layer aligns mask queries from the decoder with the pixel-level features from the first layer to achieve mask-level alignment. Standard operations such as normalization and activation functions are omitted for clarity.
Figure 4: Qualitative comparison of anomaly segmentation results on the Road Anomaly dataset [lis2019detecting]. We compare the outlier score maps predicted by our method with those generated by MSP [hendrycks2017a] and Mask2Anomaly [rai2023unmasking], using the same backbone for a fair comparison. For visualization, all scores are normalized to the same range. Our method more effectively suppresses false positives in semantically normal background regions, while competing approaches often yield blurred or spurious activations in these areas.
Figure 5: Visualization of the similarity between image features and the constructed text prompts. The highlighted areas show where the model associates image regions with specific semantic categories, demonstrating the effectiveness of our prompt learning strategy in guiding cross-modal alignment.
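
The similarity that Figure 5 visualizes can be illustrated with a small sketch. It assumes, hypothetically, that each pixel feature and each known-category text embedding lives in the same vector space, that similarity is cosine similarity, and that a pixel's anomaly cue is one minus its best match to any known category. The function names and the toy two-dimensional embeddings are illustrative only; the paper's actual alignment module and scoring are not specified in this summary.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def text_guided_anomaly(
    pixel_features: Sequence[Sequence[float]],
    text_embeddings: Dict[str, Sequence[float]],
) -> List[Tuple[str, float]]:
    """Return (best known category, 1 - max similarity) for each pixel.

    Assumption: a pixel that is dissimilar to every known-category
    embedding is treated as more likely anomalous.
    """
    results = []
    for feat in pixel_features:
        sims = {
            name: cosine_similarity(feat, emb)
            for name, emb in text_embeddings.items()
        }
        best = max(sims, key=sims.get)
        results.append((best, 1.0 - sims[best]))
    return results


# Toy embeddings standing in for CLIP text features of known classes.
known = {"road": [1.0, 0.0], "sky": [0.0, 1.0]}
pixels = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
scores = text_guided_anomaly(pixels, known)
```

Here the first two pixels align exactly with "road" and "sky" and receive an anomaly cue of zero, while the third pixel, which matches neither category well, receives a higher cue.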
