
VL4AD: Vision-Language Models Improve Pixel-wise Anomaly Detection

Liangyu Zhong, Joachim Sicking, Fabian Hüger, Hanno Gottschalk

TL;DR

This work incorporates Vision-Language encoders into existing anomaly detectors to leverage the semantically broad VL pre-training for improved outlier awareness, and introduces a new scoring function that enables data- and training-free outlier supervision via textual prompts.

Abstract

Semantic segmentation networks have achieved significant success under the assumption of independent and identically distributed data. However, these networks often struggle to detect anomalies from unknown semantic classes due to the limited set of visual concepts they are typically trained on. To address this issue, anomaly segmentation often involves fine-tuning on outlier samples, necessitating additional efforts for data collection, labeling, and model retraining. Seeking to avoid this cumbersome work, we take a different approach and propose to incorporate Vision-Language (VL) encoders into existing anomaly detectors to leverage the semantically broad VL pre-training for improved outlier awareness. Additionally, we propose a new scoring function that enables data- and training-free outlier supervision via textual prompts. The resulting VL4AD model, which includes max-logit prompt ensembling and a class-merging strategy, achieves competitive performance on widely used benchmark datasets, thereby demonstrating the potential of vision-language models for pixel-wise anomaly detection.

Paper Structure (28 sections, 4 equations, 9 figures, 9 tables)


Figures (9)

  • Figure 1: Showcasing the favorable ID-OOD data separation of a CLIP [Radford2021ICML] image encoder (right) compared to the backbones of vision-only ResNet50 networks [He2016CVPR] (left, middle). We use t-SNE [Maaten2008JMLR] to visualize the embedding vectors of images from ImageNet-200 [zhang2023arxiv] (orange points) and OOD samples from the NINCO dataset [bitterwolf2023ICML] (light blue points). OOD samples used for fine-tuning the ResNet50 model (middle) are shown as dark green triangles.
  • Figure 2: Our VL4AD approach uses the FC-CLIP architecture [Yu2023NIPS]. It comprises frozen CLIP text and vision encoders paired with a Mask2Former (M2F) decoder. The model accepts visual inputs along with ID and optional OOD class prompts, providing pixel-wise uncertainty scores for anomaly detection.
  • Figure 3: Comparison of VL4AD (bottom, ours) with RbA (middle) on four sample images (top). Challenging OOD cases such as distant cows, airplanes and boat trailers are recognized with a notably cleaner and much more complete appearance (see white ellipses). While both methods successfully detect a flock of sheep as OOD (rightmost column), VL4AD produces far fewer false positives, such as misidentifying the road as an anomaly. Yellow indicates high ID class uncertainty (outliers), whereas blue signifies low ID class uncertainty (ID areas).
  • Figure 4: Comparison of VL4AD predictions without and with OOD prompting for an ID (left), far-OOD (middle) and near-OOD (right) input. We assume a simplified setup with three ID classes (human, car, truck) and two OOD classes (animal, caravan). For an ID input (left), the model correctly predicts the class both without and with OOD prompts. For a far-OOD input (middle), the model also works well in both cases; with OOD prompts, however, it puts significantly less weight on the (wrong) ID classes. For a near-OOD input (right), finally, introducing OOD classes is crucial, because it avoids the erroneous classification of the input as ID (compare panel c with panel f). Note that an input is considered OOD when all its ID class probabilities (negative uncertainties) are below the decision threshold (dashed horizontal line). For further details, see Section \ref{method_ood_prompt}.
  • Figure 5: ID pixel retention rate on CityScapes as a function of OOD recall (RA19/FS LaF). VL4AD achieves a recall of 0.9 on both RA19 and FS LaF while correctly identifying at least $97\%$ of the CityScapes pixels as ID. Additionally, OOD prompting further enhances the ID retention rate on RA19.
  • ...and 4 more figures
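
The decision rule described in the Figure 4 caption can be illustrated with a small sketch. This is not the paper's scoring function, which is not reproduced here; it only assumes a softmax taken jointly over ID and optional OOD class logits, and flags an input as OOD when every ID class probability falls below a threshold. The function names, the logit values and the threshold of 0.5 are hypothetical and chosen purely for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def max_id_probability(id_logits, ood_logits=()):
    """Largest ID class probability when the softmax is taken jointly
    over the ID classes and the (optional) OOD prompt classes."""
    probs = softmax(list(id_logits) + list(ood_logits))
    return max(probs[:len(id_logits)])

def is_ood(id_logits, ood_logits=(), threshold=0.5):
    """Flag the input as OOD when all ID probabilities are below the
    threshold (hypothetical value; the caption only shows a dashed line)."""
    return max_id_probability(id_logits, ood_logits) < threshold

# Hypothetical logits for a near-OOD input (a caravan).
# ID classes: human, car, truck. OOD classes: animal, caravan.
id_logits = [0.0, 2.0, 3.0]   # "truck" looks plausible to the ID-only model
ood_logits = [0.0, 4.0]       # the "caravan" prompt matches best

without_prompts = is_ood(id_logits)             # softmax over ID classes only
with_prompts = is_ood(id_logits, ood_logits)    # OOD prompts absorb probability mass
print(without_prompts, with_prompts)
```

With these made-up numbers, the ID-only softmax assigns most of the probability mass to "truck", so the input is wrongly kept as ID; adding the OOD prompts pulls every ID probability below the threshold, mirroring the near-OOD case the caption describes.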