VisTa: Visual-contextual and Text-augmented Zero-shot Object-level OOD Detection

Bin Zhang, Xiaoyang Qu, Guokuan Li, Jiguang Wan, Jianzong Wang

TL;DR

VisTa tackles zero-shot object-level OOD detection by adapting CLIP through contextual visual prompts and a text-augmented ID embedding space. By preserving contextual cues around object crops and aligning textual prompts with visual emphasis, VisTa computes an uncertainty score from the similarity between object features and an enriched ID space, enabling robust ID/OOD separation without retraining detectors. Empirical results across VOC/BDD-100K ID splits and MSCOCO/OpenImages OOD splits show consistent improvements over prior zero-shot and unsupervised methods, with notable reductions in FPR95 and high AUROC. The approach offers a practical, scalable solution for deploying object detectors in open-world settings where training data access is limited.

Abstract

As object detectors are increasingly deployed as black-box cloud services or pre-trained models with restricted access to the original training data, the challenge of zero-shot object-level out-of-distribution (OOD) detection arises. This task becomes crucial in ensuring the reliability of detectors in open-world settings. While existing methods have demonstrated success in image-level OOD detection using pre-trained vision-language models like CLIP, directly applying such models to object-level OOD detection presents challenges due to the loss of contextual information and reliance on image-level alignment. To tackle these challenges, we introduce a new method that leverages visual prompts and text-augmented in-distribution (ID) space construction to adapt CLIP for zero-shot object-level OOD detection. Our method preserves critical contextual information and improves the ability to differentiate between ID and OOD objects, achieving competitive performance across different benchmarks.

Paper Structure

This paper contains 11 sections, 5 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Framework of (a) the naive CLIP-based method and (b) the proposed VisTa approach. In (a), zero-shot CLIP-based OOD detection is directly adapted for object-level OOD detection using cropping. In (b), we emphasize the two key components of our VisTa method.
  • Figure 2: Overview of VisTa. The ID embedding space $\mathcal{C}$ is built with CLIP’s text encoder and augmented prompts. Image embeddings from contextual visual prompts are compared to $\mathcal{C}$ to compute similarity scores, which are used to calculate uncertainty and distinguish between ID and OOD samples.
  • Figure 3: Impact of the temperature parameter $\tau$.
  • Figure 4: Impact of different visual prompts. We report average results. For each visual prompt, the text-augmented ID space is constructed with the corresponding prompt.
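
The scoring step described in the Figure 2 caption (compare an object embedding to the ID embedding space $\mathcal{C}$, then turn the similarities into an uncertainty) can be illustrated with a minimal sketch. The exact score is not given in this summary, so the code below assumes a common choice: one minus the maximum softmax probability over temperature-scaled cosine similarities, with $\tau$ as the temperature. The function names `cosine` and `uncertainty_score` and the toy 3-dimensional vectors are hypothetical; real embeddings would come from CLIP's image and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def uncertainty_score(obj_emb, id_space, tau=1.0):
    """Assumed score: 1 - max softmax over cosine similarities scaled by 1/tau.

    obj_emb  : embedding of one detected object (stand-in for a CLIP image embedding)
    id_space : list of ID embeddings (stand-in for the space C built from text prompts)
    tau      : softmax temperature; smaller values sharpen the distribution
    """
    sims = [cosine(obj_emb, c) for c in id_space]
    top = max(sims)
    # Subtract the maximum before exponentiating for numerical stability.
    exps = [math.exp((s - top) / tau) for s in sims]
    return 1.0 - max(exps) / sum(exps)


# Toy ID space with two class embeddings (hypothetical values).
id_space = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

id_like = uncertainty_score([1.0, 0.0, 0.0], id_space, tau=0.1)   # matches class 0
ood_like = uncertainty_score([0.0, 0.0, 1.0], id_space, tau=0.1)  # orthogonal to both
```

In this toy setup the object aligned with an ID class gets a score near 0, while the object orthogonal to every ID embedding gets 0.5, the maximum possible with two classes. Thresholding such a score is one way to separate ID from OOD detections; the threshold and the value of $\tau$ would need to be chosen on held-out data.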