VisTa: Visual-contextual and Text-augmented Zero-shot Object-level OOD Detection
Bin Zhang, Xiaoyang Qu, Guokuan Li, Jiguang Wan, Jianzong Wang
TL;DR
VisTa tackles zero-shot object-level out-of-distribution (OOD) detection by adapting CLIP through contextual visual prompts and a text-augmented in-distribution (ID) embedding space. By preserving contextual cues around object crops and aligning textual prompts with visual emphasis, VisTa computes an uncertainty score from the similarity between object features and the enriched ID space, enabling robust ID/OOD separation without retraining detectors. Empirical results across VOC/BDD-100K ID splits and MSCOCO/OpenImages OOD splits show consistent improvements over prior zero-shot and unsupervised methods, with notable reductions in FPR95 and high AUROC. The approach offers a practical, scalable way to deploy object detectors in open-world settings where access to training data is limited.
Abstract
As object detectors are increasingly deployed as black-box cloud services or pre-trained models with restricted access to the original training data, the challenge of zero-shot object-level out-of-distribution (OOD) detection arises. This task is crucial for ensuring the reliability of detectors in open-world settings. While existing methods have demonstrated success in image-level OOD detection using pre-trained vision-language models like CLIP, directly applying such models to object-level OOD detection is difficult because cropping objects discards contextual information and CLIP relies on image-level alignment. To tackle these challenges, we introduce a new method that leverages visual prompts and text-augmented in-distribution (ID) space construction to adapt CLIP for zero-shot object-level OOD detection. Our method preserves critical contextual information and improves the ability to differentiate between ID and OOD objects, achieving competitive performance across different benchmarks.
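
To make the similarity-based uncertainty score concrete, the following is a minimal sketch, not the authors' implementation. The text above does not specify the scoring function, so this sketch swaps in a max-softmax over cosine similarities, a common choice for CLIP-based zero-shot OOD scoring. The function names, the `temperature` value, and the toy vectors standing in for CLIP object features and ID text embeddings are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def uncertainty_score(object_feat, id_text_embs, temperature=0.01):
    """Hypothetical OOD score for one detected object.

    object_feat:  (D,) visual feature of the object (assumed to come from CLIP).
    id_text_embs: (K, D) text embeddings spanning the ID space (assumed).
    Returns a value in [0, 1); higher means more likely OOD.
    """
    f = l2_normalize(np.asarray(object_feat, dtype=np.float64))
    t = l2_normalize(np.asarray(id_text_embs, dtype=np.float64))
    sims = t @ f                        # cosine similarity to each ID embedding
    logits = sims / temperature         # assumed temperature scaling
    logits = logits - logits.max()      # subtract max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return 1.0 - probs.max()            # low peak confidence -> high uncertainty


# Toy stand-ins for real embeddings: three ID classes in a 4-D space.
id_space = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])
id_object = np.array([1.0, 0.05, 0.0, 0.0])   # close to the first ID class
ood_object = np.array([0.0, 0.0, 0.0, 1.0])   # orthogonal to every ID class

id_score = uncertainty_score(id_object, id_space)
ood_score = uncertainty_score(ood_object, id_space)
```

In this toy setup the object aligned with an ID embedding receives a score near zero, while the object orthogonal to the whole ID space receives a uniform softmax and therefore a much higher score. Thresholding the score would then separate ID from OOD detections; how the threshold is chosen is outside what the text above specifies.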
