CLIP-driven Outliers Synthesis for few-shot OOD detection

Hao Sun, Rundong He, Zhongyi Han, Zhicong Lin, Yongshun Gong, Yilong Yin

TL;DR

The paper tackles few-shot out-of-distribution (OOD) detection by using CLIP to synthesize reliable OOD supervision from a small number of in-distribution (ID) images. It introduces CLIP-OS, a three-stage framework that (1) extracts ID-relevant features via patch-context incorporation and adaptive CLIP-surgery-discrepancy masking, (2) generates synthetic OOD data by mixing ID-relevant features across classes, and (3) regularizes the ID/OOD boundary with unknown-aware prompt learning. On CIFAR-10, CIFAR-100, and ImageNet-100 in one- and two-shot settings, the approach outperforms existing zero-shot and few-shot baselines while maintaining ID accuracy. Because it needs no large external OOD dataset, CLIP-OS offers a practical route to safer deployment of vision-language systems when labeled data is scarce.

Abstract

Few-shot OOD detection focuses on recognizing out-of-distribution (OOD) images that belong to classes unseen during training, using only a small number of labeled in-distribution (ID) images. To date, the mainstream strategy builds on large-scale vision-language models such as CLIP. However, these methods overlook a crucial issue: the lack of reliable OOD supervision information, which can lead to biased boundaries between ID and OOD. To tackle this problem, we propose CLIP-driven Outliers Synthesis (CLIP-OS). First, CLIP-OS enhances the perception of patch-level features with a newly proposed patch uniform convolution, and adaptively obtains the proportion of ID-relevant information by employing CLIP-surgery-discrepancy, thus separating ID-relevant from ID-irrelevant information. Next, CLIP-OS synthesizes reliable OOD data by mixing up ID-relevant features from different classes to provide OOD supervision information. Finally, CLIP-OS leverages the synthetic OOD samples through unknown-aware prompt learning to enhance the separability of ID and OOD. Extensive experiments across multiple benchmarks demonstrate that CLIP-OS achieves superior few-shot OOD detection capability.
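
The synthesis step described in the abstract (mixing up ID-relevant features from different classes) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `synthesize_ood_features`, the convex-combination form, and the mixing-coefficient range `low`/`high` are all assumptions made for illustration, and the features are assumed to be already-extracted ID-relevant vectors.

```python
import numpy as np


def synthesize_ood_features(features, labels, num_samples, rng=None, low=0.4, high=0.6):
    """Hypothetical sketch: build synthetic outliers by mixing features of two classes.

    features: (N, D) array of ID-relevant feature vectors (assumed precomputed).
    labels:   (N,) array of class labels.
    low/high: assumed range for the mixing coefficient; not taken from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    if np.unique(labels).size < 2:
        raise ValueError("need at least two classes to mix across classes")

    mixed = np.empty((num_samples, features.shape[1]))
    for k in range(num_samples):
        # Pick an anchor sample, then a partner from a different class.
        i = rng.integers(len(labels))
        others = np.flatnonzero(labels != labels[i])
        j = rng.choice(others)
        # Convex combination of the two feature vectors (assumed mixing rule).
        lam = rng.uniform(low, high)
        mixed[k] = lam * features[i] + (1.0 - lam) * features[j]
    return mixed
```

Keeping the coefficient near the middle of the range is a design choice of this sketch: a mixture dominated by one class would stay close to that class, which makes it a weaker stand-in for an outlier.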

Paper Structure (29 sections, 9 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Few-shot OOD detection. We show training samples from ImageNet in the four-shot setting and eight test samples drawn from the test sets of ImageNet and Textures. The goal is to correctly classify the ID samples while also detecting the OOD samples.
  • Figure 2: The overall framework of our proposed CLIP-driven Outliers Synthesis.
  • Figure 3: Visualization of extracted ID regions.
  • Figure 4: Sensitivity analysis of the parameter $\beta$. We report average AUROC scores on five OOD datasets.
  • Figure 5: Comparison in ID accuracy.