Category-Extensible Out-of-Distribution Detection via Hierarchical Context Descriptions

Kai Liu, Zhihang Fu, Chao Chen, Sheng Jin, Ze Chen, Mingyuan Tao, Rongxin Jiang, Jieping Ye

TL;DR

This work introduces two hierarchical contexts, namely perceptual context and spurious context, to carefully describe the precise category boundary through automatic prompt tuning, and offers new insights on how to efficiently scale up the prompt engineering in vision-language models to recognize thousands of object categories, as well as how to incorporate large language models to boost zero-shot applications.

Abstract

Out-of-distribution (OOD) detection hinges on two factors: generalized feature representation and precise category description. Recently, vision-language models such as CLIP have brought significant advances on both fronts, but constructing precise category descriptions is still in its infancy because unseen categories are absent during training. This work introduces two hierarchical contexts, namely perceptual context and spurious context, to carefully describe the precise category boundary through automatic prompt tuning. Specifically, perceptual contexts perceive the inter-category difference (e.g., cats vs apples) for the current classification task, while spurious contexts further identify spurious OOD samples, i.e., samples that are similar to but do not belong to a given category, for every single category (e.g., cats vs panthers, apples vs peaches). The two contexts hierarchically construct the precise description of a category: a sample is first roughly classified into the predicted category, and then examined more finely to decide whether it is truly an in-distribution (ID) sample or actually OOD. Moreover, the precise descriptions for those categories within the vision-language framework enable a novel application: CATegory-EXtensible OOD detection (CATEX). One can efficiently extend the set of recognizable categories by simply merging the hierarchical contexts learned under different sub-task settings. Extensive experiments demonstrate CATEX's effectiveness, robustness, and category-extensibility. For instance, CATEX consistently surpasses competing methods by a large margin under several protocols on the challenging ImageNet-1K dataset. In addition, we offer new insights on how to efficiently scale up prompt engineering in vision-language models to recognize thousands of object categories, as well as how to incorporate large language models (like GPT-3) to boost zero-shot applications. Code is publicly available at https://github.com/alibaba/catex.
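
The two-stage decision and the context merging described in the abstract can be illustrated with a minimal NumPy sketch. This is not the released CATEX code: the function names `hierarchical_predict` and `merge_contexts`, the toy 5-dimensional embeddings, and the decision rule (a sample counts as ID when it is closer to the perceptual context of its predicted class than to that class's spurious context) are all assumptions made here for illustration. In practice the embeddings would come from frozen CLIP image and text encoders with learned prompt contexts.

```python
import numpy as np


def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def hierarchical_predict(image_feat, perceptual, spurious):
    """Two-stage decision for one image feature (hypothetical helper).

    perceptual, spurious: arrays of shape (num_classes, dim), one embedding
    per category for each context type.
    Returns (predicted class index, is_id flag).
    """
    f = l2_normalize(image_feat)
    p = l2_normalize(perceptual)
    s = l2_normalize(spurious)

    # Stage 1: rough classification with the perceptual contexts.
    sim_p = p @ f
    k = int(np.argmax(sim_p))

    # Stage 2 (assumed rule): compare against the spurious context of
    # the predicted class only; a closer spurious context means OOD.
    sim_s = float(s[k] @ f)
    return k, bool(sim_p[k] > sim_s)


def merge_contexts(*tasks):
    """Extend the category set by stacking (perceptual, spurious) pairs
    that were learned on separate sub-tasks."""
    perceptual = np.concatenate([t[0] for t in tasks], axis=0)
    spurious = np.concatenate([t[1] for t in tasks], axis=0)
    return perceptual, spurious


# Toy sub-task A with classes [cat, apple]; all vectors are made up.
task_a = (
    np.array([[1.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0, 0.0]]),
    np.array([[0.8, 0.6, 0.0, 0.0, 0.0], [0.0, 0.0, 0.8, 0.6, 0.0]]),
)
# Toy sub-task B with a single class [truck].
task_b = (
    np.array([[0.0, 0.0, 0.0, 0.0, 1.0]]),
    np.array([[0.0, 0.0, 0.0, 0.6, 0.8]]),
)

cat_img = np.array([0.95, 0.1, 0.0, 0.0, 0.0])      # close to "cat"
panther_img = np.array([0.6, 0.8, 0.0, 0.0, 0.0])   # cat-like, but spurious
truck_img = np.array([0.0, 0.0, 0.0, 0.1, 0.95])    # close to "truck"

merged = merge_contexts(task_a, task_b)
result_cat = hierarchical_predict(cat_img, *task_a)
result_panther = hierarchical_predict(panther_img, *task_a)
result_truck = hierarchical_predict(truck_img, *merged)
```

With these toy vectors, the cat-like image is assigned to class 0 and accepted as ID, the panther-like image is also routed to class 0 but rejected as OOD, and the truck image becomes recognizable as class 2 only after the two sub-tasks are merged. The point of the sketch is that merging is a plain concatenation: no context learned on sub-task A is modified when sub-task B is added.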

Paper Structure (25 sections, 5 equations, 15 figures, 12 tables)


Figures (15)

  • Figure 1: Method comparison. Compared to previous approaches, our method utilizes the perceptual context to classify different categories under the current ID task (solid lines), and leverages the spurious context to strictly define the category boundaries independent of the current setting (dashed lines). The hierarchical perceptual and spurious contexts jointly describe the precise and universal boundaries for each category (combination of solid and dashed lines).
  • Figure 2: Illustration of our method. Perceptual context perceives a certain ID category, and spurious context explicitly describes a spurious category around this ID category. Random perturbation is applied to the perceptual context for synthesizing outliers to train the non-trivial spurious context. The hierarchical perceptual and spurious contexts jointly describe the precise category boundary.
  • Figure 3: Guiding process.
  • Figure 4: Feature visualization by t-SNE. (a) Previous approaches that fine-tune the encoders may distort the generalized feature space and make unseen OOD samples inseparable; (b) instead, our method freezes the encoders to maintain the discriminability. (c) Compared with traditional prompting methods using a single perceptual context only, (d) our spurious context provides a better metric for unseen OOD detection.
  • Figure 5: Scaling up ImageNet-1K (Deng et al., 2009) to ImageNet-21K (Deng et al., 2009) with category-incremental learning.
  • ...and 10 more figures