TagFog: Textual Anchor Guidance and Fake Outlier Generation for Visual Out-of-Distribution Detection

Jiankang Chen, Tong Zhang, Wei-Shi Zheng, Ruixuan Wang

TL;DR

TagFog is an OOD detection framework that trains a vision encoder with two sources of guidance: fake OOD images generated with a Jigsaw strategy, and textual anchors obtained by encoding ChatGPT descriptions of the ID classes with CLIP. A joint objective aligns image embeddings with the textual anchors and applies SupCon-style constraints across ID and fake OOD samples. TagFog achieves state-of-the-art performance on the CIFAR-10/100 and ImageNet100 benchmarks and remains compatible with post-hoc OOD scorers such as ReAct. Ablations and sensitivity analyses demonstrate robustness and show that textual guidance and fake OOD data are complementary. The approach requires no extra OOD labels.

Abstract

Out-of-distribution (OOD) detection is crucial in many real-world applications. However, intelligent models are often trained solely on in-distribution (ID) data, leading to overconfidence when misclassifying OOD data as ID classes. In this study, we propose a new learning framework which leverages simple Jigsaw-based fake OOD data and rich semantic embeddings ('anchors') from the ChatGPT description of ID knowledge to guide the training of the image encoder. The learning framework can be flexibly combined with existing post-hoc approaches to OOD detection, and extensive empirical evaluations on multiple OOD detection benchmarks demonstrate that a rich textual representation of ID knowledge, together with fake OOD knowledge, helps train a visual encoder for OOD detection. With the learning framework, new state-of-the-art performance was achieved on all the benchmarks. The code is available at https://github.com/Cverchen/TagFog.
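
To make the idea of Jigsaw-based fake OOD data concrete, here is a minimal sketch in plain Python. It cuts an image into a grid of patches and reassembles them in a shuffled order, so local texture is kept while global structure is destroyed. The function name `jigsaw_shuffle`, the `grid` parameter, and the nested-list image format are assumptions made for illustration; the grid sizes and sampling scheme used by the authors are not stated in this section.

```python
import random


def jigsaw_shuffle(image, grid=2, rng=None):
    """Return a patch-shuffled copy of `image` (a list of pixel rows).

    Illustrative only: `grid` and the re-draw rule below are assumptions,
    not settings taken from the paper.
    """
    if grid < 2:
        raise ValueError("grid must be at least 2")
    if rng is None:
        rng = random.Random()

    height, width = len(image), len(image[0])
    if height % grid or width % grid:
        raise ValueError("image size must be divisible by grid")
    ph, pw = height // grid, width // grid

    # Cut the image into grid x grid patches, in row-major order.
    patches = [
        [row[c * pw:(c + 1) * pw] for row in image[r * ph:(r + 1) * ph]]
        for r in range(grid)
        for c in range(grid)
    ]

    # Draw a permutation of patch positions that is not the identity.
    order = list(range(len(patches)))
    while order == sorted(order):
        rng.shuffle(order)

    # Reassemble the patches in the permuted order.
    shuffled = []
    for r in range(grid):
        for y in range(ph):
            row = []
            for c in range(grid):
                row.extend(patches[order[r * grid + c]][y])
            shuffled.append(row)
    return shuffled


# Usage: a 4 x 4 "image" whose pixel values are all distinct.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
fake_ood = jigsaw_shuffle(image, grid=2, rng=random.Random(0))
```

The output contains exactly the same pixels as the input, only rearranged, which is why such samples are cheap to produce from ID data alone.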

Paper Structure

This paper contains 11 sections, 4 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: OOD detection performance of different methods on the CIFAR100 and ImageNet100-I benchmarks.
  • Figure 2: Overview of the proposed learning framework TagFog for OOD detection. Upper part: fake OOD data are generated with the Jigsaw strategy and, together with the ID data, used to train the image encoder $f$ and the classifier head $h$. Lower part: the ChatGPT description of each ID class is fed to the pretrained, frozen CLIP text encoder to obtain a semantic embedding that serves as the anchor for that class. The anchors guide the training of the image encoder through the contrastive losses $\mathcal{L}_{CI}$ and $\mathcal{L}_{SC}$.
  • Figure 3: Ablation study of the text-guided learning on the CIFAR10 and CIFAR100 benchmarks with a ResNet-18 backbone. All values are the average performance on the six OOD datasets. The proposed text-guided learning ('ChatGPT') is better than its two ablated versions.
  • Figure 4: Sensitivity study of the hyper-parameters $\tau$ and $\tau'$, $\lambda_1$ and $\lambda_2$, and the number of fake OOD data. All experiments are on the CIFAR100 benchmark with a ResNet-18 backbone. The dashed line indicates the performance of the best baseline. Last subfigure: the y-axis represents the standard deviation (std) of performance (A and F), and the x-axis represents the five hyper-parameters, where N represents the number of fake OOD data.
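
The Figure 2 caption describes anchors guiding the image encoder through contrastive losses, but this section does not give their exact form. The sketch below shows one generic way such guidance can work, assuming a softmax cross-entropy over temperature-scaled cosine similarities between an image embedding and every class anchor. The function names, the default `tau=0.1`, and the loss form itself are assumptions for illustration, not the paper's definition of $\mathcal{L}_{CI}$ or $\mathcal{L}_{SC}$.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def anchor_contrastive_loss(image_embs, labels, anchors, tau=0.1):
    """Mean cross-entropy of image-to-anchor similarities.

    Hypothetical form: each image embedding is pulled toward the anchor
    of its own class and pushed away from the other anchors.
    `tau` is a temperature; its value here is an arbitrary choice.
    """
    total = 0.0
    for z, y in zip(image_embs, labels):
        logits = [cosine(z, a) / tau for a in anchors]
        # Numerically stable log-sum-exp over all anchors.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[y]
    return total / len(image_embs)


# Usage: two toy class anchors and two image embeddings.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = anchor_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [0, 1], anchors)
swapped = anchor_contrastive_loss([[0.0, 1.0], [1.0, 0.0]], [0, 1], anchors)
```

In this toy setup the loss is small when each embedding points at its own class anchor and large when the embeddings are swapped, which is the behaviour a text-anchor alignment term is meant to encourage. In the framework described above the anchors come from a frozen text encoder, so only the image encoder would receive gradients from such a term.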