FodFoM: Fake Outlier Data by Foundation Models Creates Stronger Visual Out-of-Distribution Detector

Jiankang Chen, Ling Deng, Zhiyong Gan, Wei-Shi Zheng, Ruixuan Wang

TL;DR

FodFoM is a novel OOD detection framework that combines multiple foundation models to generate two types of challenging fake outlier images for classifier training. Image classifiers trained with the help of these fake images can more accurately differentiate real OOD images from ID ones.

Abstract

Out-of-Distribution (OOD) detection is crucial when deploying machine learning models in open-world applications. The core challenge in OOD detection is mitigating the model's overconfidence on OOD data. While recent methods using auxiliary outlier datasets or synthesizing outlier features have shown promising OOD detection performance, they are limited due to costly data collection or simplified assumptions. In this paper, we propose a novel OOD detection framework FodFoM that innovatively combines multiple foundation models to generate two types of challenging fake outlier images for classifier training. The first type is based on BLIP-2's image captioning capability, CLIP's vision-language knowledge, and Stable Diffusion's image generation ability. Jointly utilizing these foundation models constructs fake outlier images which are semantically similar to but different from in-distribution (ID) images. For the second type, GroundingDINO's object detection ability is utilized to help construct pure background images by blurring foreground ID objects in ID images. The proposed framework can be flexibly combined with multiple existing OOD detection methods. Extensive empirical evaluations show that image classifiers with the help of constructed fake images can more accurately differentiate real OOD images from ID ones. New state-of-the-art OOD detection performance is achieved on multiple benchmarks. The code is available at https://github.com/Cverchen/ACMMM2024-FodFoM.
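The second type of fake outlier (background images obtained by blurring foreground ID objects) can be illustrated with a minimal sketch. It is not the authors' implementation: the image is a toy grayscale grid, the bounding boxes stand in for whatever an object detector such as GroundingDINO would return, and the names `blur_boxes`, `Box`, and `radius`, as well as the choice of a simple box blur, are assumptions made for illustration.

```python
from typing import List, Tuple

Image = List[List[float]]          # grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]    # (x0, y0, x1, y1), end-exclusive; hypothetical detector output


def blur_boxes(image: Image, boxes: List[Box], radius: int = 2) -> Image:
    """Return a copy of `image` in which every box region is replaced by a box blur."""
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]  # copy so the input image is not modified
    for x0, y0, x1, y1 in boxes:
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                # Average the original pixels in a (2r+1) x (2r+1) window,
                # clipped at the image borders.
                ys = range(max(0, y - radius), min(height, y + radius + 1))
                xs = range(max(0, x - radius), min(width, x + radius + 1))
                window = [image[j][i] for j in ys for i in xs]
                out[y][x] = sum(window) / len(window)
    return out


# Toy example: a dark 8x8 background with a bright 2x2 "object" in the middle.
img = [[0.0] * 8 for _ in range(8)]
for y in (3, 4):
    for x in (3, 4):
        img[y][x] = 1.0

# Assumed detector output: one box that covers the object with some margin.
background = blur_boxes(img, boxes=[(2, 2, 6, 6)], radius=2)
```

Pixels outside the box are left untouched, so the result keeps the original background while the object's appearance is smeared out. A real pipeline would operate on RGB images and would likely use a stronger blur.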

Paper Structure

This paper contains 29 sections, 6 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: OOD detection performance of different methods on the CIFAR100 and ImageNet100 benchmarks (see the datasets section for benchmark details). $\diamondsuit$: post-hoc approaches. $\triangle$: training-based approaches. Ours belongs to the latter.
  • Figure 2: Illustration of the proposed FodFoM framework. FodFoM generates fake OOD images in two ways. ①: Text description of each image is generated by BLIP-2 and then the slightly augmented description with a class-specific prompt is mapped to the textual semantic space by CLIP's text encoder. The fake OOD text embeddings are then constructed using the proposed method and finally sent to Stable Diffusion to generate challenging fake OOD images. ②: The semantic regions of ID objects (detected by GroundingDINO) are blurred to obtain background images as fake OOD images. During training (③), fake OOD images are labeled with an additional class different from ID classes.
  • Figure 3: Generation of fake OOD text embeddings. Left: Construction of fake OOD embeddings in the demonstrative textual semantic space. Right: Distribution (blue) of cosine similarities between the mean text embedding of the class "Speedboat" and text embeddings ("true") of ID images from this class in ImageNet100, and the distribution (yellow) between the mean text embedding and the challenging fake OOD embeddings ("fake") for this class.
  • Figure 4: Sensitivity study of hyper-parameters $\lambda$, $\tau$, and $\alpha$. Dashed line: performance of the best baseline. Backbone: ResNet18.
  • Figure 5: Demonstrative fake OOD images generated from Stable Diffusion. ID images are from ImageNet100.
  • ...and 2 more figures
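The caption of Figure 2 notes that, during training, fake OOD images are labeled with an additional class different from the ID classes. A minimal sketch of that labeling scheme follows, assuming a classifier with K + 1 output logits trained with standard cross-entropy. The constant and function names (`NUM_ID_CLASSES`, `FAKE_OOD_LABEL`, `make_labels`) are hypothetical, and the paper's actual loss and its weighting may differ.

```python
import math
from typing import List

NUM_ID_CLASSES = 100                 # e.g. CIFAR100 or ImageNet100
FAKE_OOD_LABEL = NUM_ID_CLASSES      # 0-based index of the additional (K+1)-th class


def make_labels(id_labels: List[int], num_fake: int) -> List[int]:
    """Labels for a batch of ID images followed by `num_fake` fake OOD images."""
    return list(id_labels) + [FAKE_OOD_LABEL] * num_fake


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: List[float], label: int) -> float:
    """Cross-entropy loss for one sample with K + 1 logits."""
    return -math.log(softmax(logits)[label])


# Two ID images (classes 3 and 17) plus two fake OOD images in one batch.
labels = make_labels([3, 17], num_fake=2)

# Logits that strongly favor the additional class: the loss is small when the
# target is the fake-OOD label and large when the target is an ID class.
logits = [0.0] * (NUM_ID_CLASSES + 1)
logits[FAKE_OOD_LABEL] = 5.0
loss_fake = cross_entropy(logits, FAKE_OOD_LABEL)
loss_id = cross_entropy(logits, 3)
```

Treating all fake outliers as one extra class keeps the training loop identical to ordinary classification; only the label vector and the size of the output layer change.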