Can OOD Object Detectors Learn from Foundation Models?

Jiahui Liu; Xin Wen; Shizhen Zhao; Yingxian Chen; Xiaojuan Qi

TL;DR

This work tackles out-of-distribution (OOD) object detection under limited access to open-set data by distilling open-world knowledge from foundation models. It introduces SyncOOD, a fully automatic pipeline that imagines semantic-novel yet visually similar concepts for ID objects using an LLM, edits scene regions with Stable Diffusion, and refines annotations with SAM to produce high-quality OOD data. A lightweight OOD head is trained on pseudo-OOD samples selected for high visual similarity to ID counterparts, optimizing the ID/OOD decision boundary with minimal synthetic data. Across Pascal-VOC, BDD-100K, MS-COCO, and OpenImages benchmarks, SyncOOD achieves state-of-the-art $\text{FPR}_{95}$ and AUROC, with ablations showing the critical roles of scene-level editing, annotation quality, and context consistency for effective open-world detection.
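The two reported metrics can be made concrete with a small sketch. It assumes each detected object carries a scalar score where higher means more ID-like; the function names and the toy scores are illustrative and do not come from the paper. FPR95 is the fraction of OOD objects still accepted as ID at the threshold that retains 95% of ID objects (lower is better), and AUROC is the probability that a randomly chosen ID object outscores a randomly chosen OOD object (higher is better).

```python
def fpr_at_95_tpr(id_scores, ood_scores):
    """False positive rate on OOD objects at the threshold keeping 95% of ID objects."""
    ranked = sorted(id_scores, reverse=True)
    # Smallest count of top-ranked ID objects that reaches 95% recall
    # (integer ceiling of 0.95 * n, avoiding floating-point rounding).
    k = (95 * len(ranked) + 99) // 100
    threshold = ranked[k - 1]
    accepted_ood = sum(1 for s in ood_scores if s >= threshold)
    return accepted_ood / len(ood_scores)


def auroc(id_scores, ood_scores):
    """Probability that an ID score exceeds an OOD score; ties count as one half."""
    wins = 0.0
    for a in id_scores:
        for b in ood_scores:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(id_scores) * len(ood_scores))


# Toy scores (hypothetical): one OOD object scores above the weakest ID object.
id_scores = [0.9, 0.8, 0.7, 0.6]
ood_scores = [0.65, 0.3, 0.2, 0.1]
print(fpr_at_95_tpr(id_scores, ood_scores))  # 0.25
print(auroc(id_scores, ood_scores))          # 0.9375
```

The pairwise AUROC loop is quadratic in the number of objects, which is fine for illustration; a rank-based implementation would be preferable at benchmark scale.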

Abstract

Out-of-distribution (OOD) object detection is a challenging task due to the absence of open-set OOD data. Inspired by recent advancements in text-to-image generative models, such as Stable Diffusion, we study the potential of generative models trained on large-scale open-set data to synthesize OOD samples, thereby enhancing OOD object detection. We introduce SyncOOD, a simple data curation method that capitalizes on the capabilities of large foundation models to automatically extract meaningful OOD data from text-to-image generative models. This offers the model access to open-world knowledge encapsulated within off-the-shelf foundation models. The synthetic OOD samples are then employed to augment the training of a lightweight, plug-and-play OOD detector, thus effectively optimizing the in-distribution (ID)/OOD decision boundaries. Extensive experiments across multiple benchmarks demonstrate that SyncOOD significantly outperforms existing methods, establishing new state-of-the-art performance with minimal synthetic data usage.
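As a rough illustration of what a lightweight, plug-and-play OOD detector could look like, the sketch below fits a logistic-regression head on fixed instance features, with ID objects labelled 1 and synthetic OOD objects labelled 0. This is an assumption about the general form only: the abstract does not specify the head's architecture or loss, and the function names, hyperparameters, and toy 2-D features are hypothetical. The feature extractor is treated as frozen, so only the head's weights are updated.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_ood_head(id_feats, ood_feats, lr=0.5, epochs=200):
    """Fit a binary head on frozen features with full-batch gradient descent."""
    data = [(f, 1.0) for f in id_feats] + [(f, 0.0) for f in ood_feats]
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in data:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # gradient of binary cross-entropy w.r.t. the logit
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        n = len(data)
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def id_probability(w, b, x):
    """Probability that feature x belongs to an ID object."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Toy frozen features (hypothetical): ID and synthetic OOD objects are separable.
id_feats = [[1.0, 0.2], [0.9, 0.1], [1.1, 0.3]]
ood_feats = [[0.2, 1.0], [0.1, 0.9], [0.3, 1.1]]
w, b = train_ood_head(id_feats, ood_feats)
print(id_probability(w, b, [1.0, 0.2]) > 0.5)  # True
print(id_probability(w, b, [0.2, 1.0]) < 0.5)  # True
```

Because only the head is trained, the base detector's ID predictions are left unchanged, which is what makes such a head attachable to an existing detector.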

Paper Structure (35 sections, 6 equations, 4 figures, 5 tables)

Introduction
Related Work
OOD Object Detection
Open-world Object Detection
OOD Image Classification
Foundation Models
Method
Preliminary
Overview
Synthesizing Semantic-novel Objects in Scene Images
Imagining Novel Concepts from ID objects
Editing Objects on Selected Regions
Refining Annotation Boxes of Novel Objects
Mining Hard OOD Samples and Model Training
Mining Hard OOD Objects with High Visual Similarities for Training
...and 20 more sections

Figures (4)

Figure 1: Our pipeline replaces ID objects with semantic-novel yet visually similar objects for scene-level OOD object synthesis. Middle left: The concepts are imagined by an LLM to ensure semantic separability and rationality, and reformulated as text prompts for controllable in-painting using Stable Diffusion. Middle right: During training, only visually similar OOD objects are adopted, based on instance-level feature similarity to the original object. A lightweight binary classifier is optimized for OOD detection, and other parts of the detector are kept unchanged.
Figure 2: Detailed illustration of our outlier synthesis pipeline. It comprises (a) Instructing an LLM to imagine semantic-novel concepts given ID objects, (b) Editing the selected regions to the expected concepts via prompt-conditioned image inpainting using Stable Diffusion, and (c) Refining the bounding boxes of edited objects using SAM.
Figure 3: We show cases on six intervals of feature similarity (consistent with \ref{eq:sim}, indicated at the bottom of the figure). The first row contains the corresponding initial images, the second row contains the synthetic images with the corresponding boxes of the novel objects (yellow boxes), and the third row contains the difference heat maps of the latent feature maps extracted from the above image pairs (superimposed on the corresponding synthetic images, denoted as $\text{Diff-map}$).
Figure 4: We edit the context of the synthetic data (in the blue box) so that the images contain both novel objects and novel context (in the orange box). We then calculate the similarity between the instance-level features of the corresponding objects in all synthetic images and the instance-level feature in the initial image (left, the bird). The similarities and the corresponding difference maps are shown in the figure.
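The similarity-based selection described in the captions above can be sketched as follows. This is a minimal illustration that assumes cosine similarity between the instance-level feature of the original ID object and that of its edited counterpart, and keeps only pairs whose similarity falls inside a chosen interval. The use of cosine similarity, the bounds `low` and `high`, and the toy vectors are assumptions made for the example; the paper's own similarity definition is the one referenced in Figure 3.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two feature vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_hard_ood(pairs, low=0.5, high=0.95):
    """Keep synthetic objects whose features are similar, but not identical, to the original.

    pairs: list of (id_feature, synthetic_feature) tuples.
    Returns a list of (index, similarity) for the retained pairs.
    """
    kept = []
    for idx, (id_feat, syn_feat) in enumerate(pairs):
        sim = cosine_similarity(id_feat, syn_feat)
        # Hypothetical rule: below `low` the object is treated as too easy,
        # at or above `high` the edit is treated as having barely changed it.
        if low <= sim < high:
            kept.append((idx, sim))
    return kept


# Toy features (hypothetical): an unchanged object, a moderately similar one,
# and an unrelated one.
pairs = [
    ([1.0, 0.0], [1.0, 0.0]),
    ([1.0, 0.0], [1.0, 1.0]),
    ([1.0, 0.0], [0.0, 1.0]),
]
for idx, sim in select_hard_ood(pairs):
    print(idx, round(sim, 3))  # 1 0.707
```

Only the second pair is retained here: the first is indistinguishable from the original object and the third shares no visual evidence with it, so neither would help place the decision boundary close to the ID class.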
