Detecting Out-of-Distribution Objects through Class-Conditioned Inpainting

Quang-Huy Nguyen, Jin Peng Zhou, Zhenzhen Liu, Khanh-Huyen Bui, Kilian Q. Weinberger, Wei-Lun Chao, Dung D. Le

TL;DR

This work tackles the problem of detecting out-of-distribution (OOD) objects in object detectors, where overconfidence on unseen categories undermines trust. It introduces RONIN, a post-hoc, zero-shot framework that performs class-conditioned inpainting on detected objects using off-the-shelf diffusion models and assesses OOD status with a vision-language triplet similarity score. The key contribution is the S_triplet metric, which combines visual and semantic alignments to distinguish ID from OOD objects, augmented by near-OOD refinement prompts for closely related categories. Experiments across VOC, BDD100k, COCO, and OpenImages show that RONIN often surpasses zero-shot and non-zero-shot baselines, with robustness across diffusion models and detector types, making it practical for offline post-processing in dynamic environments.

Abstract

Recent object detectors have achieved impressive accuracy in identifying objects seen during training. However, real-world deployment often introduces novel and unexpected objects, referred to as out-of-distribution (OOD) objects, posing significant challenges to model trustworthiness. Modern object detectors are typically overconfident, making it unreliable to use their predictions alone for OOD detection. To address this, we propose leveraging an auxiliary model as a complementary solution. Specifically, we utilize an off-the-shelf text-to-image generative model, such as Stable Diffusion, which is trained with objective functions distinct from those of discriminative object detectors. We hypothesize that this fundamental difference enables the detection of OOD objects by measuring inconsistencies between the models. Concretely, for a given detected object bounding box and its predicted in-distribution class label, we perform class-conditioned inpainting on the image with the object removed. If the object is OOD, the inpainted image is likely to deviate significantly from the original, making the reconstruction error a robust indicator of OOD status. Extensive experiments demonstrate that our approach consistently surpasses existing zero-shot and non-zero-shot OOD detection methods, establishing a robust framework for enhancing object detection systems in dynamic environments.
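The mask-then-resynthesize step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `inpaint_fn` is a hypothetical callable standing in for any off-the-shelf text-conditioned inpainting model, the prompt template is assumed, and the mean absolute pixel difference is only a crude stand-in for the vision-language similarity that RONIN uses.

```python
import numpy as np

def box_mask(height, width, box):
    """Binary mask that is 1 inside the bounding box (x0, y0, x1, y1), 0 elsewhere."""
    x0, y0, x1, y1 = box
    mask = np.zeros((height, width), dtype=np.uint8)
    mask[y0:y1, x0:x1] = 1
    return mask

def inpaint_and_compare(image, box, label, inpaint_fn):
    """Remove the detected object, resynthesize it from the predicted label,
    and return the mean absolute pixel difference inside the box.

    inpaint_fn is a hypothetical callable (image, mask, prompt) -> image;
    it is an assumption of this sketch, not an API of any real library.
    """
    h, w = image.shape[:2]
    mask = box_mask(h, w, box)
    prompt = f"a photo of a {label}"  # assumed prompt template
    inpainted = inpaint_fn(image, mask, prompt)
    region = mask.astype(bool)
    diff = np.abs(inpainted.astype(np.float32) - image.astype(np.float32))
    return float(diff[region].mean())
```

In this sketch, a large deviation inside the box suggests that the predicted label does not describe the object that was actually there, which is the inconsistency the abstract proposes to exploit.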

Paper Structure

This paper contains 18 sections, 3 equations, 7 figures, and 9 tables.

Figures (7)

  • Figure 1: Intuition behind RONIN for OOD detection. Object detectors can overconfidently make wrong predictions on unseen objects based on a pre-defined set of ID labels; in this case, predicting two OOD deer as "sheep". RONIN leverages the label predictions to condition the resynthesizing process of an off-the-shelf text-to-image diffusion model, producing similar inpaintings for correct predictions and dissimilar inpaintings for incorrect ones. Under a similarity measurement, RONIN can identify wrong predictions on unseen objects, thereby enabling OOD detection.
  • Figure 2: Overall framework of RONIN, including (i) Class-conditioned Inpainting and (ii) Vision-language Triplet Measurement. Given an image with bounding boxes and predicted labels, RONIN masks and reconstructs objects via inpainting, then evaluates alignment using triplet similarity for zero-shot OOD detection.
  • Figure 3: Triplet similarity relationships between (i) the original object, (ii) the inpainted outcome, and (iii) the predicted label. ID samples show strong alignments across all three, whereas OOD samples exhibit weak alignments, aiding effective OOD detection.
  • Figure 4: Side-by-side quantitative visualization. Good cases show synthesized objects consistent with the predicted labels and clear OOD score separation; bad cases show inpainting failures or OOD objects whose inpaintings too closely resemble the originals, leading to ineffective OOD scores.
  • Figure 5: RONIN performance on near-OOD with refined prompting. By providing more context in the prompt, RONIN produces more distinct inpaintings and yields lower scores for near-OOD objects.
  • ...and 2 more figures
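
The triplet relationship described in Figures 2 and 3 (original object, inpainted outcome, predicted label) can be illustrated with a toy computation. The exact form of S_triplet is not given in this summary, so the sketch below simply reports the three pairwise cosine similarities and their mean; that combination is an assumption made for illustration. The embeddings are assumed to come from some joint vision-language encoder, which is not modeled here.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def triplet_alignment(orig_emb, inpaint_emb, label_emb):
    """Pairwise similarities among the three embeddings plus their mean.

    The mean is a hypothetical aggregate, not the paper's S_triplet.
    """
    sims = {
        "original_inpainted": cosine(orig_emb, inpaint_emb),
        "original_label": cosine(orig_emb, label_emb),
        "inpainted_label": cosine(inpaint_emb, label_emb),
    }
    sims["mean"] = sum(sims.values()) / 3.0
    return sims
```

Under this toy scoring, a case where all three embeddings agree gives a mean close to 1, while a case where the original object disagrees with both the inpainting and the label gives a lower mean, mirroring the strong versus weak alignments described for Figure 3.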