RUNA: Object-level Out-of-Distribution Detection via Regional Uncertainty Alignment of Multimodal Representations

Bin Zhang, Jinggang Chen, Xiaoyang Qu, Guokuan Li, Kai Lu, Jiguang Wan, Jing Xiao, Jianzong Wang

TL;DR

This work targets object-level OOD detection by leveraging pre-trained vision-language representations. It introduces RUNA, a dual-encoder framework that fuses global and region-focused image features and aligns regional uncertainty with an ID semantic space built from CLIP text embeddings. A few-shot fine-tuning strategy further sharpens the discrimination between ID and OOD objects by locally aligning region embeddings to ID concepts, while a post-hoc regional uncertainty metric converts detection uncertainty into a distance from the ID space. Empirical results on VOC/BDD-100K as ID with OpenImages and MSCOCO as OOD demonstrate clear gains over state-of-the-art methods, highlighting the practical value of regional CLIP-based alignment for robust, detector-agnostic OOD detection in complex scenes.

Abstract

Enabling object detectors to recognize out-of-distribution (OOD) objects is vital for building reliable systems. A primary obstacle is that models usually receive no supervisory signal from unfamiliar data, which leads to overconfident predictions on OOD objects. Whereas previous work estimates OOD uncertainty from the detection model and in-distribution (ID) samples, we explore using pre-trained vision-language representations for object-level OOD detection. We first discuss the limitations of applying image-level CLIP-based OOD detection methods to object-level scenarios. Building upon these insights, we propose RUNA, a novel framework that leverages a dual-encoder architecture to capture rich contextual information and employs a regional uncertainty alignment mechanism to distinguish ID from OOD objects effectively. We also introduce a few-shot fine-tuning approach that aligns region-level semantic representations to further improve the model's ability to discriminate between similar objects. Our experiments show that RUNA substantially surpasses state-of-the-art methods in object-level OOD detection, particularly in challenging scenarios with diverse and complex object instances.

Paper Structure

This paper contains 13 sections, 7 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Object detectors in the open world tend to make erroneous decisions when facing unknown objects, threatening machine learning system security. To mitigate this, we adapt knowledge-rich vision-language representations into the ID concept space for object-level OOD detection.
  • Figure 2: Framework of CLIP-based OOD Detection. Green arrows represent ID samples, while red arrows denote OOD samples. The solid line highlights the maximum similarity, and the dotted lines indicate other similarity measures.
  • Figure 3: Overview of the RUNA framework. Our novel dual-encoder architecture computes regional object uncertainty by extracting global and regional image features and aligning them with text features. During fine-tuning, the image encoder handling regional images remains frozen, and only its projection layer (P) is updated. The upper-right dashed box highlights the importance of our feature fusion strategy: when encountering objects with similar semantics, limited local features can lead to incorrect decisions. By incorporating global features, the model can make more informed judgments.
  • Figure 4: Distribution of uncertainty scores for Direct Sum, MCM ($\tau = 1$), MCM ($\tau = 100$), and Max Similarity.
  • Figure 5: Ablation study on the number of fine-tuning samples (shots). This study examines how varying the number of shots affects detection performance, showing the trade-offs between data efficiency and detection quality.
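
The scoring functions named in Figures 2 and 4 can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each score is computed from the cosine similarities between one region embedding and the K ID text embeddings, that "Direct Sum" means the plain sum of those similarities, and that MCM applies a softmax in which the similarities are divided by $\tau$ before taking the maximum; the paper may define these differently. All function names and the toy embeddings are hypothetical.

```python
import numpy as np


def cosine_similarities(region_feature, text_features):
    """Cosine similarity between one region embedding and K text embeddings."""
    r = region_feature / np.linalg.norm(region_feature)
    t = text_features / np.linalg.norm(text_features, axis=1, keepdims=True)
    return t @ r  # shape (K,)


def max_similarity(sims):
    """Largest similarity to any ID concept (assumed reading of 'Max Similarity')."""
    return float(np.max(sims))


def direct_sum(sims):
    """Plain sum of similarities (assumed reading of 'Direct Sum')."""
    return float(np.sum(sims))


def mcm(sims, tau):
    """Maximum softmax probability over similarities divided by tau.

    Dividing by tau is an assumed temperature convention.
    """
    logits = sims / tau
    logits = logits - np.max(logits)  # shift for numerical stability
    probs = np.exp(logits) / np.sum(np.exp(logits))
    return float(np.max(probs))


# Toy stand-ins for CLIP embeddings: three ID concepts, one region that
# coincides with the first concept.
text_features = np.eye(3)
region_feature = np.array([1.0, 0.0, 0.0])

sims = cosine_similarities(region_feature, text_features)
scores = {
    "direct_sum": direct_sum(sims),
    "mcm_tau_1": mcm(sims, 1.0),
    "mcm_tau_100": mcm(sims, 100.0),
    "max_similarity": max_similarity(sims),
}
print(scores)
```

In this sketch a higher score means the region looks more ID-like, so an uncertainty value can be obtained by negating it. Under the assumed convention, a larger $\tau$ flattens the softmax and pushes the MCM score toward 1/K, which is one reason the choice of $\tau$ changes how well ID and OOD score distributions separate.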