RUNA: Object-level Out-of-Distribution Detection via Regional Uncertainty Alignment of Multimodal Representations
Bin Zhang, Jinggang Chen, Xiaoyang Qu, Guokuan Li, Kai Lu, Jiguang Wan, Jing Xiao, Jianzong Wang
TL;DR
This work targets object-level OOD detection by leveraging pre-trained vision-language representations. It introduces RUNA, a dual-encoder framework that fuses global and region-focused image features and aligns regional uncertainty with an ID semantic space built from CLIP text embeddings. A few-shot fine-tuning strategy further sharpens the discrimination between ID and OOD objects by locally aligning region embeddings to ID concepts, while a post-hoc regional uncertainty metric converts detection uncertainty into a distance from the ID space. Experiments with VOC and BDD-100K as ID datasets and OpenImages and MSCOCO as OOD datasets show gains over state-of-the-art methods, highlighting the practical value of regional CLIP-based alignment for robust, detector-agnostic OOD detection in complex scenes.
Abstract
Enabling object detectors to recognize out-of-distribution (OOD) objects is vital for building reliable systems. A primary obstacle is that models rarely receive supervisory signals from unfamiliar data, which leads to overconfident predictions on OOD objects. Whereas previous work estimates OOD uncertainty from the detection model and in-distribution (ID) samples, we explore using pre-trained vision-language representations for object-level OOD detection. We first discuss the limitations of applying image-level CLIP-based OOD detection methods to object-level scenarios. Building upon these insights, we propose RUNA, a novel framework that leverages a dual-encoder architecture to capture rich contextual information and employs a regional uncertainty alignment mechanism to effectively distinguish ID from OOD objects. We also introduce a few-shot fine-tuning approach that aligns region-level semantic representations, further improving the model's ability to discriminate between similar objects. Our experiments show that RUNA substantially surpasses state-of-the-art methods in object-level OOD detection, particularly in challenging scenarios with diverse and complex object instances.
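To make the idea of measuring uncertainty as a distance from an ID semantic space concrete, the following is a minimal sketch and not the formulation used in RUNA. It assumes that each detected region and each ID class name has already been mapped to an embedding (for example by CLIP image and text encoders), and it scores a region by one minus its highest cosine similarity to any ID class embedding. The function names, the toy 3-dimensional vectors, and the threshold value are all hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def regional_uncertainty(region_emb, id_text_embs):
    """Hypothetical score: 1 - max cosine similarity to any ID class embedding.

    A low score means the region lies close to some ID concept;
    a high score means it is far from all of them.
    """
    return 1.0 - max(cosine(region_emb, t) for t in id_text_embs)


def is_ood(region_emb, id_text_embs, threshold=0.5):
    """Flag a region as OOD when its score exceeds an assumed threshold."""
    return regional_uncertainty(region_emb, id_text_embs) > threshold


# Toy stand-ins for embeddings; real CLIP embeddings have hundreds of dimensions.
id_text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # two ID class "text" embeddings
near_id_region = [0.9, 0.1, 0.0]                   # close to the first ID class
far_region = [0.0, 0.1, 0.9]                       # far from both ID classes

score_id = regional_uncertainty(near_id_region, id_text_embs)
score_ood = regional_uncertainty(far_region, id_text_embs)
```

In this toy setup the region near an ID class receives a lower score than the distant one. In practice the threshold would be chosen on held-out ID data rather than fixed in advance.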
