ORIC: Benchmarking Object Recognition under Contextual Incongruity in Large Vision-Language Models
Zhaoyang Li, Zhan Ling, Yuchen Zhou, Litian Gong, Erdem Bıyık, Hao Su
TL;DR
This work identifies contextual incongruity as a major source of uncertainty in large vision-language models (LVLMs), showing that objects placed in atypical contexts can be misidentified and that objects expected in a scene but absent can be hallucinated. It introduces ORIC, a framework that builds incongruous object-context data through LLM-guided sampling of present objects (positives) and CLIP-guided sampling of plausible but absent objects (negatives), and uses it both to evaluate and to train LVLMs. ORIC-Bench, built on MSCOCO, quantifies performance gaps across 18 LVLMs and 2 open-vocabulary detectors. The authors report substantial performance drops under incongruity and show that Visual Reinforcement Fine-Tuning with GRPO on ORIC-style data yields more human-aligned reasoning and improved robustness across related benchmarks. The results underscore the need for uncertainty-aware training regimes in LVLMs and provide practical tools for developing more reliable perception in complex, context-shifting environments.
Abstract
Large Vision-Language Models (LVLMs) excel at captioning, visual question answering, and robotics by combining vision and language, yet they often miss obvious objects or hallucinate nonexistent ones in atypical scenes. We examine these failures through the lens of uncertainty, focusing on contextual incongruity, where objects appear unexpectedly or fail to appear in expected contexts, and show that such cases increase recognition difficulty for state-of-the-art LVLMs. To study this regime, we introduce the Object Recognition in Incongruous Context (ORIC) framework, which constructs incongruous object-context pairs through two complementary strategies: (1) LLM-guided sampling to identify hard-to-recognize objects present in the image and (2) CLIP-guided sampling to mine plausible but absent ones. Applied to MSCOCO, ORIC produces ORIC-Bench and ORIC-style training data. Evaluating 18 LVLMs and 2 open-vocabulary detectors reveals substantial performance drops and bias patterns under incongruous contexts. Fine-tuning Qwen3-VL-8B-Instruct with Visual Reinforcement Fine-Tuning on 600 ORIC-style samples improves results on ORIC-Bench, AMBER, and HallusionBench. Overall, we show that contextual incongruity is a key source of uncertainty and provide tools for more reliable LVLMs. The code is available at https://github.com/ZhaoyangLi-1/ORIC.
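To make the CLIP-guided strategy concrete, the following minimal sketch shows one way "plausible but absent" objects could be mined: rank the class labels that are not annotated in the image by the cosine similarity between an image embedding and each label's text embedding, then keep the top few. This is an illustration under stated assumptions, not the authors' implementation. The function name `sample_absent_candidates`, the ranking rule, and the hand-written three-dimensional vectors are all hypothetical; a real pipeline would obtain the embeddings from CLIP's image and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def sample_absent_candidates(image_emb, label_embs, present, k=2):
    """Return the k labels absent from the image that best match its context.

    Assumption: a high image-text similarity for an absent label means the
    object is contextually plausible, so a question about it makes a hard
    negative.
    """
    scored = [
        (cosine(image_emb, emb), name)
        for name, emb in label_embs.items()
        if name not in present  # keep only objects not annotated in the image
    ]
    scored.sort(reverse=True)  # most similar first
    return [name for _, name in scored[:k]]


# Toy stand-ins for CLIP embeddings (hypothetical values).
label_embs = {
    "fork": [0.9, 0.1, 0.0],
    "oven": [0.8, 0.3, 0.1],
    "surfboard": [0.0, 0.2, 0.9],
    "dining table": [1.0, 0.0, 0.0],
}
image_emb = [0.95, 0.15, 0.05]  # a kitchen-like scene
present = {"dining table"}      # ground-truth objects in the image

negatives = sample_absent_candidates(image_emb, label_embs, present, k=2)
print(negatives)
```

In this toy setup the kitchen-related labels outrank "surfboard", so the selected negatives are objects the scene makes plausible even though they are not there, which is the kind of case the abstract describes as prone to hallucination.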
