ORIC: Benchmarking Object Recognition under Contextual Incongruity in Large Vision-Language Models

Zhaoyang Li, Zhan Ling, Yuchen Zhou, Litian Gong, Erdem Bıyık, Hao Su

TL;DR

This work identifies contextual incongruity as a major source of uncertainty in large vision-language models, showing that objects placed in atypical contexts can cause misidentification or hallucination. It introduces ORIC, a framework that builds incongruous object-context data via LLM-guided positives and CLIP-guided negatives to evaluate and train LVLMs, and establishes ORIC-Bench on MSCOCO to quantify performance gaps across 18 LVLMs and 2 detectors. The authors demonstrate substantial performance drops under incongruity and show that Visual Reinforcement Fine-Tuning with GRPO on ORIC-style data yields more human-aligned reasoning and improved robustness across related benchmarks. The results underscore the need for uncertainty-aware training regimes in LVLMs and provide practical tools for developing more reliable perception in complex, context-shifting environments.

Abstract

Large Vision-Language Models (LVLMs) excel at captioning, visual question answering, and robotics by combining vision and language, yet they often miss obvious objects or hallucinate nonexistent ones in atypical scenes. We examine these failures through the lens of uncertainty, focusing on contextual incongruity, where objects appear unexpectedly or fail to appear in expected contexts, and show that such cases increase recognition difficulty for state-of-the-art LVLMs. To study this regime, we introduce the Object Recognition in Incongruous Context (ORIC) framework, which constructs incongruous object-context pairs through two complementary strategies: (1) LLM-guided sampling to identify hard-to-recognize objects present in the image and (2) CLIP-guided sampling to mine plausible but absent ones. Applied to MSCOCO, ORIC produces ORIC-Bench and ORIC-style training data. Evaluating 18 LVLMs and 2 open-vocabulary detectors reveals substantial performance drops and bias patterns under incongruous contexts. Fine-tuning Qwen3-VL-8B-Instruct with Visual Reinforcement Fine-Tuning on 600 ORIC-style samples improves results on ORIC-Bench, AMBER, and HallusionBench. Overall, we show that contextual incongruity is a key source of uncertainty and provide tools for more reliable LVLMs. The code is available at https://github.com/ZhaoyangLi-1/ORIC.

Paper Structure

This paper contains 58 sections, 11 equations, 14 figures, 10 tables, 2 algorithms.

Figures (14)

  • Figure 1: Contextual Incongruity Leads to Recognition Failures. This figure illustrates how incongruous contexts cause two primary errors: misidentification of present objects and hallucination of absent ones. Left (Misidentification): In an office, GPT-5 identifies the expected "mouse" (purple) but fails to recognize the out-of-context "train" (red). Right (Hallucination): On a baseball field, the model correctly denies an unrelated "car" but hallucinates a plausible yet nonexistent "sports ball."
  • Figure 2: Comparison of POPE and Incongruous Context Questions. Both examples use the same image but differ in target objects. Left: On a baseball field, POPE targets a baseball bat (purple), while our question targets a large vehicle (red), which is less related to the scene and thus more incongruous. Both labels are “yes.” Right: In a rural scene with a cow, POPE targets a truck, while our question targets a sheep, which is more contextually plausible but still absent, increasing incongruity. Both labels are “no.”
  • Figure 3: Object–Context Congruity via CLIPScore. CLIPScore quantifies alignment between queried objects and scene context. (a) For “yes” questions, the POPE subset yields higher scores than the incongruous variants (23.83 vs. 20.77); for “no” questions, the reverse holds (22.87 vs. 20.18), indicating stronger misleading cues. (b) The sampled POPE subset shows a CLIPScore distribution consistent with that of the full dataset, confirming its representativeness. (c) ORIC questions exhibit even higher incongruity (e.g., 24.26 for “no”), reinforcing the contextual challenge. Subplots (a) and (c) share images but differ in queried objects. Error bars show 95% confidence intervals.
  • Figure 4: ORIC Method Overview. This figure shows the two construction methods of ORIC. LLM-Guided Sampling (Positive Question Construction): First, given an image $I$, objects are classified as ROI if their combined bounding box area is under $50\%$; otherwise, they are non-ROI. Next, we query the LLM (GPT-5) with the textual categories of the non-ROI objects to predict the existence of each ROI object based on common sense and co-occurrence. Finally, we select the top $k$ unpredictable ROI objects (e.g., $k=3$), i.e., those for which the LLM predicts “no” (e.g., apple, banana, and orange). CLIP-Guided Sampling (Negative Question Construction): An image $I^{\prime}$ similar to $I$ is identified using cosine distance. We then compute the CLIPScore of each nonexistent ROI object against $I^{\prime}$ and select the top $k$ nonexistent ROI objects by score. For example, the top three are an oven (57.46), a microwave (21.79), and a spoon (16.32).
  • Figure 5: Performance across Benchmarks. Macro F1 on HallusionBench and AMBER under three settings: with/without zero-shot CoT and Visual-RFT fine-tuning.
  • ...and 9 more figures
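
The two sampling steps described in the Figure 4 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `ObjectCategory`, `split_roi`, `select_positives`, and `select_negatives` are hypothetical, the 50% threshold is assumed to be relative to the image area, the LLM query is stood in for by a caller-supplied function, and CLIPScores are assumed to be precomputed. The caption does not say how the "unpredictable" objects are ranked, so the sketch simply keeps the first $k$.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# COCO-style box: (x, y, width, height) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class ObjectCategory:
    name: str
    boxes: List[Box]  # all instances of this category in the image


def area_fraction(boxes: List[Box], image_w: float, image_h: float) -> float:
    """Summed box area over image area (overlaps are double-counted here)."""
    return sum(w * h for _, _, w, h in boxes) / (image_w * image_h)


def split_roi(categories: List[ObjectCategory], image_w: float, image_h: float,
              threshold: float = 0.5) -> Tuple[List[str], List[str]]:
    """Categories covering less than `threshold` of the image are ROI
    (assumption: the caption's 50% is a fraction of the image area)."""
    roi, non_roi = [], []
    for cat in categories:
        frac = area_fraction(cat.boxes, image_w, image_h)
        (roi if frac < threshold else non_roi).append(cat.name)
    return roi, non_roi


def select_positives(roi: List[str], non_roi: List[str],
                     llm_predicts_present: Callable[[str, List[str]], bool],
                     k: int = 3) -> List[str]:
    """Keep up to k ROI objects the LLM would NOT expect given only the
    non-ROI category names. `llm_predicts_present` is a hypothetical stand-in
    for the LLM query; taking the first k is an assumption of this sketch."""
    surprising = [name for name in roi if not llm_predicts_present(name, non_roi)]
    return surprising[:k]


def select_negatives(absent_scores: Dict[str, float], k: int = 3) -> List[str]:
    """Top-k absent objects by (precomputed) CLIPScore against the similar image."""
    ranked = sorted(absent_scores.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Scores for oven, microwave, and spoon are from the Figure 4 caption;
# the "skis" entry is a made-up filler to show that lower scores are dropped.
scores = {"spoon": 16.32, "oven": 57.46, "skis": 5.0, "microwave": 21.79}
print(select_negatives(scores, k=3))  # ['oven', 'microwave', 'spoon']
```

Passing the LLM call in as a function keeps the sketch runnable offline; in practice it would wrap a prompt that lists the non-ROI categories and asks whether the ROI object is likely present.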