VCU-Bridge: Hierarchical Visual Connotation Understanding via Semantic Bridging

Ming Zhong, Yuanlei Wang, Liuzhou Zhang, Arctanx An, Renrui Zhang, Hao Liang, Ming Lu, Ying Shen, Wentao Zhang

TL;DR

This paper presents VCU-Bridge, a framework that operationalizes a human-like hierarchy of visual connotation understanding: multi-level reasoning that advances from foundational perception through semantic bridging to abstract connotation, with an explicit evidence-to-inference trace from concrete cues to abstract conclusions.

Abstract

While Multimodal Large Language Models (MLLMs) excel on benchmarks, their processing paradigm differs from the human ability to integrate visual information. Unlike humans, who naturally bridge details and high-level concepts, models tend to treat these elements in isolation. Prevailing evaluation protocols often decouple low-level perception from high-level reasoning, overlooking their semantic and causal dependencies, which yields non-diagnostic results and obscures performance bottlenecks. We present VCU-Bridge, a framework that operationalizes a human-like hierarchy of visual connotation understanding: multi-level reasoning that advances from foundational perception through semantic bridging to abstract connotation, with an explicit evidence-to-inference trace from concrete cues to abstract conclusions. Building on this framework, we construct HVCU-Bench, a benchmark for hierarchical visual connotation understanding with explicit, level-wise diagnostics. Comprehensive experiments demonstrate a consistent decline in performance as reasoning progresses to higher levels. We further develop a data generation pipeline for instruction tuning guided by Monte Carlo Tree Search (MCTS) and show that strengthening low-level capabilities yields measurable gains at higher levels. Interestingly, training on this data not only improves results on HVCU-Bench but also benefits general benchmarks (average +2.53%), with an especially substantial gain on MMStar (+7.26%), demonstrating the significance of the hierarchical thinking pattern and its effectiveness in enhancing MLLM capabilities. The project page is at https://vcu-bridge.github.io.

Paper Structure

This paper contains 65 sections, 29 figures, 10 tables.

Figures (29)

  • Figure 1: Showcase of the different patterns exhibited by humans and models. Models can appear capable by correctly answering both concrete and abstract questions while fundamentally failing at the reasoning that bridges them. Current evaluation may therefore miss critical reasoning failures whenever a model produces correct answers at both the concrete and abstract levels.
  • Figure 2: Overview of HVCU-Bench. We evaluate MLLMs across 3 task families spanning 15 diverse aspects (top left). Our benchmark employs hierarchical decomposition: each question is systematically broken down into sub-questions across three levels ($L_{perc}$, $L_{bridge}$, $L_{conn}$), with validation ensuring logical coherence. During evaluation, models progress from low to high levels, constructing inter-level reasoning chains that emulate human visual comprehension (bottom). While GPT-4o achieves top performance among MLLMs, it falls substantially short of human capability, exposing a significant gap (top right).
  • Figure 3: Overview of our hierarchical data generation pipeline. An MCTS-driven approach for generating high-quality hierarchical training data.
  • Figure 4: Performance of Qwen3-VL-4B-Bridge on (left) HVCU-Bench and (right) general benchmarks.
  • Figure 5: Representative case study. Training on hierarchical data corrects semantic bridging failures.
  • ...and 24 more figures
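
The level-wise diagnostics described in Figures 1 and 2 can be illustrated with a small sketch. It assumes each benchmark item yields one correct/incorrect outcome per level ($L_{perc}$, $L_{bridge}$, $L_{conn}$). The names `ChainResult`, `level_accuracy`, `full_chain_accuracy`, and `bridging_failures` are hypothetical, the sample outcomes are made up for illustration, and the metrics shown are not necessarily the ones HVCU-Bench reports.

```python
from dataclasses import dataclass

# Hypothetical level names mirroring L_perc, L_bridge, L_conn.
LEVELS = ("perc", "bridge", "conn")


@dataclass
class ChainResult:
    """Per-item correctness at each level (illustrative structure only)."""
    perc: bool
    bridge: bool
    conn: bool


def level_accuracy(results):
    """Fraction of items answered correctly at each level, scored independently."""
    n = len(results)
    return {lvl: sum(getattr(r, lvl) for r in results) / n for lvl in LEVELS}


def full_chain_accuracy(results):
    """Fraction of items answered correctly at all three levels."""
    return sum(r.perc and r.bridge and r.conn for r in results) / len(results)


def bridging_failures(results):
    """Indices of items that are right at both ends but wrong at the bridge,
    the failure pattern that Figure 1 says end-only evaluation can miss."""
    return [i for i, r in enumerate(results) if r.perc and r.conn and not r.bridge]


# Made-up outcomes for four items, for illustration only.
results = [
    ChainResult(perc=True, bridge=True, conn=True),
    ChainResult(perc=True, bridge=False, conn=True),
    ChainResult(perc=True, bridge=False, conn=False),
    ChainResult(perc=False, bridge=False, conn=False),
]

print(level_accuracy(results))
print(full_chain_accuracy(results))
print(bridging_failures(results))
```

Scoring the bridge level separately is what makes the second item visible: an evaluation that checked only the perception and connotation answers would count it as fully correct.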