VKnowU: Evaluating Visual Knowledge Understanding in Multimodal LLMs

Tianxiang Jiang, Sheng Xia, Yicheng Xu, Linquan Wu, Xiangyu Zeng, Limin Wang, Yu Qiao, Yi Wang

TL;DR

VKnowU is a video-based benchmark that quantifies visual knowledge understanding in multimodal LLMs, with eight tasks split into world-centric and human-centric domains. It shows that current SOTA MLLMs still lag behind human performance, especially on world-centric knowledge, highlighting the need for richer world models. To address this, the authors propose VideoKnow+, a baseline that integrates visual knowledge via a See-Think-Answer framework and a visual knowledge reward, trained with VKnowQA to improve grounding and generalization. The work demonstrates that explicit visual-knowledge grounding can boost performance on VKnowU and other video benchmarks, suggesting a key direction toward more generalizable, reasoning-capable MLLMs. The findings advocate for broader inclusion of visual knowledge in model design and evaluation to better bridge perception and real-world reasoning.

Abstract

While Multimodal Large Language Models (MLLMs) have become adept at recognizing objects, they often lack the intuitive, human-like understanding of the world's underlying physical and social principles. This high-level vision-grounded semantics, which we term visual knowledge, forms a bridge between perception and reasoning, yet remains an underexplored area in current MLLMs. To systematically evaluate this capability, we present VKnowU, a comprehensive benchmark featuring 1,680 questions across 1,249 videos, covering 8 core types of visual knowledge spanning both world-centric (e.g., intuitive physics) and human-centric (e.g., subjective intentions) domains. Evaluation of 23 SOTA MLLMs reveals that leading models still fall short of human performance, with particularly notable gaps on world-centric tasks. To bridge this gap, we introduce a new dataset, VKnowQA, and VideoKnow+, a baseline model that explicitly incorporates visual knowledge into MLLMs. VideoKnow+ follows a structured See-Think-Answer paradigm and adopts reinforcement learning with a visual knowledge reward, achieving a +3.7% improvement on VKnowU and consistent gains on MVBench, Video-MME, and MMVU. Our work highlights visual knowledge as a missing cornerstone for developing more generalizable MLLMs that can not only see but also truly understand our physical and social worlds.

Paper Structure

This paper contains 66 sections, 4 equations, 27 figures, 11 tables.

Figures (27)

  • Figure 1: VKnowU systematically evaluates visual knowledge understanding of MLLMs across world-centric and human-centric tasks, marking a shift from mere seeing to true understanding of our physical and social worlds.
  • Figure 2: An overview of VKnowU. Representative videos and QA pairs are shown for each of the 8 tasks.
  • Figure 3: QA filtering pipeline used to construct VKnowU, removing non-visual shortcuts and ensuring each QA requires visual knowledge.
  • Figure 4: Radar chart of MLLM accuracy on VKnowU.
  • Figure 5: Pearson correlation among the 8 visual knowledge tasks. Two clear clusters emerge: world-centric and human-centric.
  • ...and 22 more figures