VKnowU: Evaluating Visual Knowledge Understanding in Multimodal LLMs
Tianxiang Jiang, Sheng Xia, Yicheng Xu, Linquan Wu, Xiangyu Zeng, Limin Wang, Yu Qiao, Yi Wang
TL;DR
VKnowU introduces a video-based benchmark to quantify visual knowledge understanding in multimodal LLMs, splitting eight tasks into world-centric and human-centric domains. It shows current SOTA MLLMs still lag behind human performance, especially on world-centric knowledge, highlighting the need for richer world models. To address this, the authors propose VideoKnow+, a baseline that integrates visual knowledge via a See-Think-Answer framework and a visual knowledge reward, trained with VKnowQA to improve grounding and generalization. The work demonstrates that explicit visual-knowledge grounding can boost performance on VKnowU and other video benchmarks, suggesting a key direction toward more generalizable, reasoning-capable MLLMs. The findings advocate for broader inclusion of visual knowledge in model design and evaluation to better bridge perception and real-world reasoning.
Abstract
While Multimodal Large Language Models (MLLMs) have become adept at recognizing objects, they often lack the intuitive, human-like understanding of the world's underlying physical and social principles. This high-level vision-grounded semantics, which we term visual knowledge, forms a bridge between perception and reasoning, yet remains an underexplored area in current MLLMs. To systematically evaluate this capability, we present VKnowU, a comprehensive benchmark featuring 1,680 questions in 1,249 videos, covering 8 core types of visual knowledge spanning both world-centric (e.g., intuitive physics) and human-centric (e.g., subjective intentions). Evaluation of 23 SOTA MLLMs reveals that leading models still fall short of human performance, with particularly notable gaps in the world-centric. To bridge this gap, we introduce a new dataset, VKnowQA, and VideoKnow+, a baseline model that explicitly incorporates visual knowledge into MLLMs. VideoKnow+ follows a structured See-Think-Answer paradigm and adopts reinforcement learning with visual knowledge reward, achieving a +3.7% improvement on VKnowU and consistent gains on MVBench, Video-MME, and MMVU. Our work highlights visual knowledge as a missing cornerstone for developing more generalizable MLLMs that can not only see but also truly understand our physical and social worlds.
