HOCA-Bench: Beyond Semantic Perception to Predictive World Modeling via Hegelian Ontological-Causal Anomalies
Chang Liu, Yunfan Ye, Qingyang Zhou, Xichen Tan, Mengxuan Luo, Zhenyu Qiu, Wei Peng, Zhiping Cai
TL;DR
HOCA-Bench introduces a Hegelian framework for predictive world modeling that separates physical anomalies into Ontological (Being) and Causal (Essence) categories. Using generative video models as adversarial simulators, it builds a testbed of 1,439 videos with 3,470 QA pairs to probe Video-LLMs on existence maintenance and physical-law reasoning. Across 17 models, performance is strong on ontological tasks but weaker on causal reasoning, and System-2 thinking narrows the gap without eliminating it. The study formalizes this with the H-Index, a holistic metric defined as $H = \frac{1}{4}(S_{I}+S_{II}+S_{III}+S_{IV})$, the unweighted mean of four sub-scores, and demonstrates scaling and architecture effects. The work provides a diagnostic toolkit for advancing physical world modeling in video-language systems and outlines directions for closing the gap between pattern recognition and physically grounded inference.
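As a minimal sketch of the H-Index arithmetic, the snippet below computes the unweighted mean of four sub-scores. The function name `h_index`, its parameter names, and the example values are assumptions made for illustration; the summary does not define what each of $S_{I}$ through $S_{IV}$ measures, and the numbers are not results from the paper.

```python
def h_index(s1: float, s2: float, s3: float, s4: float) -> float:
    """Return H = (S_I + S_II + S_III + S_IV) / 4.

    The four arguments stand in for the sub-scores S_I..S_IV.
    Their exact definitions are not given in this summary.
    """
    return (s1 + s2 + s3 + s4) / 4.0


# Hypothetical sub-scores, for illustration only.
example = h_index(80.0, 70.0, 60.0, 50.0)
print(example)  # 65.0
```

Because the four terms are weighted equally, a model that scores well on only some of them is pulled down by the others, which fits the description of $H$ as a holistic metric.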
Abstract
Video-LLMs have improved steadily on semantic perception, but they still fall short on predictive world modeling, which is central to physically grounded intelligence. We introduce HOCA-Bench, a benchmark that frames physical anomalies through a Hegelian lens. HOCA-Bench separates anomalies into two types: ontological anomalies, where an entity violates its own definition or persistence, and causal anomalies, where interactions violate physical relations. Using state-of-the-art generative video models as adversarial simulators, we build a testbed of 1,439 videos (3,470 QA pairs). Evaluations on 17 Video-LLMs show a clear cognitive lag: models often identify static ontological violations (e.g., shape mutations) but struggle with causal mechanisms (e.g., gravity or friction), with performance dropping by more than 20% on causal tasks. System-2 "Thinking" modes improve reasoning, but they do not close the gap, suggesting that current architectures recognize visual patterns more readily than they apply basic physical laws.
