
HOCA-Bench: Beyond Semantic Perception to Predictive World Modeling via Hegelian Ontological-Causal Anomalies

Chang Liu, Yunfan Ye, Qingyang Zhou, Xichen Tan, Mengxuan Luo, Zhenyu Qiu, Wei Peng, Zhiping Cai

TL;DR

HOCA-Bench introduces a Hegelian framework for predictive world modeling that separates anomalies into Ontological (Being) and Causal (Essence) categories. Using generative video models as adversarial simulators, it builds a testbed of 1,439 videos with 3,470 QA pairs to probe Video-LLMs on existence maintenance and physical-law reasoning. Across 17 models, performance is strong on ontological tasks but weaker on causal reasoning, and System-2 thinking narrows the gap without eliminating it. The study reports a holistic metric, the H-Index, defined as $H = \frac{1}{4}(S_{I}+S_{II}+S_{III}+S_{IV})$, the unweighted mean of the four scores $S_{I}$ through $S_{IV}$, and uses it to show scaling and architecture effects. The work provides a diagnostic toolkit for advancing physical world modeling in video-language systems and outlines directions for closing the gap between pattern recognition and physically grounded inference.
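
As a minimal sketch of the H-Index formula above, the following Python computes the unweighted mean of four scores. The function name `h_index`, the dictionary keys, and the 0-100 score scale are assumptions made for illustration; the source states only the formula, and the example values are invented, not results from the paper.

```python
from statistics import fmean
from typing import Mapping

# Hypothetical keys for the four scores S_I .. S_IV in the H-Index formula.
TASKS = ("I", "II", "III", "IV")


def h_index(scores: Mapping[str, float]) -> float:
    """Return H = (S_I + S_II + S_III + S_IV) / 4 for one model."""
    missing = [task for task in TASKS if task not in scores]
    if missing:
        raise ValueError(f"missing task scores: {missing}")
    # Unweighted arithmetic mean over exactly the four task scores.
    return fmean(scores[task] for task in TASKS)


# Made-up scores on an assumed 0-100 scale, for illustration only.
example_scores = {"I": 80.0, "II": 60.0, "III": 40.0, "IV": 20.0}
print(h_index(example_scores))  # 50.0
```

Because the mean is unweighted, a model that is strong on one score and weak on another can reach the same H as a uniformly mediocre model, so the four per-score values remain worth reading alongside H.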

Abstract

Video-LLMs have improved steadily on semantic perception, but they still fall short on predictive world modeling, which is central to physically grounded intelligence. We introduce HOCA-Bench, a benchmark that frames physical anomalies through a Hegelian lens. HOCA-Bench separates anomalies into two types: ontological anomalies, where an entity violates its own definition or persistence, and causal anomalies, where interactions violate physical relations. Using state-of-the-art generative video models as adversarial simulators, we build a testbed of 1,439 videos (3,470 QA pairs). Evaluations on 17 Video-LLMs show a clear cognitive lag: models often identify static ontological violations (e.g., shape mutations) but struggle with causal mechanisms (e.g., gravity or friction), with performance dropping by more than 20% on causal tasks. System-2 "Thinking" modes improve reasoning, but they do not close the gap, suggesting that current architectures recognize visual patterns more readily than they apply basic physical laws.
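
One way to quantify the ontological-versus-causal gap described above is to compute per-category accuracy over QA results and take the difference. The sketch below is an assumption-laden illustration: the record fields `category` and `correct` are hypothetical, the toy records are invented, and the gap is measured in absolute percentage points, whereas the abstract does not say whether its "more than 20%" is absolute or relative.

```python
from typing import Dict, List, Union

Record = Dict[str, Union[str, bool]]


def accuracy(records: List[Record], category: str) -> float:
    """Percentage of correctly answered QA pairs within one anomaly category."""
    hits = [bool(r["correct"]) for r in records if r["category"] == category]
    if not hits:
        raise ValueError(f"no records for category {category!r}")
    return 100.0 * sum(hits) / len(hits)


def cognitive_gap(records: List[Record]) -> float:
    """Ontological accuracy minus causal accuracy, in percentage points."""
    return accuracy(records, "ontological") - accuracy(records, "causal")


# Invented toy results (not data from the paper): 3 of 4 ontological
# answers correct, 2 of 4 causal answers correct.
toy_records: List[Record] = [
    {"category": "ontological", "correct": True},
    {"category": "ontological", "correct": True},
    {"category": "ontological", "correct": True},
    {"category": "ontological", "correct": False},
    {"category": "causal", "correct": True},
    {"category": "causal", "correct": True},
    {"category": "causal", "correct": False},
    {"category": "causal", "correct": False},
]
print(cognitive_gap(toy_records))  # 25.0
```

A positive value means the model does better on ontological anomalies than on causal ones, which is the direction of the gap the abstract reports.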

Paper Structure (41 sections, 1 equation, 10 figures, 6 tables)

This paper contains 41 sections, 1 equation, 10 figures, 6 tables.

Figures (10)

  • Figure 1: The HOCA-Bench Taxonomy and Hegelian Logic Layers. We categorize physical anomalies into two fundamental logic layers: (Left) Ontological Anomalies (Collapse of Identity) represent violations of an entity's inherent definition, such as a three-headed sheep, a red crystal tortoise shell, or the biological impossibility of a reverse-bending elbow. (Right) Causal Anomalies (Violation of Relations) denote failures in the interaction logic between objects, including cats clipping through solids, incongruent shadow projections, or chemical inconsistencies such as fire without burn marks. HOCA-Bench evaluates these dimensions across spatial and temporal axes to audit predictive world modeling.
  • Figure 2: Visualization of the HOCA-Bench annotation pipeline. The process begins with Source Collection (Generative/Real videos), followed by a Coarse-to-Fine Analysis where VLMs generate dense captions and LLMs aggregate physical logic. Finally, anomalies are mapped to the Hegelian Taxonomy and undergo rigorous Human Verification to ensure high-quality grounding.
  • Figure 3: Structured Task Design in HOCA-Bench. The benchmark evaluates physical understanding through four progressive tasks: (Task I) Binary Plausibility Check, (Task II) Domain Categorization, (Task III) Fine-grained Anomaly Description, and (Task IV) Open-ended Counterfactual Reasoning. This structure probes model capability from coarse perception to deep causal inference.
  • Figure 4: Qualitative Case Study of the "Cognitive Lag." In this video, a coffee machine dispenses liquid but the mug's level remains constant. (Top-Down): Lightweight models (InternVL-3.5-2B) are "physically blind" to the violation. Standard Reasoning (Qwen3-VL-32B, GLM-4.6V-106B) often succumbs to hallucinations, inventing "upward flow" or "overflow" to reconcile the visual contradiction. Only Specialized Reasoning (GLM-4.6V-106B(T), Gemini-2.5-flash) successfully invokes mass conservation to identify the non-rising level.
  • Figure 5: Comparison of H-Index across VLM families. The results highlight the impact of model scaling and architectural iterations (e.g., from InternVL 2.5 to 3.5) on physical world modeling.
  • ...and 5 more figures