Physics Knowledge in Frontier Models: A Diagnostic Study of Failure Modes
Ieva Bagdonaviciute, Vibhav Vineet
TL;DR
This study tackles the problem that benchmark scores mask whether frontier vision-language models truly understand physical dynamics. It introduces a diagnostic framework that separates perception grounding from physics reasoning and applies it to Physion, Physion++, and CLEVRER across six state-of-the-art models. The findings show only a weak link between subtest mastery and overall benchmark accuracy: models often succeed without proper grounding and struggle with dynamic reasoning, especially under counterfactual conditions. The work highlights the need for granular, structure-preserving benchmarks that jointly evaluate perception, causal inference, and counterfactual reasoning to drive more robust physical understanding in multimodal models.
Abstract
While recent Vision-Language Models (VLMs) have achieved impressive progress, it remains difficult to determine why they succeed or fail on complex reasoning tasks. Traditional benchmarks measure what models answer correctly, not the reasons behind those answers. In this work, we perform a failure-mode analysis of six frontier VLMs on three physics-based benchmarks (Physion, Physion++, and CLEVRER) by introducing custom subtests (for Physion and Physion++) and an integration of existing benchmark categories (for CLEVRER) to factor benchmark performance into distinct, testable capabilities. These subtests isolate perception (object, color, and occlusion recognition) and physics understanding (motion prediction and spatial reasoning), enabling us to test whether models attend to the correct entities and dynamics underlying their answers. Counterintuitively, subtest mastery correlates only weakly with benchmark accuracy: models often answer correctly without grounding in perception or physics. This suggests that current VLMs sometimes achieve benchmark scores for the wrong reasons, underscoring the need for diagnostics that expose hidden failure modes beyond aggregate metrics.
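To make the diagnostic idea concrete, the following is a minimal sketch of how subtest outcomes might be compared against benchmark outcomes for a single model. It is an illustration under stated assumptions, not the procedure used in this study: the record layout, the function names `pearson` and `diagnose`, the use of a Pearson correlation, and the definition of an "ungrounded" correct answer (a correct benchmark answer with at least one failed subtest) are all hypothetical choices made for the example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant sequence
    return cov / sqrt(var_x * var_y)


def diagnose(records):
    """Summarize how benchmark correctness relates to subtest mastery.

    Each record is a dict (hypothetical layout) with:
      "benchmark_correct": bool, whether the main benchmark question was answered correctly
      "subtests": dict mapping a subtest name to a bool pass/fail outcome
    """
    bench = [1.0 if r["benchmark_correct"] else 0.0 for r in records]
    # Per-item mastery: fraction of that item's subtests the model passed.
    mastery = [sum(r["subtests"].values()) / len(r["subtests"]) for r in records]

    correct = [r for r in records if r["benchmark_correct"]]
    # Assumption: a correct answer counts as "ungrounded" if any subtest failed.
    ungrounded = [r for r in correct if not all(r["subtests"].values())]

    return {
        "benchmark_accuracy": sum(bench) / len(bench),
        "mean_subtest_mastery": sum(mastery) / len(mastery),
        "mastery_accuracy_correlation": pearson(mastery, bench),
        "ungrounded_correct_rate": (
            len(ungrounded) / len(correct) if correct else float("nan")
        ),
    }
```

Under these assumptions, a correlation near zero combined with a high ungrounded-correct rate would correspond to the pattern described above, where aggregate accuracy says little about whether the underlying perception and physics subtests were passed.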
