Physics Knowledge in Frontier Models: A Diagnostic Study of Failure Modes

Ieva Bagdonaviciute, Vibhav Vineet

TL;DR

Benchmark scores can mask whether frontier vision-language models truly understand physical dynamics. This study introduces a diagnostic framework that separates perception grounding from physics reasoning and applies it to Physion, Physion++, and CLEVRER across six state-of-the-art models. Subtest mastery is only weakly linked to overall benchmark accuracy: models often succeed without proper grounding and struggle with dynamic reasoning, especially under counterfactual conditions. The work highlights the need for granular, structure-preserving benchmarks that jointly evaluate perception, causal inference, and counterfactual reasoning to drive more robust physical understanding in multimodal models.

Abstract

While recent Vision-Language Models (VLMs) have achieved impressive progress, it remains difficult to determine why they succeed or fail on complex reasoning tasks. Traditional benchmarks evaluate what models can answer correctly, not why they succeed or fail. In this work, we perform a failure-mode analysis of six frontier VLMs on three physics-based benchmarks - Physion, Physion++, and CLEVRER - by introducing custom subtests (for Physion and Physion++) and an integration of existing benchmark categories (for CLEVRER) to factor benchmark performance into distinct, testable capabilities. These subtests isolate perception (object, color, and occlusion recognition) and physics understanding (motion prediction and spatial reasoning), enabling us to test whether models attend to the correct entities and dynamics underlying their answers. Counterintuitively, subtest mastery correlates only weakly with benchmark accuracy: models often answer correctly without grounding in perception or physics. This suggests that current VLMs sometimes achieve benchmark scores for the wrong reasons, underscoring the need for diagnostics that expose hidden failure modes beyond aggregate metrics.
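
As a rough illustration of the diagnostic idea in the abstract, the sketch below pairs each benchmark answer with its subtest outcomes and computes two quantities: the share of correct answers that were reached without passing every subtest, and the Pearson correlation between subtest mastery and benchmark correctness. The record type `ItemResult`, the function names, and the toy records are all hypothetical and chosen only for illustration; they are not the paper's code, data, or reported results.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class ItemResult:
    """Hypothetical per-item record; field names are assumptions."""
    benchmark_correct: bool   # final benchmark answer was correct
    perception_correct: bool  # all perception subtests passed
    physics_correct: bool     # all physics subtests passed


def mastered(r: ItemResult) -> bool:
    """An item counts as mastered only if both subtest groups passed."""
    return r.perception_correct and r.physics_correct


def ungrounded_success_rate(results: list[ItemResult]) -> float:
    """Fraction of correct benchmark answers given without full subtest mastery."""
    correct = [r for r in results if r.benchmark_correct]
    if not correct:
        return 0.0
    return sum(not mastered(r) for r in correct) / len(correct)


def pearson(xs: list[float], ys: list[float]) -> float:
    """Plain Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


# Toy records, invented for illustration only (not results from the paper).
toy = [
    ItemResult(True, True, True),
    ItemResult(True, False, True),
    ItemResult(True, True, False),
    ItemResult(False, True, True),
    ItemResult(True, False, False),
]

rate = ungrounded_success_rate(toy)
r = pearson([float(mastered(t)) for t in toy],
            [float(t.benchmark_correct) for t in toy])
print(f"ungrounded success rate: {rate:.2f}")
print(f"mastery vs. accuracy correlation: {r:.3f}")
```

A high ungrounded success rate combined with a weak or negative correlation would be the kind of pattern the abstract describes as scoring well "for the wrong reasons". How the paper itself aggregates subtests into a mastery measure is not stated here, so the all-subtests-passed rule above is an assumption.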

Paper Structure

This paper contains 17 sections, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Illustrative examples of physical reasoning using Physion++. Left: a cone approaches a wall with a hole and passes through to reach the goal. Right: the same cone approaches a solid wall and is blocked.
  • Figure 2: General pipeline for evaluation and analysis.
  • Figure 3: Evaluation accuracy. Left: Physion/Physion++ under WP and WOP. Right: CLEVRER counterfactual.
  • Figure 4: Perception accuracy. Left: Physion/Physion++ (target, goal, latent). Right: CLEVRER (descriptive, explanatory).
  • Figure 5: Physics reasoning performance. Left: Physion/Physion++ (motion, spatial). Right: CLEVRER (predictive).
  • ...and 1 more figure