InPhyRe Discovers: Large Multimodal Models Struggle in Inductive Physical Reasoning

Gautam Sreekumar, Vishnu Naresh Boddeti

Abstract

Large multimodal models (LMMs) encode physical laws observed during training, such as momentum conservation, as parametric knowledge. This knowledge allows LMMs to answer physical reasoning queries, such as predicting the outcome of a potential collision event from visual input. However, since parametric knowledge includes only the physical laws seen during training, it is insufficient for reasoning in inference scenarios that follow physical laws unseen during training. In such novel physical environments, humans can adapt their physical reasoning based on provided demonstrations. This inductive physical reasoning ability is indispensable for LMMs if they are to replace human agents in safety-critical applications. Despite its importance, existing visual benchmarks do not evaluate inductive physical reasoning and only consider the parametric knowledge in LMMs. To this end, we propose InPhyRe, the first visual question answering benchmark to measure inductive physical reasoning in LMMs. InPhyRe evaluates LMMs' ability to predict the outcome of collision events in algorithmically generated synthetic videos. By inspecting over 13 open-source and proprietary LMMs, InPhyRe informs us that (1) LMMs struggle to apply their limited parametric knowledge about universal physical laws to reasoning, (2) inductive physical reasoning in LMMs is weak when the physical laws underlying inference scenarios were unseen during training, and (3) inductive physical reasoning in LMMs suffers from language bias and may ignore the visual inputs, which calls into question the trustworthiness of LMMs regarding visual inputs.

Paper Structure

This paper contains 41 sections, 1 equation, 27 figures, and 9 tables.

Figures (27)

  • Figure 1: (Left) A large multimodal model (LMM) is asked to predict the change in vertical velocity of an object colliding with a vertical wall. The model will output "possibility 2" if it uses its parametric knowledge that encodes the universal physical laws (in this case, the momentum conservation principle). However, parametric knowledge would be insufficient if the collision event violated the physical laws encoded in the model. For the model to infer the underlying physical laws, we provide the model with exemplar videos of collisions that violate the momentum conservation principle. The model may now rely on its inductive physical reasoning capabilities to generate "possibility 1". (Right) InPhyRe shows that LMMs struggle with inductive physical reasoning.
  • Figure 2: InPhyRe comprises videos ("visual inputs") of collision events that violate a real-world physical law ("violation"). LMMs must predict state changes in objects due to the collisions, while accounting for the violated physical law ("task"). The videos are grouped into "scenarios", which are further grouped into three categories based on the nature of the physical law they violate. Arrows indicate object motion and are not part of the actual images in the dataset.
  • Figure 3: We initialize the object states in PyBullet. When a collision occurs during the simulation, we intervene and manually adjust the objects' states such that the resulting trajectory violates some real-world physical law. The object trajectories are then used by Blender to render the final video.
  • Figure 4: Difference in 3-shot accuracy of LMMs between irregular and regular scenarios when exemplars contain both videos and QA pairs.
  • Figure 5: Change in 3-shot accuracy in irregular scenarios between video-only and video-text settings.
  • ...and 22 more figures
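
The intervention step described in the Figure 3 caption can be illustrated with a small sketch. This is not the authors' pipeline: it uses neither PyBullet nor Blender, and it assumes a point object, no gravity, and a single vertical wall. The names State, simulate, and flip_vertical are hypothetical, and the specific violation (reversing the vertical velocity on impact) is chosen only for illustration. The sketch shows the idea of stepping a simulation, detecting a collision, and overriding the post-collision state so that the resulting trajectory breaks a real-world law.

```python
from dataclasses import dataclass


@dataclass
class State:
    """Position and velocity of a point object (hypothetical helper)."""
    x: float   # horizontal position
    y: float   # vertical position
    vx: float  # horizontal velocity
    vy: float  # vertical velocity


def simulate(state, wall_x, steps, dt, intervene=None):
    """Step the object toward a vertical wall located at x = wall_x.

    On contact, apply the regular elastic response (horizontal velocity
    flips, vertical velocity is unchanged) unless an `intervene` callback
    is given, in which case the callback decides the new velocity.
    Returns the final state and the list of (x, y) positions.
    """
    trajectory = [(state.x, state.y)]
    for _ in range(steps):
        state.x += state.vx * dt
        state.y += state.vy * dt
        # Collision check: the object reached the wall while moving toward it.
        if state.x >= wall_x and state.vx > 0:
            state.x = wall_x
            if intervene is None:
                state.vx = -state.vx
            else:
                # Assumed intervention point: overwrite the post-collision state.
                state.vx, state.vy = intervene(state.vx, state.vy)
        trajectory.append((state.x, state.y))
    return state, trajectory


def flip_vertical(vx, vy):
    """Hypothetical irregular rule: the wall also reverses vertical velocity."""
    return -vx, -vy


regular, regular_path = simulate(State(0.0, 0.0, 1.0, 0.5), 1.0, 20, 0.1)
irregular, irregular_path = simulate(
    State(0.0, 0.0, 1.0, 0.5), 1.0, 20, 0.1, intervene=flip_vertical
)
print("regular vy after collision:  ", regular.vy)
print("irregular vy after collision:", irregular.vy)
```

In the regular run the vertical velocity is preserved across the collision, while the irregular run reverses it. In a setup like the one the caption describes, the recorded positions would then be handed to a renderer to produce the video.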