Exploring Failure Cases in Multimodal Reasoning About Physical Dynamics

Sadaf Ghaffari, Nikhil Krishnaswamy

TL;DR

This work interrogates how large language models handle physical dynamics in situated environments, showing that zero-shot LLM outputs often rely on atomic object knowledge but fail to respect environmental physics and object properties. By building a VoxWorld-based open and controlled simulation, the authors systematically evaluate text-only and multimodal models (including BLIP) on object selection and stability, revealing consistent grounding gaps. They introduce an exploration-driven framework that uses interaction with objects and VoxML-based affordances to discover physically grounded configurations, then propose a distillation approach to transfer this grounded knowledge back into the LLM via attention and embedding losses, aligned with simulation-derived rewards. The study highlights fundamental limits of current LLMs for causal physical reasoning and outlines a concrete, scalable pathway to ground language models in physical laws for robust, situated reasoning, with broader implications for embodied AI and robotics planning.

Abstract

In this paper, we present an exploration of LLMs' abilities to solve problems using physical reasoning in situated environments. We construct a simple simulated environment and demonstrate examples where, in a zero-shot setting, both text and multimodal LLMs display atomic world knowledge about various objects but fail to compose this knowledge into correct solutions for an object manipulation and placement task. We also use BLIP, a vision-language model trained with more sophisticated cross-modal attention, to identify cases involving object physical properties that the model fails to ground. Finally, we present a procedure for discovering the relevant properties of objects in the environment and propose a method to distill this knowledge back into the LLM.

Paper Structure (13 sections, 3 equations, 10 figures, 2 tables)

Figures (10)

  • Figure 1: Example of physical reasoning prompt and response, and operationalizations of solutions as given by ChatGPT [L], LLaMA 2-7B (Touvron et al., 2023) [C], and LLaVA (Liu et al., 2023) [R].
  • Figure 2: A feasible, physically stable solution to the platform reach problem that uses the cylinder in an orientation that exploits the stability affordances of its flat ends.
  • Figure 3: Examples of controls placed on the visual input to LLaVA.
  • Figure 4: Per-word BLIP visual grounding for "blue cylinder near wall" (top) and "blue cube behind cylinder" (bottom).
  • Figure 5: Per-word BLIP visual grounding for "cylinder lying on its flat side" [L] and "cylinder lying on its round side" [R].
  • ...and 5 more figures