Exploring Failure Cases in Multimodal Reasoning About Physical Dynamics
Sadaf Ghaffari, Nikhil Krishnaswamy
TL;DR
This work examines how large language models handle physical dynamics in situated environments. It shows that zero-shot LLM outputs often draw on atomic object knowledge but fail to respect environmental physics and object properties. Using an open, controlled simulation built on VoxWorld, the authors systematically evaluate text-only and multimodal models (including BLIP) on object selection and stability, and find consistent grounding gaps. They introduce an exploration-driven framework that uses interaction with objects and VoxML-based affordances to discover physically grounded configurations. They then propose a distillation approach that transfers this grounded knowledge back into the LLM through attention and embedding losses aligned with simulation-derived rewards. The study highlights limits of current LLMs for causal physical reasoning and outlines a pathway for grounding language models in physical laws, with implications for embodied AI and robotics planning.
Abstract
In this paper, we explore the ability of LLMs to solve problems that require physical reasoning in situated environments. We construct a simple simulated environment and show examples where, in a zero-shot setting, both text and multimodal LLMs display atomic world knowledge about various objects but fail to compose this knowledge into correct solutions for an object manipulation and placement task. We also use BLIP, a vision-language model trained with more sophisticated cross-modal attention, to identify cases involving physical object properties that the model fails to ground. Finally, we present a procedure for discovering the relevant properties of objects in the environment and propose a method to distill this knowledge back into the LLM.
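To make the distillation idea concrete, the following is a minimal sketch of how an attention loss and an embedding loss might be combined and weighted by a simulation-derived reward. It is an illustration under stated assumptions, not the authors' formulation: the function names, the choice of KL divergence for the attention term and mean squared error for the embedding term, the mixing weight `alpha`, and the multiplicative use of `reward` are all hypothetical.

```python
import math
from typing import Sequence


def mse(student: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length embedding vectors."""
    if len(student) != len(target):
        raise ValueError("embedding vectors must have the same length")
    return sum((s - t) ** 2 for s, t in zip(student, target)) / len(student)


def kl_divergence(target_attn: Sequence[float],
                  student_attn: Sequence[float],
                  eps: float = 1e-12) -> float:
    """KL(target || student) over two attention distributions.

    Both inputs are assumed to be probability distributions (non-negative,
    summing to 1) over the same set of tokens or objects.
    """
    if len(target_attn) != len(student_attn):
        raise ValueError("attention vectors must have the same length")
    return sum(
        p * math.log((p + eps) / (q + eps))
        for p, q in zip(target_attn, student_attn)
        if p > 0.0
    )


def distillation_loss(student_emb: Sequence[float],
                      target_emb: Sequence[float],
                      student_attn: Sequence[float],
                      target_attn: Sequence[float],
                      reward: float,
                      alpha: float = 0.5) -> float:
    """Hypothetical combined loss: reward-weighted mix of both terms.

    `reward` stands in for a simulation-derived score (assumed here to be
    in [0, 1]) indicating how physically stable the grounded target was.
    """
    attn_term = kl_divergence(target_attn, student_attn)
    emb_term = mse(student_emb, target_emb)
    return reward * (alpha * attn_term + (1.0 - alpha) * emb_term)


if __name__ == "__main__":
    # Toy values only; they do not come from the paper.
    loss = distillation_loss(
        student_emb=[1.0, 0.0],
        target_emb=[0.0, 0.0],
        student_attn=[0.9, 0.1],
        target_attn=[0.5, 0.5],
        reward=1.0,
    )
    print(f"loss = {loss:.4f}")
```

In this sketch, a target configuration that the simulation scores as unstable (reward near 0) contributes little to the loss, so the student is pulled mainly toward configurations that held up physically. Whether the reward should scale the loss, gate examples, or enter in some other way is a design choice this example does not settle.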
