Physics Context Builders: A Modular Framework for Physical Reasoning in Vision-Language Models

Vahid Balazadeh, Mohammadmehdi Ataei, Hyunmin Cheong, Amir Hosein Khasahmadi, Rahul G. Krishnan

TL;DR

This work addresses the persistent challenge of physical reasoning in Vision-Language Models (VLMs) by proposing Physics Context Builders (PCBs): modular, simulation-trained perception modules that generate rich physical descriptions, which larger foundation models then use for in-context reasoning. PCBs separate perception from reasoning, so physical knowledge can be transferred efficiently without modifying the large models, and they use two description styles, Human Narration and Structured Physics, to enrich the context. Empirical results on CLEVRER and Falling Tower show that physics-focused fine-tuning yields substantial gains, that PCBs further boost performance when integrated with GPT-4o, GPT-4o-mini, and Gemini on both static and dynamic physical tasks, and that PCBs transfer well from simulation to real-world scenes (Sim2Real). The work also introduces a multi-agent PCB integration framework and runs ablations on data composition and QA type, highlighting both the potential and the current limitations of modular, simulation-driven approaches to physical reasoning in VLMs.

Abstract

Physical reasoning remains a significant challenge for Vision-Language Models (VLMs). This limitation arises from an inability to translate learned knowledge into predictions about physical behavior. Although continual fine-tuning can mitigate this issue, it is expensive for large models and impractical to perform repeatedly for every task. This necessitates the creation of modular and scalable ways to teach VLMs about physical reasoning. To that end, we introduce Physics Context Builders (PCBs), a modular framework where specialized smaller VLMs are fine-tuned to generate detailed physical scene descriptions. These can be used as physical contexts to enhance the reasoning capabilities of larger VLMs. PCBs enable the separation of visual perception from reasoning, allowing us to analyze their relative contributions to physical understanding. We perform experiments on CLEVRER and on Falling Tower, a stability detection dataset with both simulated and real-world scenes, to demonstrate that PCBs provide substantial performance improvements, increasing average accuracy by up to 13.8% on complex physical reasoning tasks. Notably, PCBs also show strong Sim2Real transfer, successfully generalizing from simulated training data to real-world scenes.

Paper Structure

This paper contains 39 sections, 7 figures, 12 tables.

Figures (7)

  • Figure 1: Physics Context Builders (PCBs) pipeline. The training phase (top) shows how physics simulators generate images/videos with corresponding annotations, which are converted into two types of physical descriptions: Human-like Narration (providing natural language scene descriptions with object properties and spatial relationships) and Structured Physics (offering frame-by-frame structured descriptions with physical properties). These descriptions serve as training data for fine-tuning a relatively small VLM into specialized PCBs. During the inference phase (bottom), a trained PCB processes a new image/video and generates detailed physical context about the scene, which is then provided to a foundation model (e.g., GPT-4o) alongside a user query to produce physically grounded responses.
  • Figure 2: Datasets used for experiments: Falling Tower representing static physics in simulated and real environments; CLEVRER representing dynamic physics in a simulated environment.
  • Figure 3: The amount of training data vs. performance on different QA types in Falling Tower.
  • Figure 4: Multi-agent framework for PCB integration. A triage agent selects the appropriate PCB based on the user query, and the PCB generates a detailed scene description that enriches the foundation model’s context.
  • Figure 5: Falling Tower dataset: (a) distribution of stacked objects in terms of their stability, and (b) distribution of question types, including both descriptive and stability categories.
  • ...and 2 more figures
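
The inference flow described in the captions of Figures 1 and 4 (a triage agent picks a PCB, the PCB describes the scene, and the description is passed to a foundation model together with the user query) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `PCB` class, the keyword-based `triage` rule, the prompt layout in `build_prompt`, and the stub describers are all hypothetical, and the foundation model is represented by an arbitrary callable rather than a real API client.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PCB:
    """Hypothetical wrapper around a small fine-tuned VLM (not the paper's API)."""
    name: str
    keywords: List[str]             # illustrative routing hints, assumed for this sketch
    describe: Callable[[str], str]  # maps a media path to a physical scene description


def triage(query: str, pcbs: List[PCB]) -> PCB:
    """Toy stand-in for the triage agent: pick the PCB with the most keyword hits.

    Ties resolve to the first PCB in the list.
    """
    q = query.lower()
    return max(pcbs, key=lambda p: sum(k in q for k in p.keywords))


def build_prompt(description: str, query: str) -> str:
    """Place the PCB output in front of the user query (assumed prompt layout)."""
    return "Physical context:\n" + description + "\n\nQuestion: " + query


def answer(media_path: str, query: str, pcbs: List[PCB],
           foundation_model: Callable[[str], str]) -> str:
    """Route the query, build the physical context, and call the larger model."""
    pcb = triage(query, pcbs)
    context = pcb.describe(media_path)
    return foundation_model(build_prompt(context, query))


# Stub PCBs with canned descriptions; a real PCB would run a fine-tuned VLM.
PCBS = [
    PCB("stability", ["stable", "fall", "tower"],
        lambda path: "Three cubes are stacked; the top cube is offset to the left."),
    PCB("dynamics", ["collide", "collision", "moving"],
        lambda path: "Frame 10: the red sphere moves right toward the blue cube."),
]


def echo_model(prompt: str) -> str:
    """Placeholder for a foundation model; simply returns the prompt it received."""
    return prompt
```

Calling `answer("scene.png", "Will the tower fall?", PCBS, echo_model)` routes to the stability PCB and returns the assembled prompt, which makes it easy to inspect what the larger model would see. In practice the triage step would itself be a model call rather than keyword matching.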