Physics Context Builders: A Modular Framework for Physical Reasoning in Vision-Language Models

Vahid Balazadeh; Mohammadmehdi Ataei; Hyunmin Cheong; Amir Hosein Khasahmadi; Rahul G. Krishnan

TL;DR

This work addresses the persistent difficulty of physical reasoning in Vision-Language Models (VLMs) by proposing Physics Context Builders (PCBs): modular, simulation-trained perception modules that generate rich physical descriptions, which larger foundation models then use for in-context reasoning. PCBs separate perception from reasoning, so physical knowledge can be transferred without modifying the large models, and they use two description styles, Human Narration and Structured Physics, to enrich the context. Empirical results on CLEVRER and Falling Tower show that physics-focused fine-tuning yields substantial gains, that PCBs further boost performance when integrated with GPT-4o, GPT-4o-mini, and Gemini on both static and dynamic physical tasks, and that PCBs transfer well from simulation to real scenes (Sim2Real). The work also introduces a multi-agent PCB integration framework and reports ablations on data composition and question-answer type, highlighting both the potential and the current limitations of modular, simulation-driven approaches to physical reasoning in VLMs.

Abstract

Physical reasoning remains a significant challenge for Vision-Language Models (VLMs). This limitation arises from an inability to translate learned knowledge into predictions about physical behavior. Although continual fine-tuning can mitigate this issue, it is expensive for large models and impractical to perform repeatedly for every task. This necessitates the creation of modular and scalable ways to teach VLMs about physical reasoning. To that end, we introduce Physics Context Builders (PCBs), a modular framework where specialized smaller VLMs are fine-tuned to generate detailed physical scene descriptions. These can be used as physical contexts to enhance the reasoning capabilities of larger VLMs. PCBs enable the separation of visual perception from reasoning, allowing us to analyze their relative contributions to physical understanding. We perform experiments on CLEVRER and on Falling Tower, a stability detection dataset with both simulated and real-world scenes, to demonstrate that PCBs provide substantial performance improvements, increasing average accuracy by up to 13.8% on complex physical reasoning tasks. Notably, PCBs also show strong Sim2Real transfer, successfully generalizing from simulated training data to real-world scenes.
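The separation of perception from reasoning described above can be illustrated with a minimal sketch. It is not the authors' implementation: the two models are represented by placeholder callables, and the names `ContextBuilder`, `Reasoner`, `build_prompt`, and `answer_with_context`, as well as the prompt layout, are assumptions made for illustration. Whether the larger model also receives the image directly is left out here.

```python
from typing import Callable

# Hypothetical interfaces: both models are stand-in callables, not a real API.
ContextBuilder = Callable[[str], str]  # scene reference (e.g. image path) -> physical description
Reasoner = Callable[[str], str]        # text prompt -> answer


def build_prompt(description: str, question: str) -> str:
    """Place the generated physical description ahead of the question."""
    return (
        "Physical context:\n"
        f"{description}\n\n"
        f"Question: {question}"
    )


def answer_with_context(
    scene: str,
    question: str,
    builder: ContextBuilder,
    reasoner: Reasoner,
) -> str:
    """Perception step (small model) followed by reasoning step (large model)."""
    description = builder(scene)                  # small fine-tuned model describes the scene
    prompt = build_prompt(description, question)  # description becomes in-context information
    return reasoner(prompt)                       # larger model answers; its weights are untouched
```

Because the builder and the reasoner are only coupled through text, either one can be swapped without retraining the other, which is the modularity the abstract emphasizes.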
