SIMPACT: Simulation-Enabled Action Planning using Vision-Language Models

Haowen Liu; Shaoxiong Yao; Haonan Chen; Jiawei Gao; Jiayuan Mao; Jia-Bin Huang; Yilun Du

TL;DR

SIMPACT addresses a gap in Vision-Language Models: they lack a grounded understanding of physical dynamics, which manipulation requires. It builds a physics simulator at test time from a single RGB-D image and uses VLMs for sampling, optimization, and success evaluation within simulated rollouts to plan actions. The approach achieves state-of-the-art results on five real-world tasks involving rigid and deformable objects, outperforming baselines and showing the value of simulation-augmented VLM reasoning. The work points to a promising direction for generalizable embodied intelligence: combining fast simulation construction with test-time, language-based planning.

Abstract

Vision-Language Models (VLMs) exhibit remarkable common-sense and semantic reasoning capabilities. However, they lack a grounded understanding of physical dynamics. This limitation arises from training VLMs on static internet-scale visual-language data that contain no causal interactions or action-conditioned changes. Consequently, it remains challenging to leverage VLMs for fine-grained robotic manipulation tasks that require physical understanding, reasoning, and corresponding action planning. To overcome this, we present SIMPACT, a test-time, SIMulation-enabled ACTion Planning framework that equips VLMs with physical reasoning through simulation-in-the-loop world modeling, without requiring any additional training. From a single RGB-D observation, SIMPACT efficiently constructs physics simulations, enabling the VLM to propose informed actions, observe simulated rollouts, and iteratively refine its reasoning. By integrating language reasoning with physics prediction, our simulation-enabled VLM can understand contact dynamics and action outcomes in a physically grounded way. Our method demonstrates state-of-the-art performance on five challenging, real-world rigid-body and deformable manipulation tasks that require fine-grained physical reasoning, outperforming existing general-purpose robotic manipulation models. Our results demonstrate that embedding physics understanding via efficient simulation into VLM reasoning at test time offers a promising path towards generalizable embodied intelligence. The project webpage is available at https://simpact-bot.github.io
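
The propose, simulate, evaluate, and refine cycle described above can be illustrated with a toy sketch. This is not the authors' implementation: every name here (`Rollout`, `simulate`, `propose_actions`, `is_success`, `plan`) is hypothetical, the "simulator" is a one-line friction model for a 1-D push, and the VLM's roles are replaced by simple deterministic stand-ins (a grid of candidate actions, a tolerance check, and a shrinking search window). It is meant only to show the shape of a simulation-in-the-loop planning loop.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    action: float     # hypothetical 1-D push magnitude
    final_pos: float  # simulated resting position of the object


def simulate(action: float, friction: float = 0.5) -> Rollout:
    # Toy stand-in for a physics simulator: friction scales down the push.
    return Rollout(action, action * (1.0 - friction))


def propose_actions(center: float, spread: float, n: int) -> list[float]:
    # Stand-in for VLM action sampling: an evenly spaced grid around `center`.
    step = 2.0 * spread / (n - 1)
    return [center - spread + i * step for i in range(n)]


def is_success(rollout: Rollout, goal: float, tol: float) -> bool:
    # Stand-in for VLM success evaluation on a simulated rollout.
    return abs(rollout.final_pos - goal) <= tol


def plan(goal: float, tol: float = 0.01, rounds: int = 10, n: int = 9) -> Rollout:
    center, spread = 0.0, 2.0  # assumed initial search window for the action
    best = None
    for _ in range(rounds):
        # Roll out every candidate action in the (toy) simulator.
        rollouts = [simulate(a) for a in propose_actions(center, spread, n)]
        candidate = min(rollouts, key=lambda r: abs(r.final_pos - goal))
        if best is None or abs(candidate.final_pos - goal) < abs(best.final_pos - goal):
            best = candidate
        if is_success(best, goal, tol):
            break
        # Stand-in for refinement: narrow the search around the best action so far.
        center, spread = best.action, spread / 2.0
    return best


if __name__ == "__main__":
    result = plan(goal=0.3)
    print(result)
```

In this sketch, no action is executed on a real system until a simulated rollout passes the success check, which mirrors the idea of testing candidate actions in simulation before committing to one.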
