SIMPACT: Simulation-Enabled Action Planning using Vision-Language Models

Haowen Liu, Shaoxiong Yao, Haonan Chen, Jiawei Gao, Jiayuan Mao, Jia-Bin Huang, Yilun Du

TL;DR

SIMPACT addresses a gap in Vision-Language Models (VLMs): they lack a grounded understanding of physical dynamics, which manipulation requires. It builds a physics simulator at test time from a single RGB-D image and uses VLMs for action sampling, optimization, and success evaluation over simulated rollouts. The approach achieves state-of-the-art results on five real-world tasks involving rigid and deformable objects, outperforming existing general-purpose manipulation baselines. The work points to a promising direction for generalizable embodied intelligence: combining fast simulation construction with test-time, language-based planning.

Abstract

Vision-Language Models (VLMs) exhibit remarkable common-sense and semantic reasoning capabilities. However, they lack a grounded understanding of physical dynamics. This limitation arises from training VLMs on static internet-scale visual-language data that contain no causal interactions or action-conditioned changes. Consequently, it remains challenging to leverage VLMs for fine-grained robotic manipulation tasks that require physical understanding, reasoning, and corresponding action planning. To overcome this, we present SIMPACT, a test-time, SIMulation-enabled ACTion Planning framework that equips VLMs with physical reasoning through simulation-in-the-loop world modeling, without requiring any additional training. From a single RGB-D observation, SIMPACT efficiently constructs physics simulations, enabling the VLM to propose informed actions, observe simulated rollouts, and iteratively refine its reasoning. By integrating language reasoning with physics prediction, our simulation-enabled VLM can understand contact dynamics and action outcomes in a physically grounded way. Our method demonstrates state-of-the-art performance on five challenging, real-world rigid-body and deformable manipulation tasks that require fine-grained physical reasoning, outperforming existing general-purpose robotic manipulation models. Our results demonstrate that embedding physics understanding via efficient simulation into VLM reasoning at test time offers a promising path towards generalizable embodied intelligence. Project webpage can be found at https://simpact-bot.github.io
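The propose, simulate, and refine cycle described in the abstract can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: `toy_simulate` is a one-line stand-in for a physics rollout, `stub_sampler` and `stub_optimizer` are simple numeric stand-ins for the VLM calls, and the action is reduced to a single hypothetical push distance. It is not the authors' implementation.

```python
import random
from dataclasses import dataclass
from typing import List

GOAL = 0.30  # hypothetical target position in metres
TOL = 0.01   # hypothetical success tolerance in metres


@dataclass
class Rollout:
    action: float     # push distance that was tried
    final_pos: float  # simulated final object position
    success: bool     # whether the rollout ended within TOL of GOAL


def toy_simulate(action: float) -> Rollout:
    """Stand-in for a physics rollout: the object travels 0.8x the push."""
    final_pos = 0.8 * action
    return Rollout(action, final_pos, abs(final_pos - GOAL) <= TOL)


def stub_sampler(n: int, rng: random.Random) -> List[float]:
    """Stand-in for the VLM action sampler: n random initial proposals."""
    return [rng.uniform(0.0, 1.0) for _ in range(n)]


def closest(history: List[Rollout]) -> Rollout:
    """Return the rollout whose outcome is nearest to the goal."""
    return min(history, key=lambda r: abs(r.final_pos - GOAL))


def stub_optimizer(history: List[Rollout]) -> float:
    """Stand-in for the VLM optimizer: shift the best past action by its error."""
    best = closest(history)
    return best.action + (GOAL - best.final_pos)


def plan(n_samples: int = 3, max_iters: int = 20, seed: int = 0) -> Rollout:
    rng = random.Random(seed)
    # 1) Sample initial proposals and roll each one out in simulation.
    history = [toy_simulate(a) for a in stub_sampler(n_samples, rng)]
    for _ in range(max_iters):
        # 2) Stop as soon as some simulated rollout succeeds.
        if closest(history).success:
            break
        # 3) Otherwise refine using all past rollouts as context.
        history.append(toy_simulate(stub_optimizer(history)))
    return closest(history)


result = plan()
print(result.success, round(result.final_pos, 3))
```

In this toy setting each refinement shrinks the remaining error by a constant factor, so the loop ends after a few iterations. With a real simulator and a VLM in the loop, convergence is not guaranteed, which is why the sketch caps the number of iterations.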

Paper Structure

This paper contains 32 sections, 4 equations, 13 figures, 5 tables, 1 algorithm.

Figures (13)

  • Figure 1: Simulation-Enabled VLM Action Planning. Given a single RGB-D image and a language task description (left), our method efficiently constructs a physics simulator that enables test-time VLM reasoning with physical grounding. This physically grounded reasoning allows the robot to succeed in fine-grained manipulation tasks (bottom), outperforming a vanilla VLM planner (top) that lacks awareness of physical dynamics.
  • Figure 2: Simulation construction from a single RGB-D image. Given an RGB-D image and a language task description, our pipeline automatically generates either a mesh-based simulation (top) for rigid objects or a particle-based simulation (bottom) for deformables. After segmenting objects of interest via GroundedSAM2 [ren2025grounded2], we either reconstruct the 3D shape, scale, and pose of the object for rigid-body simulation, or densely sample particles within the volume between the object surface and the table for the particle-based simulation pipeline. In both cases, we prompt the VLM to infer the relevant physical parameters required for simulation.
  • Figure 3: Method overview. Our method begins by instantiating a physics simulator from the real-world scene. Next, a VLM-based action sampler and optimizer iteratively refine the action sequence towards task success using simulated rollouts as context. The final optimized actions are then executed in the real world.
  • Figure 4: Action optimization process. We show a representative example from the non-toppling push task. The left three images show simulation rollouts from the initial VLM-sampled action sequence proposals, all of which fail because the push is insufficient, the push overshoots, or the bottle topples. From these proposals, the VLM optimizer derives a non-trivial action update that pushes the bottle the correct distance without toppling, in both simulation and real-world execution.
  • Figure 5: Qualitative results. The figure shows the initial state, execution progress, and final state for three of our five tasks in both the real world and the simulation. By leveraging the VLM's strong generalization, rendered simulation images can guide the VLM's test-time reasoning for action planning despite the visual sim2real gap. Please refer to our supplementary material for the remaining tasks.
  • ...and 8 more figures
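
The particle-based branch in the Figure 2 caption samples particles in the volume between the observed object surface and the table. A minimal sketch of one way to do this is shown below, assuming the surface is already available as a list of 3D points above a flat table at height `table_z`. The column-fill strategy, the function name `fill_particles`, and the `spacing` parameter are hypothetical choices for illustration; the caption does not specify the actual sampling scheme.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float, float]


def fill_particles(surface: List[Point], table_z: float, spacing: float) -> List[Point]:
    """Stack particles in a vertical column under each observed surface point.

    Particles are placed at table_z, table_z + spacing, ... and never above
    the surface point itself. Points at or below the table are skipped.
    """
    particles: List[Point] = []
    for x, y, z in surface:
        height = z - table_z
        if height <= 0.0:
            continue  # nothing to fill between the table and this point
        # Number of whole spacing steps that fit under the surface; the small
        # epsilon guards against floating-point round-off in the division.
        steps = int(math.floor(height / spacing + 1e-9))
        for k in range(steps + 1):
            particles.append((x, y, table_z + k * spacing))
    return particles


# Hypothetical surface points (metres): two above the table, one on it.
surface_points = [(0.00, 0.0, 0.05), (0.02, 0.0, 0.03), (0.04, 0.0, 0.00)]
particles = fill_particles(surface_points, table_z=0.0, spacing=0.02)
print(len(particles))
```

A regular vertical spacing keeps the sketch easy to check by hand. A real pipeline would also need to choose the horizontal sampling density and handle the noise in a single depth view, neither of which is covered here.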