ICPRL: Acquiring Physical Intuition from Interactive Control

Xinrun Xu, Pi Bu, Ye Wang, Börje F. Karlsson, Ziming Wang, Tengtao Song, Qi Zhu, Jun Song, Shuo Zhang, Zhiming Ding, Bo Zheng

Abstract

Vision-language models (VLMs) excel at static perception but falter at interactive reasoning in dynamic physical environments, which demands planning and adaptation to changing outcomes. Existing physical reasoning methods often depend on abstract symbolic inputs or cannot learn and adapt from direct, pixel-based visual interaction in novel scenarios. We introduce ICPRL (In-Context Physical Reinforcement Learning), a framework inspired by In-Context Reinforcement Learning (ICRL) that enables VLMs to acquire physical intuition and adapt their policies in-context. Our approach trains a vision-grounded policy model via multi-turn Group Relative Policy Optimization (GRPO) over diverse multi-episode interaction histories. This lets the agent adapt its strategy by conditioning on past trial-and-error sequences, without any weight updates. The adaptive policy works in concert with a separately trained world model that provides explicit physical reasoning by predicting the results of candidate actions. At inference, the policy proposes candidate actions and the world model predicts their outcomes, guiding a root-node PUCT search that selects the most promising action. Evaluated on the diverse physics-based puzzle-solving tasks in the DeepPHY benchmark, ICPRL demonstrates significant improvements in both its (I) policy-only and (II) world-model-augmented stages. Notably, these gains are retained in unseen physical environments, demonstrating that the framework facilitates genuine in-context acquisition of an environment's physical dynamics from interactive experience.
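
The inference procedure described in the abstract (policy proposes candidates, world model scores them, a root-node PUCT search picks one) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `root_puct_search`, the `evaluate` callback standing in for the world model, the `+ 1` inside the square root, and the default values of `num_simulations` and `c_puct` are all assumptions made for illustration. Only the general PUCT scoring rule, a value estimate plus a prior-weighted exploration bonus, is standard.

```python
import math
from typing import Callable, Dict, Hashable


def root_puct_search(
    priors: Dict[Hashable, float],
    evaluate: Callable[[Hashable], float],
    num_simulations: int = 32,
    c_puct: float = 1.5,
) -> Hashable:
    """Pick one action at the root using PUCT over a fixed candidate set.

    priors   -- policy probabilities for each candidate action (hypothetical input)
    evaluate -- stand-in for a world model: returns a scalar value for an action
    """
    if not priors:
        raise ValueError("at least one candidate action is required")

    # Normalise the priors so they sum to one.
    total = sum(priors.values())
    priors = {a: p / total for a, p in priors.items()}

    visits = {a: 0 for a in priors}
    value_sum = {a: 0.0 for a in priors}

    for _ in range(num_simulations):
        n_total = sum(visits.values())

        def score(a: Hashable) -> float:
            # Mean value so far (0 for unvisited actions).
            q = value_sum[a] / visits[a] if visits[a] else 0.0
            # Exploration bonus; the "+ 1" keeps it non-zero on the first pass
            # (an assumption, one of several common variants).
            u = c_puct * priors[a] * math.sqrt(n_total + 1) / (1 + visits[a])
            return q + u

        action = max(priors, key=score)
        value_sum[action] += evaluate(action)
        visits[action] += 1

    # Return the most-visited candidate.
    return max(priors, key=lambda a: visits[a])


# Toy usage: the prior favours "cut_left", but the stand-in world model
# rates "cut_right" much higher, so the search shifts its visits there.
toy_priors = {"cut_left": 0.6, "cut_right": 0.3, "wait": 0.1}
toy_values = {"cut_left": 0.1, "cut_right": 0.9, "wait": 0.0}
best = root_puct_search(toy_priors, lambda a: toy_values[a])
```

Because the search is confined to the root, each simulation costs one world-model query and no tree beyond the candidate set is built. In this toy setup the prior only determines which candidates are tried first, and the value estimates dominate once every promising candidate has been visited.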

Paper Structure

This paper contains 28 sections, 4 equations, 3 figures, 7 tables, 2 algorithms.

Figures (3)

  • Figure 1: Overview of the ICPRL framework, which decouples policy learning from world modeling for robust in-context planning. Training stage: a policy model ($\pi_\theta$) is trained via Turn-Aware GRPO to generate context-aware actions, using $\gamma_{\text{turn}}$ and $\gamma_{\text{token}}$ for precise multi-turn credit assignment, and a world model ($\mathcal{M}_\phi$) is trained separately to predict physical outcomes. Inference stage: $\pi_\theta$ proposes candidate actions, which $\mathcal{M}_\phi$ evaluates by acting as an in-context physical simulator. These evaluations guide a PUCT search that selects the most promising action, enabling effective zero-shot planning.
  • Figure 2: The ICPRL framework integrates a world model with an adaptive policy.
  • Figure 3: Examples of static element annotation in the Cut the Rope game. Key static props, such as Pins, Active Pins, and Air Cushions, are marked with numerical IDs. This annotation converts pixel-level visual information into grounded tokens, enabling the agent to accurately identify and manipulate objects.
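
The caption of Figure 1 mentions Turn-Aware GRPO with $\gamma_{\text{turn}}$ and $\gamma_{\text{token}}$ but does not spell out how they enter the objective. The sketch below shows one plausible reading and should be treated as an assumption throughout. The first function is the usual GRPO group-relative advantage: each rollout's reward is standardised against the other rollouts in its group. The second function is hypothetical. It builds per-token weights that decay geometrically with distance from the end of the episode (by turn) and from the end of each turn (by token). The function names and the use of the population standard deviation are also illustrative choices.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardise each rollout's reward against its sampling group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


def discounted_token_weights(
    turn_lengths: List[int], gamma_turn: float, gamma_token: float
) -> List[List[float]]:
    """Hypothetical credit weights: later turns and later tokens weigh more.

    turn_lengths -- number of generated tokens in each turn of one rollout
    """
    num_turns = len(turn_lengths)
    weights = []
    for t, length in enumerate(turn_lengths):
        # Turns further from the final turn are discounted more.
        turn_w = gamma_turn ** (num_turns - 1 - t)
        # Within a turn, tokens further from the turn's end are discounted more.
        weights.append(
            [turn_w * gamma_token ** (length - 1 - k) for k in range(length)]
        )
    return weights


# Toy usage: four rollouts in a group, two succeeded and two failed.
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# One rollout with two turns of 2 and 3 tokens.
w = discounted_token_weights([2, 3], gamma_turn=0.9, gamma_token=0.5)
```

Under this reading, a token's training signal would be its rollout's group-relative advantage multiplied by its weight. The paper's actual formulation may differ.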