Phys2Real: Fusing VLM Priors with Interactive Online Adaptation for Uncertainty-Aware Sim-to-Real Manipulation

Maggie Wang, Stephen Tian, Aiden Swann, Ola Shorinwa, Jiajun Wu, Mac Schwager

TL;DR

Phys2Real tackles the sim-to-real manipulation gap by fusing visual physics priors from VLMs with online, ensemble-based adaptation to produce physics-conditioned policies. It constructs physically informed digital twins via real-to-sim scene reconstruction with Gaussian Splatting, trains parameter-conditioned policies with a two-stage online adaptation mechanism, and performs test-time fusion using inverse-variance weighting to integrate VLM priors and interaction data. The approach yields substantial gains over domain randomization on T-block and hammer pushing, particularly under varying CoM and off-center mass distributions, while preserving efficiency in execution. This work demonstrates that combining foundation-model visual reasoning with interactive online adaptation can yield robust, interpretable, and data-efficient sim-to-real transfer for complex manipulation tasks.
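The test-time fusion step can be illustrated with a minimal sketch. It assumes the textbook form of inverse-variance weighting for two independent scalar Gaussian estimates; the paper's exact fusion equation is not reproduced in this summary, and the function name, argument names, and numeric values below are hypothetical.

```python
def fuse_inverse_variance(theta_vlm, sigma_vlm, theta_rma, sigma_rma):
    """Fuse two scalar estimates, weighting each by the inverse of its variance.

    Assumption: both estimates are independent and roughly Gaussian, with
    standard deviations sigma_vlm and sigma_rma (hypothetical names).
    """
    w_vlm = 1.0 / sigma_vlm ** 2
    w_rma = 1.0 / sigma_rma ** 2
    theta_fused = (w_vlm * theta_vlm + w_rma * theta_rma) / (w_vlm + w_rma)
    sigma_fused = (1.0 / (w_vlm + w_rma)) ** 0.5
    return theta_fused, sigma_fused


# Made-up numbers: a vague VLM prior and a sharper interaction-based estimate.
theta, sigma = fuse_inverse_variance(
    theta_vlm=0.0, sigma_vlm=0.04, theta_rma=0.03, sigma_rma=0.02
)
print(theta, sigma)
```

Under this weighting the fused estimate sits closer to whichever source reports the smaller variance, and the fused standard deviation is never larger than either input.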

Abstract

Learning robotic manipulation policies directly in the real world can be expensive and time-consuming. While reinforcement learning (RL) policies trained in simulation present a scalable alternative, effective sim-to-real transfer remains challenging, particularly for tasks that require precise dynamics. To address this, we propose Phys2Real, a real-to-sim-to-real RL pipeline that combines vision-language model (VLM)-inferred physical parameter estimates with interactive adaptation through uncertainty-aware fusion. Our approach consists of three core components: (1) high-fidelity geometric reconstruction with 3D Gaussian splatting, (2) VLM-inferred prior distributions over physical parameters, and (3) online physical parameter estimation from interaction data. Phys2Real conditions policies on interpretable physical parameters, refining VLM predictions with online estimates via ensemble-based uncertainty quantification. On planar pushing tasks of a T-block with varying center of mass (CoM) and a hammer with an off-center mass distribution, Phys2Real achieves substantial improvements over a domain randomization baseline: 100% vs 79% success rate for the bottom-weighted T-block, 57% vs 23% in the challenging top-weighted T-block, and 15% faster average task completion for hammer pushing. Ablation studies indicate that the combination of VLM and interaction information is essential for success. Project website: https://phys2real.github.io/ .
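To make "ensemble-based uncertainty quantification" concrete, here is a small sketch of the common deep-ensemble decomposition: the spread of the members' predictions is treated as epistemic uncertainty, and the average of their self-reported variances as aleatoric uncertainty. This is an assumed, standard formulation rather than the paper's exact equations; the function name and the example values are hypothetical.

```python
from statistics import fmean, pvariance


def ensemble_uncertainty(means, variances):
    """Summarize an ensemble of (mean, variance) predictions for one parameter.

    means:     one point estimate per ensemble member (e.g. a CoM offset)
    variances: the variance each member predicts for its own estimate
    Assumption: population variance is used for the spread across members.
    """
    theta = fmean(means)           # ensemble estimate
    epistemic = pvariance(means)   # disagreement between members
    aleatoric = fmean(variances)   # average predicted noise
    return theta, epistemic, aleatoric


# Made-up outputs from a three-member ensemble.
theta, epistemic, aleatoric = ensemble_uncertainty(
    means=[0.01, 0.02, 0.03],
    variances=[1e-4, 1e-4, 4e-4],
)
print(theta, epistemic, aleatoric)
```

A downstream fusion step could use the sum of the two terms as the variance of the interaction-based estimate, though that choice is also an assumption here.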

Paper Structure

This paper contains 22 sections, 4 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Phys2Real is a real-to-sim-to-real pipeline for robotic manipulation that combines VLM-based physical parameter estimation with interaction-based adaptation through uncertainty-aware fusion. It comprises three stages: (I) real-to-sim: object reconstruction from segmented Gaussian Splats into simulation-ready meshes, (II) policy learning: reinforcement learning of policies conditioned on physical parameters such as the center of mass (CoM) of an object, and (III) sim-to-real transfer: uncertainty-aware fusion of VLM priors and interaction-based estimates for online adaptation.
  • Figure 2: Real-to-sim mesh reconstruction pipeline. Starting from a video of the object, we extract image frames and segment the target object using SAM-2. We then train a GSplat and extract a surface-aligned object-centric mesh using SuGaR [guedon_sugar_2023]. Finally, we generate a clean, watertight mesh, resulting in a simulation-ready asset.
  • Figure 3: Phys2Real policy training. The policy and adaptation models are trained in three stages, inspired by RMA [kumar_rma_2021]. Phase 1: the policy is conditioned on ground-truth physical properties (e.g., CoM) from simulation. Phase 1.5: the policy is fine-tuned with noisy physical properties to build robustness to the noisy downstream estimates produced by fusing the VLM and adaptation estimates. Phase 2: an ensemble of $N$ encoders is trained, each taking in a history of observations and actions. The variance of the ensemble estimates provides the epistemic uncertainty (Eq. \ref{eq:epistemic}). Each encoder outputs a physical property estimate and an associated uncertainty, representing the model's aleatoric uncertainty (Eq. \ref{eq:aleatoric}). At test time, we fuse the adaptation estimates and VLM estimates via inverse-variance weighting (Eq. \ref{eq:fusion}).
  • Figure 4: VLM priors for task-relevant physical parameters. We query a VLM (GPT-5 [openai2025gpt5]) to provide an estimated CoM and uncertainty given an image of the object. For each of $N$ images, we repeat the query $M$ times. We then calculate the average VLM estimate $\bar{\theta}_{\textrm{vlm}}$ along with the average uncertainty $\bar{\sigma}_{\textrm{vlm}}$. This is fused with the RMA estimate using Eq. (\ref{eq:fusion}).
  • Figure 5: T-block pushing task results. We compare the cumulative distribution functions (CDFs) of position errors at the end of rollouts from each policy, which show the percentage of experimental trials (y-axis) with position error less than or equal to a given value (x-axis). For an optimal policy, the curve should rise steeply and stay toward the left, indicating that most trials have low final error. We evaluate T-block pushing under two weight configurations: (a) weight at the top and (b) weight at the bottom. Overall, Phys2Real (green) consistently has low position error throughout the percentiles.
  • ...and 1 more figure
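The CDF reading described for Figure 5 can be sketched as follows: for each error value, compute the fraction of trials whose final position error is less than or equal to it. The helper names and the error values are hypothetical and are not taken from the paper's experiments.

```python
def fraction_within(errors, threshold):
    """Fraction of trials whose final position error is <= threshold."""
    return sum(e <= threshold for e in errors) / len(errors)


def empirical_cdf(errors):
    """Return (error, fraction of trials with error <= that value) pairs."""
    return [(x, fraction_within(errors, x)) for x in sorted(set(errors))]


# Made-up final position errors (in meters) for four rollouts.
final_errors = [0.02, 0.05, 0.01, 0.10]
cdf_points = empirical_cdf(final_errors)
print(cdf_points)
print(fraction_within(final_errors, 0.05))
```

A policy whose curve rises steeply near the left edge has most of its rollouts ending with a small error, which is the pattern the caption attributes to Phys2Real.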