Best of Sim and Real: Decoupled Visuomotor Manipulation via Learning Control in Simulation and Perception in Real

Jialei Huang, Zhaoheng Yin, Yingdong Hu, Shuo Wang, Xingyu Lin, Yang Gao

TL;DR

The paper tackles the sim-to-real gap in robot manipulation by decoupling perception and control: learning universal control skills in physics-rich simulation with privileged state, and learning a lightweight visual bridge in the real world to map observations to the controller’s input. The two-stage Best of Sim and Real framework leverages systematic domain randomization during simulation and minimal real-world demonstrations (10–20) to achieve strong data efficiency and robust spatial generalization, outperforming end-to-end baselines. Key contributions include a two-stage training paradigm, the use of a pretrained vision backbone (e.g., DINOv2) for the perception bridge, and extensive ablations demonstrating the importance of multi-scale features and progressive fusion. The approach significantly reduces real-world data requirements and provides modular, deployable policies with demonstrated generalization to object positions and scales beyond the training distribution, highlighting practical benefits for real-world robotic manipulation.

Abstract

Sim-to-real transfer remains a fundamental challenge in robot manipulation due to the entanglement of perception and control in end-to-end learning. We present a decoupled framework that learns each component where it is most reliable: control policies are trained in simulation with privileged state to master spatial layouts and manipulation dynamics, while perception is adapted only at deployment to bridge real observations to the frozen control policy. Our key insight is that control strategies and action patterns are universal across environments and can be learned in simulation through systematic randomization, while perception is inherently domain-specific and must be learned where visual observations are authentic. Unlike existing end-to-end approaches that require extensive real-world data, our method achieves strong performance with only 10-20 real demonstrations by reducing the complex sim-to-real problem to a structured perception alignment task. We validate our approach on tabletop manipulation tasks, demonstrating superior data efficiency and out-of-distribution generalization compared to end-to-end baselines. The learned policies successfully handle object positions and scales beyond the training distribution, confirming that decoupling perception from control fundamentally improves sim-to-real transfer.
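The decoupling described above can be illustrated with a toy sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the frozen controller is a fixed proportional rule standing in for a policy learned in simulation, the "real" visual features are a synthetic noisy linear function of the state, and the bridge is fitted by least-squares regression onto the state (the abstract does not specify the training objective). The names `control_policy`, `bridge_weights`, and `act` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, FEAT_DIM = 3, 16          # hypothetical sizes
NUM_DEMOS, FRAMES_PER_DEMO = 15, 20  # within the 10-20 demonstration budget

def control_policy(state: np.ndarray) -> np.ndarray:
    """Stage 1 stand-in: a frozen controller that consumes privileged state.

    Here it is a fixed proportional rule; in the paper this policy is
    learned in simulation and then frozen.
    """
    return 0.5 * state

# Stage 2 data: (visual feature, state) pairs from a few real demonstrations.
# Assumption: features are an unknown linear function of the state plus noise.
mixing = rng.standard_normal((STATE_DIM, FEAT_DIM))
demo_states = rng.uniform(-0.2, 0.2, size=(NUM_DEMOS * FRAMES_PER_DEMO, STATE_DIM))
demo_feats = demo_states @ mixing + 0.01 * rng.standard_normal((len(demo_states), FEAT_DIM))

# Fit the bridge: map features into the controller's input space.
bridge_weights, *_ = np.linalg.lstsq(demo_feats, demo_states, rcond=None)

def act(visual_feat: np.ndarray) -> np.ndarray:
    """Deployment: the bridge output is fed to the unchanged controller."""
    return control_policy(visual_feat @ bridge_weights)

# A held-out observation: the controller itself is never retrained.
new_state = np.array([0.1, -0.05, 0.15])
new_feat = new_state @ mixing + 0.01 * rng.standard_normal(FEAT_DIM)
print("action from image features:", act(new_feat))
print("action from true state:    ", control_policy(new_state))
```

The point of the sketch is the interface: only `bridge_weights` depends on real-world data, so the real-data requirement scales with the difficulty of recovering the controller's input, not with the difficulty of the manipulation skill.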

Paper Structure

This paper contains 17 sections, 1 equation, 6 figures, 1 table.

Figures (6)

  • Figure 1: Overview of our Best of Sim and Real (BSR) framework. (a) Stage 1: Control learning with privileged state in physics simulation, where the policy learns robust action patterns through systematic domain randomization. (b) Stage 2: Visual bridge learning in the real world, where a lightweight network maps image observations to the frozen control policy's input space using expert demonstrations stored in a replay buffer.
  • Figure 2: Architecture of the visual bridge network. Multi-layer pretrained features from a frozen vision backbone are progressively refined through adaptive layers and residual blocks, then combined into low-dimensional features via an MLP head before being passed to the frozen control policy.
  • Figure 3: Manipulation tasks used for evaluation. (a) Stacking Cube: Pick and place a cube onto a target platform, requiring precise grasp and placement within a 20$\times$20 cm workspace. (b) Opening Drawer: Localize the handle, grasp, and pull to open the drawer by at least 15 cm, demanding accurate visual servoing and force control. (c) Closing Door: Push a hinged door from 90° open to fully closed while maintaining continuous contact, testing the policy's ability to handle constrained motion and contact dynamics.
  • Figure 4: Success rate as a function of real-world demonstrations. Our method achieves strong performance with just 10–20 demonstrations, while baselines require substantially more data or fail to reach comparable performance.
  • Figure 5: Spatial visualization of task completion scores for Stacking Cube as the workspace expands from the 20$\times$20 cm training region (black dashed box) to the 40$\times$40 cm evaluation area. Heatmaps show interpolated completion scores (0–4 scale), with circles indicating successful trials and crosses marking failures. Our method maintains high performance (ID: 75%, OOD: 35%) across the extended workspace, while baselines show rapid degradation beyond the training boundary.
  • ...and 1 more figure
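The visual bridge in the Figure 2 caption can be sketched as a forward pass. This is a minimal illustration under stated assumptions, not the authors' architecture: the layer count, all dimensions, the random weights, and the reading of "progressively refined" as a sequential merge of one backbone layer at a time are hypothetical, and random vectors stand in for the frozen backbone's per-layer features.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical sizes: backbone feature width, fusion width, bridge output width.
FEAT_DIM, HIDDEN, OUT_DIM, NUM_LAYERS = 384, 64, 8, 4

def relu(x: np.ndarray) -> np.ndarray:
    return np.maximum(x, 0.0)

def init(shape: tuple) -> np.ndarray:
    """Random weights scaled by fan-in; a trained bridge would learn these."""
    return rng.standard_normal(shape) / np.sqrt(shape[0])

adapt_w = [init((FEAT_DIM, HIDDEN)) for _ in range(NUM_LAYERS)]  # adaptive layers
res_w = [init((HIDDEN, HIDDEN)) for _ in range(NUM_LAYERS)]      # residual blocks
head_w1, head_w2 = init((HIDDEN, HIDDEN)), init((HIDDEN, OUT_DIM))  # MLP head

def visual_bridge(layer_feats: list) -> np.ndarray:
    """Fuse multi-layer backbone features into one low-dimensional vector."""
    fused = np.zeros(HIDDEN)
    for feat, w_a, w_r in zip(layer_feats, adapt_w, res_w):
        fused = fused + relu(feat @ w_a)   # project this layer and merge it in
        fused = fused + relu(fused @ w_r)  # residual refinement of the running state
    return relu(fused @ head_w1) @ head_w2  # MLP head to the controller's input size

# Stand-in for features taken from several layers of a frozen backbone.
layer_feats = [rng.standard_normal(FEAT_DIM) for _ in range(NUM_LAYERS)]
bridge_out = visual_bridge(layer_feats)
print(bridge_out.shape)
```

Only the bridge parameters (`adapt_w`, `res_w`, and the head) would be trained on real demonstrations in this reading; the backbone and the control policy on either side stay frozen.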