Best of Sim and Real: Decoupled Visuomotor Manipulation via Learning Control in Simulation and Perception in Real
Jialei Huang, Zhaoheng Yin, Yingdong Hu, Shuo Wang, Xingyu Lin, Yang Gao
TL;DR
The paper tackles the sim-to-real gap in robot manipulation by decoupling perception and control: learning universal control skills in physics-rich simulation with privileged state, and learning a lightweight visual bridge in the real world to map observations to the controller’s input. The two-stage Best of Sim and Real framework leverages systematic domain randomization during simulation and minimal real-world demonstrations (10–20) to achieve strong data efficiency and robust spatial generalization, outperforming end-to-end baselines. Key contributions include a two-stage training paradigm, the use of a pretrained vision backbone (e.g., DINOv2) for the perception bridge, and extensive ablations demonstrating the importance of multi-scale features and progressive fusion. The approach significantly reduces real-world data requirements and provides modular, deployable policies with demonstrated generalization to object positions and scales beyond the training distribution, highlighting practical benefits for real-world robotic manipulation.
Abstract
Sim-to-real transfer remains a fundamental challenge in robot manipulation due to the entanglement of perception and control in end-to-end learning. We present a decoupled framework that learns each component where it is most reliable: control policies are trained in simulation with privileged state to master spatial layouts and manipulation dynamics, while perception is adapted only at deployment to bridge real observations to the frozen control policy. Our key insight is that control strategies and action patterns are universal across environments and can be learned in simulation through systematic randomization, while perception is inherently domain-specific and must be learned where visual observations are authentic. Unlike existing end-to-end approaches that require extensive real-world data, our method achieves strong performance with only 10–20 real demonstrations by reducing the complex sim-to-real problem to a structured perception alignment task. We validate our approach on tabletop manipulation tasks, demonstrating superior data efficiency and out-of-distribution generalization compared to end-to-end baselines. The learned policies successfully handle object positions and scales beyond the training distribution, confirming that decoupling perception from control fundamentally improves sim-to-real transfer.
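The decoupling described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the proportional controller stands in for a policy trained in simulation on privileged state, a per-axis least-squares fit stands in for the learned perception bridge, and `fake_camera` is a hypothetical affine camera that replaces real images and pretrained visual features. All names and constants are assumptions made for illustration.

```python
import random

def frozen_controller(state, gain=0.5):
    """Stage 1 stand-in: a policy over privileged state (gripper xy, object xy).
    In the paper this would be trained in simulation, then frozen."""
    gx, gy, ox, oy = state
    return (gain * (ox - gx), gain * (oy - gy))

def fit_line(xs, ys):
    """Closed-form 1-D least squares for y = a * x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    return a, mean_y - a * mean_x

def fit_bridge(pixels, positions):
    """Stage 2 stand-in: fit a map from observations to the controller's
    state input using a handful of real demonstrations."""
    ax, bx = fit_line([p[0] for p in pixels], [q[0] for q in positions])
    ay, by = fit_line([p[1] for p in pixels], [q[1] for q in positions])

    def bridge(pixel):
        return (ax * pixel[0] + bx, ay * pixel[1] + by)

    return bridge

def fake_camera(position):
    """Hypothetical affine camera: world xy (metres) to pixel uv."""
    return (400.0 * position[0] + 320.0, -400.0 * position[1] + 240.0)

# 15 "real" demonstrations, matching the 10-20 range quoted in the abstract.
rng = random.Random(0)
demo_positions = [(rng.uniform(-0.2, 0.2), rng.uniform(-0.2, 0.2)) for _ in range(15)]
demo_pixels = [fake_camera(p) for p in demo_positions]
bridge = fit_bridge(demo_pixels, demo_positions)

def act(gripper_xy, pixel):
    """Deployment: the bridge estimates state, the frozen controller acts on it."""
    ox, oy = bridge(pixel)
    return frozen_controller((gripper_xy[0], gripper_xy[1], ox, oy))

# Object placed outside the range covered by the demonstrations.
target = (0.35, -0.3)
action = act((0.0, 0.0), fake_camera(target))
print(action)
```

In this toy setting the controller is never retrained; only the bridge sees the "real" data. The extrapolation to the out-of-range target works here because the toy camera is exactly affine, so it should be read as an illustration of the interface between the two stages rather than as evidence for the generalization results reported in the paper.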
