Learning to Manipulate Anywhere: A Visual Generalizable Framework For Reinforcement Learning
Zhecheng Yuan, Tianming Wei, Shuiqi Cheng, Gu Zhang, Yuanpei Chen, Huazhe Xu
TL;DR
The paper tackles the challenge of generalizing visuomotor policies to open-world visual disturbances. It introduces Maniwhere, a framework that fuses a two-view, STN-augmented visual encoder with a multi-view contrastive objective and a curriculum domain randomization schedule to stabilize training and enable zero-shot sim2real transfer across diverse hardware. Key contributions include the L_Maniwhere objective, which combines InfoNCE with a feature-alignment term; a perspective STN for cross-view alignment; and a comprehensive evaluation on 8 tasks showing stronger generalization than baselines both in simulation and on real robots. Depth-enabled transfer and cross-embodiment generalization are also demonstrated, underscoring Maniwhere’s practical value for robust, plug-and-play robotic manipulation in the wild.
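The TL;DR describes the L_Maniwhere objective as InfoNCE combined with feature alignment but does not give its exact form. The following is a minimal NumPy sketch under stated assumptions: embeddings of the same state from two camera views form positive pairs, other entries in the batch act as negatives, similarity is cosine with a temperature of 0.1, and the alignment term is the mean squared distance between L2-normalized features scaled by `align_weight`. The function names and hyperparameters are hypothetical illustrations, not the paper's implementation.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Row-wise L2 normalization so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def info_nce(z_a, z_b, temperature=0.1):
    """InfoNCE where row i of z_a and row i of z_b are a positive pair
    (assumed: same state seen from two views); other rows of z_b are negatives."""
    a, b = l2_normalize(z_a), l2_normalize(z_b)
    logits = a @ b.T / temperature                # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))           # positives sit on the diagonal

def alignment(z_a, z_b):
    # Assumed alignment term: mean squared distance between normalized features.
    a, b = l2_normalize(z_a), l2_normalize(z_b)
    return np.mean(np.sum((a - b) ** 2, axis=1))

def combined_multiview_loss(z_a, z_b, temperature=0.1, align_weight=1.0):
    # Hypothetical combination: contrastive term plus weighted alignment term.
    return info_nce(z_a, z_b, temperature) + align_weight * alignment(z_a, z_b)

# Usage: features from two nearly identical views should score a lower loss
# than features paired with unrelated random embeddings.
rng = np.random.default_rng(0)
z_view1 = rng.normal(size=(8, 32))
z_view2 = z_view1 + 0.05 * rng.normal(size=(8, 32))
z_random = rng.normal(size=(8, 32))
print(combined_multiview_loss(z_view1, z_view2) < combined_multiview_loss(z_view1, z_random))  # True
```

The contrastive term pulls matching views together while pushing apart different states in the batch; the alignment term adds a direct penalty on the distance between paired features, which is one plausible reading of "feature alignment" here.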
Abstract
Can we endow visuomotor robots with generalization capabilities to operate in diverse open-world scenarios? In this paper, we propose Maniwhere, a generalizable framework tailored for visual reinforcement learning that enables trained robot policies to generalize across combinations of multiple visual disturbance types. Specifically, we introduce a multi-view representation learning approach fused with a Spatial Transformer Network (STN) module to capture shared semantic information and correspondences among different viewpoints. In addition, we employ a curriculum-based randomization and augmentation approach to stabilize the RL training process and strengthen visual generalization. To demonstrate the effectiveness of Maniwhere, we meticulously design 8 tasks encompassing articulated-object, bimanual, and dexterous-hand manipulation, and show Maniwhere's strong visual generalization and sim2real transfer across 3 hardware platforms. Our experiments show that Maniwhere significantly outperforms existing state-of-the-art methods. Videos are provided at https://gemcollector.github.io/maniwhere/.
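The abstract mentions a curriculum-based randomization approach without specifying its schedule. As a rough illustration only, the sketch below assumes a staged curriculum in which the sampling range of each visual randomization parameter grows from a quarter of its maximum to the full range over training. The parameter names (`light_intensity`, `camera_shift_m`, `color_jitter`), their magnitudes, and the four-stage schedule are all hypothetical, not taken from the paper.

```python
import random
from dataclasses import dataclass

@dataclass
class RandomizationRanges:
    # Hypothetical maximum perturbation magnitudes at full difficulty.
    light_intensity: float = 0.5   # +/- fraction of nominal brightness
    camera_shift_m: float = 0.05   # +/- camera position offset in meters
    color_jitter: float = 0.3      # +/- per-channel RGB jitter

def curriculum_scale(step, total_steps, num_stages=4):
    """Piecewise-constant curriculum (assumed): difficulty rises in equal
    stages from 1/num_stages up to 1.0 over the course of training."""
    stage = min(step * num_stages // total_steps, num_stages - 1)
    return (stage + 1) / num_stages

def sample_domain(step, total_steps, ranges, rng):
    # Sample each randomization parameter uniformly within the current
    # curriculum-scaled range.
    s = curriculum_scale(step, total_steps)
    return {
        "light_intensity": 1.0 + rng.uniform(-s, s) * ranges.light_intensity,
        "camera_shift_m": rng.uniform(-s, s) * ranges.camera_shift_m,
        "color_jitter": [rng.uniform(-s, s) * ranges.color_jitter for _ in range(3)],
    }

# Usage
rng = random.Random(0)
print([curriculum_scale(t, 1000) for t in (0, 250, 500, 999)])  # [0.25, 0.5, 0.75, 1.0]
print(sample_domain(100, 1000, RandomizationRanges(), rng))
```

A staged schedule like this keeps early observations close to the nominal scene so the RL agent can first learn the task, then widens the visual distribution once training has stabilized; a linear or performance-triggered schedule would be equally valid alternatives.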
