Learning to Manipulate Anywhere: A Visual Generalizable Framework For Reinforcement Learning

Zhecheng Yuan; Tianming Wei; Shuiqi Cheng; Gu Zhang; Yuanpei Chen; Huazhe Xu

TL;DR

The paper tackles the challenge of generalizing visuomotor policies to open-world visual disturbances. It introduces Maniwhere, a framework that fuses a two-view, STN-augmented visual encoder with a multi-view contrastive objective and a curriculum domain randomization schedule to stabilize training and enable zero-shot sim2real transfer across diverse hardware. Key contributions include the $\mathcal{L}_{\text{Maniwhere}}$ objective, which combines InfoNCE with feature alignment; a perspective STN for cross-view alignment; and an evaluation across 8 tasks showing stronger generalization than the baselines in both simulation and on real robots. Depth-enabled transfer and cross-embodiment generalization are also demonstrated, underscoring Maniwhere’s practical value for robust, plug-and-play robotic manipulation in the wild.
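As a rough illustration of how an InfoNCE term and a feature alignment term can be combined, the following is a minimal sketch in plain Python. It is not the paper's implementation: this summary does not give the exact form of the alignment term or its weighting, so the squared-distance alignment, the `weight` and `temperature` parameters, and all function names here are assumptions made for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE over a batch of feature vectors.

    Row i of `positives` is the positive for row i of `anchors`;
    every other row in `positives` acts as a negative.
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(a)


def alignment(anchors, positives):
    """Mean squared distance between paired features (assumed form)."""
    total = 0.0
    for u, v in zip(anchors, positives):
        total += sum((x - y) ** 2 for x, y in zip(u, v))
    return total / len(anchors)


def combined_loss(fixed_view, moving_view, weight=1.0, temperature=0.1):
    """Contrastive term plus a weighted alignment term (hypothetical weighting)."""
    return (info_nce(fixed_view, moving_view, temperature)
            + weight * alignment(fixed_view, moving_view))
```

Here `fixed_view` and `moving_view` stand for encoder features of the same scene observed from two viewpoints. When the paired features agree, both terms are small; when the pairing is scrambled, the contrastive term grows.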

Abstract

Can we endow visuomotor robots with generalization capabilities to operate in diverse open-world scenarios? In this paper, we propose Maniwhere, a generalizable framework tailored for visual reinforcement learning, enabling the trained robot policies to generalize across a combination of multiple visual disturbance types. Specifically, we introduce a multi-view representation learning approach fused with a Spatial Transformer Network (STN) module to capture shared semantic information and correspondences among different viewpoints. In addition, we employ a curriculum-based randomization and augmentation approach to stabilize the RL training process and strengthen the visual generalization ability. To exhibit the effectiveness of Maniwhere, we meticulously design 8 tasks encompassing articulated objects, bi-manual, and dexterous hand manipulation tasks, demonstrating Maniwhere's strong visual generalization and sim2real transfer abilities across 3 hardware platforms. Our experiments show that Maniwhere significantly outperforms existing state-of-the-art methods. Videos are provided at https://gemcollector.github.io/maniwhere/.
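The idea of curriculum-based randomization can be sketched as a schedule that keeps the visual disturbance off early in training and then widens its range gradually. The example below is only an illustrative sketch: the linear ramp, the camera-offset example, and every name and default value (`warmup_steps`, `ramp_steps`, `max_offset_deg`) are assumptions, since the schedule actually used by Maniwhere is not described in this summary.

```python
import random


def randomization_scale(step, warmup_steps, ramp_steps):
    """Return a strength in [0, 1]: 0 during warmup, then a linear ramp to 1.

    `ramp_steps` must be positive.
    """
    if step < warmup_steps:
        return 0.0
    return min(1.0, (step - warmup_steps) / ramp_steps)


def sample_camera_offset(step, max_offset_deg=20.0, warmup_steps=1000,
                         ramp_steps=4000, rng=random):
    """Sample a camera-angle perturbation whose range grows with training step."""
    scale = randomization_scale(step, warmup_steps, ramp_steps)
    bound = scale * max_offset_deg
    return rng.uniform(-bound, bound)
```

With these hypothetical defaults, the policy sees unperturbed views for the first 1000 steps, and the full ±20 degree range only after step 5000, which is one simple way to avoid destabilizing early RL training.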

Paper Structure (29 sections, 5 equations, 13 figures, 11 tables)

Introduction
Method
Multi-View Representation Objective
Curriculum Domain Randomization
Inserting the STN Module
Experiments
Experiment Setup
Baselines
Simulation Results
Real-World Experiments
Ablations
Qualitative Analysis
Imitation Learning
Related Work
Conclusion and Limitations
...and 14 more sections

Figures (13)

Figure 1: Maniwhere. Our framework is capable of training visuomotor robots that generalize effectively across various types of visual changes. Furthermore, Maniwhere can adeptly handle diverse real-world visual scenarios with various appearances and camera views in a zero-shot manner.
Figure 2: Overview of Maniwhere. The agent takes two images as input captured from different viewpoints with data augmentation and then passes them through a visual encoder containing an STN module to obtain visual representations. Subsequently, we employ multi-view representation learning to train the visual encoder while using a curriculum learning approach to stabilize the entire RL training process. Once the agent is trained in simulation, we can perform sim2real transfer.
Figure 3: (a) Generalization results of visual appearances. Maniwhere exhibits minimal performance drop when encountering variations in visual appearance, whereas MV-MWM is unable to handle these visual scenarios. (b) STN visualization. STN is capable of transforming views from various other perspectives to align closely with the fixed view used during training.
Figure 4: Real-world setup. Our real-world experiments encompass 3 types of robotic arms, 2 dexterous hands, and various tasks including articulated objects and bi-manual manipulation.
Figure 5: Real-world snapshots. Real-world experiments under different visual conditions.
...and 8 more figures
