ManiVID-3D: Generalizable View-Invariant Reinforcement Learning for Robotic Manipulation via Disentangled 3D Representations

Zheng Li, Pei Qu, Yufei Jia, Shihui Zhou, Haizhou Ge, Jiahang Cao, Jinni Zhou, Guyue Zhou, Jun Ma

TL;DR

ManiVID-3D addresses viewpoint generalization in 3D visual reinforcement learning for robotic manipulation by learning view-invariant representations from point clouds. It introduces ViewNet, which aligns observations from arbitrary viewpoints into a unified coordinate frame without extrinsic calibration, and trains a disentangled contrastive objective that separates view-invariant from view-dependent features, guided by a curriculum on a time-varying factor $\beta(t)$. A GPU-accelerated batch renderer that processes over 5000 frames per second makes large-scale training practical. Across 10 simulated and 5 real-world tasks, the method achieves a 40.6% higher success rate than state-of-the-art methods under viewpoint variations while using 80% fewer parameters, and it stays robust under severe viewpoint changes with strong sim-to-real performance. The work targets scalable, calibration-free 3D RL for manipulation in unstructured environments.
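To make the idea of a curriculum on $\beta(t)$ concrete, the following is a minimal sketch, not the paper's formulation. It assumes a linear ramp for $\beta(t)$, a standard InfoNCE contrastive loss, and a weighted sum of two loss terms; the summary above does not specify any of these, and the names `beta_schedule`, `info_nce`, and `curriculum_loss` are hypothetical.

```python
import math


def beta_schedule(step, total_steps, beta_max=1.0):
    """Hypothetical linear ramp for beta(t), clamped to [0, beta_max].

    The actual schedule used by ManiVID-3D is not given in this summary.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    frac = min(max(step / total_steps, 0.0), 1.0)
    return beta_max * frac


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Standard InfoNCE loss for a single anchor.

    The positive would be the same scene seen from another viewpoint;
    negatives would be features of other scenes.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]


def curriculum_loss(invariant_loss, dependent_loss, step, total_steps):
    """Assumed combination: the view-dependent term is phased in by beta(t)."""
    beta = beta_schedule(step, total_steps)
    return invariant_loss + beta * dependent_loss
```

Under these assumptions, training starts with only the view-invariant term and gradually adds the view-dependent term as `step` approaches `total_steps`. Other schedules (cosine, step-wise) would slot into `beta_schedule` without changing the rest.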

Abstract

Deploying visual reinforcement learning (RL) policies in real-world manipulation is often hindered by camera viewpoint changes. A policy trained from a fixed front-facing camera may fail when the camera is shifted -- an unavoidable situation in real-world settings where sensor placement is hard to manage appropriately. Existing methods often rely on precise camera calibration or struggle with large perspective changes. To address these limitations, we propose ManiVID-3D, a novel 3D RL architecture designed for robotic manipulation, which learns view-invariant representations through self-supervised disentangled feature learning. The framework incorporates ViewNet, a lightweight yet effective module that automatically aligns point cloud observations from arbitrary viewpoints into a unified spatial coordinate system without the need for extrinsic calibration. Additionally, we develop an efficient GPU-accelerated batch rendering module capable of processing over 5000 frames per second, enabling large-scale training for 3D visual RL at unprecedented speeds. Extensive evaluation across 10 simulated and 5 real-world tasks demonstrates that our approach achieves a 40.6% higher success rate than state-of-the-art methods under viewpoint variations while using 80% fewer parameters. The system's robustness to severe perspective changes and strong sim-to-real performance highlight the effectiveness of learning geometrically consistent representations for scalable robotic manipulation in unstructured environments.
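For context on what "aligning point clouds into a unified coordinate system" involves, here is a minimal sketch of the calibrated baseline: applying a known rigid transform $p' = Rp + t$ to camera-frame points. This is not ViewNet; according to the abstract, ViewNet produces the alignment without the extrinsics $(R, t)$ that this sketch takes as given. The function names and the example poses are hypothetical.

```python
import math


def rotation_z(theta):
    """3x3 rotation matrix for an angle theta (radians) about the z-axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def transform_points(points, rotation, translation):
    """Apply p' = R p + t to every 3D point in the cloud."""
    return [
        tuple(
            sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
            for i in range(3)
        )
        for p in points
    ]


def invert_rigid(rotation, translation):
    """Inverse of (R, t): rotation R^T and translation -R^T t."""
    rt = [[rotation[j][i] for j in range(3)] for i in range(3)]
    t_inv = [-sum(rt[i][j] * translation[j] for j in range(3)) for i in range(3)]
    return rt, t_inv


# Two hypothetical camera poses (camera-to-world extrinsics).
pose_a = (rotation_z(math.pi / 6), [0.5, 0.0, 1.0])
pose_b = (rotation_z(-math.pi / 3), [-0.2, 0.3, 0.8])

# A tiny "scene" expressed in the unified (world) frame.
world = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.5)]

# Each camera observes the same scene in its own frame (world-to-camera).
cloud_a = transform_points(world, *invert_rigid(*pose_a))
cloud_b = transform_points(world, *invert_rigid(*pose_b))

# With known extrinsics, both observations map back to the same cloud.
aligned_a = transform_points(cloud_a, *pose_a)
aligned_b = transform_points(cloud_b, *pose_b)
```

The two camera-frame clouds differ, yet both align to the same world-frame points once the extrinsics are applied. A calibration-free method has to recover an equivalent alignment from the point cloud alone.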

Paper Structure

This paper contains 18 sections, 8 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: ManiVID-3D. Our method achieves robust multi-domain generalization for manipulation tasks with superior viewpoint adaptation and sim-to-real transferability, while significantly reducing computational costs.
  • Figure 2: Overview of ManiVID-3D. (A) In the training phase, our method consists of two key components: (a) Pretrained ViewNet aligns arbitrary-viewpoint clouds collected in simulation to a unified frame without extrinsic calibration; (b) A disentanglement encoder extracts view-invariant features that are used to train manipulation policies with strong cross-view generalization. (B) In the deployment phase, we introduce a multi-stage processing pipeline specifically designed for camera-coordinate point clouds to bridge the sim-to-real domain gap, enabling zero-shot transfer to real-world deployment.
  • Figure 3: Simulation snapshots. Reference view and evaluation view at different angular offsets for selected tasks.
  • Figure 4: Robustness to (a) viewpoint variation and (b) reference viewpoint selection. ManiVID-3D maintains consistently strong performance across varying degrees of view offsets and different reference viewpoint choices, whereas Maniwhere exhibits a clear performance degradation trend.
  • Figure 5: RL training curves. ManiVID-3D shows better convergence than Maniwhere in most tasks.
  • ...and 2 more figures