Point Cloud Models Improve Visual Robustness in Robotic Learners

Skand Peri; Iain Lee; Chanho Kim; Li Fuxin; Tucker Hermans; Stefan Lee

TL;DR

The paper studies how robust vision-based robotic control is to visual perturbations and shows that XYZ-RGB point-cloud inputs yield stronger robustness and faster learning than RGB-D inputs. It introduces Point Cloud World Models (PCWM), a model-based RL framework that operates on partial point clouds using an RSSM latent world model and a Dreamer-like policy, improving sample efficiency and robustness across manipulation tasks. Key findings include substantial robustness gains under changes in viewpoint, field of view (FoV), and lighting, and faster adaptation when finetuning in perturbed environments, which highlights the practical value of 3D scene reasoning for robotic learners. Overall, PCWM demonstrates that explicit 3D geometric representations can meaningfully improve performance and transfer in robotic control.
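To make the RSSM mentioned above more concrete, here is a minimal NumPy sketch of one latent step as the Figure 2 caption describes it: a deterministic hidden state $h_t$ summarizes the history, a prior over $\hat{z}_t$ is predicted from $h_t$ alone, a posterior over $z_t$ additionally sees the encoded observation, and a KL term ties the two together. This is not the authors' implementation. Assumptions: diagonal-Gaussian latents (DreamerV3 itself uses categorical latents), a tanh recurrent cell in place of a GRU, random untrained weights, random vectors standing in for PointConv encodings, and hypothetical dimensions; the reward heads are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
H_DIM, Z_DIM, A_DIM, E_DIM = 32, 8, 4, 16  # hypothetical sizes

def linear(in_dim, out_dim):
    """Random (untrained) linear layer standing in for a learned network."""
    return rng.normal(scale=0.1, size=(in_dim, out_dim)), np.zeros(out_dim)

def apply(layer, x):
    w, b = layer
    return x @ w + b

def split_stats(x):
    """Split a head's output into a mean and a positive standard deviation."""
    mu, raw = np.split(x, 2, axis=-1)
    return mu, np.log1p(np.exp(raw)) + 0.1  # softplus plus a small floor

def gaussian_kl(mu_q, std_q, mu_p, std_p):
    """KL(q || p) for diagonal Gaussians, summed over latent dimensions."""
    return np.sum(np.log(std_p / std_q)
                  + (std_q**2 + (mu_q - mu_p)**2) / (2 * std_p**2) - 0.5, axis=-1)

transition = linear(H_DIM + Z_DIM + A_DIM, H_DIM)  # h_t from (h_{t-1}, z_{t-1}, a_{t-1})
prior_head = linear(H_DIM, 2 * Z_DIM)               # prior over z_hat_t given h_t only
post_head = linear(H_DIM + E_DIM, 2 * Z_DIM)        # posterior over z_t given h_t and e_t

def rssm_step(h_prev, z_prev, a_prev, embed):
    h = np.tanh(apply(transition, np.concatenate([h_prev, z_prev, a_prev])))
    prior_mu, prior_std = split_stats(apply(prior_head, h))
    post_mu, post_std = split_stats(apply(post_head, np.concatenate([h, embed])))
    z = post_mu + post_std * rng.standard_normal(Z_DIM)  # reparameterized posterior sample
    return h, z, gaussian_kl(post_mu, post_std, prior_mu, prior_std)

# Roll the model over T steps; random vectors stand in for encoded observations e_t.
T = 5
embeds = rng.standard_normal((T, E_DIM))
actions = rng.standard_normal((T, A_DIM))
h, z, kls = np.zeros(H_DIM), np.zeros(Z_DIM), []
for t in range(T):
    a_prev = actions[t - 1] if t > 0 else np.zeros(A_DIM)
    h, z, kl = rssm_step(h, z, a_prev, embeds[t])
    kls.append(kl)
kl_loss = float(np.mean(kls))  # temporal-consistency term averaged over the sequence
```

In a trained model, a KL term like this is minimized jointly with the reward-prediction losses, which pushes the prior toward what the posterior infers from observations.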

Abstract

Visual control policies can suffer significant performance degradation when visual conditions such as lighting or camera position differ from those seen during training, often showing sharp declines in capability even for minor differences. In this work, we examine the robustness of RGB-D and point-cloud-based visual control policies to a suite of such visual changes. To perform these experiments on both model-free and model-based reinforcement learners, we introduce a novel Point Cloud World Model (PCWM) and point-cloud-based control policies. Our experiments show that policies that explicitly encode point clouds are significantly more robust than their RGB-D counterparts. Further, we find that our proposed PCWM significantly outperforms prior work in sample efficiency during training. Taken together, these results suggest that reasoning about the 3D scene through point clouds can improve performance, reduce learning time, and increase robustness for robotic learners. Project Webpage: https://pvskand.github.io/projects/PCWM
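The abstract contrasts RGB-D and point-cloud inputs. One common way to turn a single RGB-D frame into a partial XYZ-RGB point cloud is pinhole back-projection followed by a camera-to-world transform. The sketch below assumes known intrinsics $K$ and a known camera pose; the source does not say how PCWM's point clouds are produced, so the function name and conventions here are hypothetical.

```python
import numpy as np

def rgbd_to_xyzrgb(rgb, depth, K, cam_to_world):
    """Back-project an RGB-D frame into an (N, 6) XYZ-RGB point cloud in the world frame.

    rgb: (H, W, 3) colors; depth: (H, W) depth along the camera z-axis (0 = missing);
    K: (3, 3) pinhole intrinsics; cam_to_world: (4, 4) homogeneous camera pose.
    """
    h, w = depth.shape
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")  # row, column indices
    valid = depth > 0                                   # skip pixels without depth
    z = depth[valid]
    x = (u[valid] - K[0, 2]) * z / K[0, 0]
    y = (v[valid] - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=-1)  # homogeneous camera-frame points
    pts_world = (pts_cam @ cam_to_world.T)[:, :3]
    return np.concatenate([pts_world, rgb[valid]], axis=-1)

# Two cameras 1 unit apart along x both see one red point at world (0, 0, 2).
K = np.array([[2.0, 0.0, 2.0], [0.0, 2.0, 2.0], [0.0, 0.0, 1.0]])
rgb = np.zeros((5, 5, 3))
rgb[..., 0] = 1.0
depth_a = np.zeros((5, 5))
depth_a[2, 2] = 2.0          # camera A at the origin sees the point at the image center
depth_b = np.zeros((5, 5))
depth_b[2, 1] = 2.0          # camera B, shifted +1 along x, sees it one pixel to the left
pose_a, pose_b = np.eye(4), np.eye(4)
pose_b[0, 3] = 1.0
cloud_a = rgbd_to_xyzrgb(rgb, depth_a, K, pose_a)
cloud_b = rgbd_to_xyzrgb(rgb, depth_b, K, pose_b)
# cloud_a and cloud_b both equal [[0, 0, 2, 1, 0, 0]]
```

The two depth images differ, yet the reconstructed world-frame point is identical. This is one intuition (not a claim made in the source) for why point clouds expressed in a fixed frame may be less sensitive to viewpoint changes than raw RGB-D pixels.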

Paper Structure

This paper contains 16 sections, 3 equations, 5 figures, 3 tables.

Introduction
Related Work
Point Cloud World Models
World Model
Policy Learning
Experimental Setup
Results
Point Cloud World Models (PCWMs) can be more sample-efficient learners than analogous RGB-D models.
Point-cloud-based policies are more robust to changes in visual conditions than analogous RGB-D policies.
PCWMs adapt more quickly than their RGB-D counterparts when trained further in viewpoint-perturbed environments.
Discussion & Limitations
Conclusion
Acknowledgements
Appendix
Hyperparameters
...and 1 more section

Figures (5)

Figure 1: Motivating Example. We compare DreamerV3, a state-of-the-art RL model trained on RGB-D inputs, with our Point Cloud World Model (PCWM) on a simple cube-lifting task. We find that the point-cloud model is significantly more robust to viewpoint changes than its RGB-D counterpart.
Figure 2: PCWM training: Given a sequence of $T$ partial point cloud observations $o_{1:T}$, we encode them using a PointConv encoder. For each timestep $t$, we compute a posterior stochastic latent $z_t$ from an encoding of $o_t$ and the hidden state $h_t$, which encodes the history. The hidden state is also used to compute the prior latent $\hat{z}_t$, which is used to predict multi-step rewards over a horizon $H$; this provides supervision for the world model alongside a KL loss for temporal consistency (see the Point Cloud World Models section).
Figure 3: Task performance: We report training curves for six manipulation tasks. Our proposed PCWM either matches or outperforms the baselines in all settings, with strong sample efficiency gains in several tasks. PCWM curves for the Pick & Place tasks (top row) are truncated once the task is solved.
Figure 4: Fine-grained robustness analysis for the Clutter Pick task. Example frames from each condition are shown above policy performance plots. Grey shaded backgrounds indicate the original training environment. We find RGB-D models generalize poorly to new viewpoints or FoV in this setting.
Figure 5: Comparison of point cloud encoders: PointConv [wu2019pointconv] consistently shows greater sample efficiency than PointNet [Qi2016PointNetDL] in the StackCube and OpenCabinetDoor tasks.
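Figure 5 compares two point-cloud encoders. For reference, the core idea behind PointNet-style encoders is a shared per-point MLP followed by a symmetric max-pool, which makes the global feature independent of point order and point count. The NumPy sketch below illustrates only that idea, with random weights and hypothetical layer sizes; it is not PointConv (which instead learns convolution weights over local neighborhoods of points), and it is not the encoder used in PCWM.

```python
import numpy as np

def pointnet_style_encoder(points, layers):
    """Map an (N, C) point cloud to a fixed-size global feature vector.

    Every point passes through the same MLP; a max over the point axis then
    pools them, so the result does not depend on point order or on N.
    """
    x = points
    for w, b in layers:
        x = np.maximum(x @ w + b, 0.0)  # shared linear layer + ReLU, applied per point
    return x.max(axis=0)                # symmetric pooling over points

rng = np.random.default_rng(0)
dims = [6, 32, 64]  # XYZ-RGB input -> 64-d feature; hypothetical sizes
layers = [(rng.normal(scale=0.3, size=(i, o)), np.zeros(o)) for i, o in zip(dims[:-1], dims[1:])]

cloud = rng.standard_normal((128, 6))
feat = pointnet_style_encoder(cloud, layers)
feat_shuffled = pointnet_style_encoder(cloud[rng.permutation(len(cloud))], layers)
feat_subset = pointnet_style_encoder(cloud[:50], layers)  # fewer points, same output size
```

Because the max-pool is symmetric, shuffling the points leaves the feature unchanged, and clouds of different sizes map to vectors of the same length, which suits partial point clouds whose point count varies from frame to frame.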