PointPatchRL -- Masked Reconstruction Improves Reinforcement Learning on Point Clouds

Balázs Gyenes, Nikolai Franke, Philipp Becker, Gerhard Neumann

TL;DR

This work introduces PointPatchRL (PPRL), a method for RL on point clouds. PPRL builds on the common paradigm of dividing point clouds into overlapping patches, tokenizing them, and processing the tokens with transformers, and it significantly improves on other point-cloud processing architectures previously used for RL.

Abstract

Perceiving the environment via cameras is crucial for Reinforcement Learning (RL) in robotics. While images are a convenient form of representation, they often complicate extracting important geometric details, especially with varying geometries or deformable objects. In contrast, point clouds naturally represent this geometry and easily integrate color and positional data from multiple camera views. However, while deep learning on point clouds has seen many recent successes, RL on point clouds is under-researched, with only the simplest encoder architecture considered in the literature. We introduce PointPatchRL (PPRL), a method for RL on point clouds that builds on the common paradigm of dividing point clouds into overlapping patches, tokenizing them, and processing the tokens with transformers. PPRL provides significant improvements compared with other point-cloud processing architectures previously used for RL. We then complement PPRL with masked reconstruction for representation learning and show that our method outperforms strong model-free and model-based baselines on image observations in complex manipulation tasks containing deformable objects and variations in target object geometry. Videos and code are available at https://alrhub.github.io/pprl-website
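
The patching step mentioned in the abstract can be illustrated with a small sketch. The abstract does not say how patches are formed, so this example assumes the approach common in patch-based point-cloud transformers: farthest point sampling picks patch centers, and each patch is the set of nearest neighbors of its center, expressed relative to that center. The function names `farthest_point_sampling` and `make_patches` and all sizes are hypothetical, not taken from the PPRL code.

```python
import math
import random


def farthest_point_sampling(points, num_centers, start=0):
    """Greedily pick indices of points that lie far apart from each other."""
    chosen = [start]
    # Distance from every point to its nearest already-chosen center.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < num_centers:
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        chosen.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return chosen


def make_patches(points, num_patches, patch_size):
    """Return (centers, patches); each patch holds center-relative coordinates."""
    center_idx = farthest_point_sampling(points, num_patches)
    centers = [points[i] for i in center_idx]
    patches = []
    for c in centers:
        # k-nearest neighbors of the center (assumed grouping rule).
        nearest = sorted(points, key=lambda p: math.dist(p, c))[:patch_size]
        # Subtract the center so each patch is described in local coordinates.
        patches.append([tuple(a - b for a, b in zip(p, c)) for p in nearest])
    return centers, patches


if __name__ == "__main__":
    rng = random.Random(0)
    cloud = [(rng.random(), rng.random(), rng.random()) for _ in range(256)]
    centers, patches = make_patches(cloud, num_patches=16, patch_size=32)
    print(len(centers), len(patches), len(patches[0]))
```

With 16 patches of 32 points drawn from a 256-point cloud, the patches cover 512 point slots, so they necessarily overlap. In a full pipeline each patch would then be passed through a small learned network to produce one token per patch; that step is omitted here.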

Paper Structure

This paper contains 26 sections, 3 equations, 14 figures, 2 tables.

Figures (14)

  • Figure 1: Schematic of PointPatchRL (PPRL) with an auxiliary masked reconstruction loss, trained end-to-end for RL. Top: We train a patching-based tokenizer and transformer encoder to compute a latent embedding for the policy and state-action-value estimation using sequence pooling. The entire pipeline learns from the critic's gradients, while the latent embedding is detached before it is passed to the actor. Bottom: We augment policy learning with masked reconstruction. Using the token sorting, masking, and transformer decoder introduced by PointGPT (chen_pointgpt_2023), we minimize the Chamfer distance for the point positions and the mean squared reconstruction error for colors. This auxiliary loss provides an additional training signal for the shared encoder and tokenizer, improving RL performance and sample efficiency.
  • Figure 2: Average success rates and $95\%$ Bootstrap Confidence Intervals (agarwal2021deep) for all methods on 2 sofaenv and 4 ManiSkill2 environments. Our method, PPRL + Aux, achieves top performance on $5$ of $6$ environments. On DeflectSpheres, ThreadInHole, and PushChair, PPRL without representation learning achieves roughly the same success as PPRL + Aux, demonstrating the effectiveness of our neural network architecture even without auxiliary learning objectives. On OpenCabinetDrawer and OpenCabinetDoor, two challenging tasks with non-trivial variations in scene geometry, adding an auxiliary reconstruction loss significantly improves learning and is required for solving the task. On TurnFaucet, DrQ-v2 outperforms our method, potentially because a static camera is available, which makes the learning task easier for image-based methods.
  • Figure 3: Visualization of successful PPRL + Aux trajectories on the OpenCabinetDoor environment from a static rendering camera that the agent does not have access to. Each column shows a single episode. Agents trained with PPRL + Aux adapt to varying geometries, including handle size and orientation, and whether the door opens to the left or right. Our method is able to coordinate the movements of the gripper and the base and generalize well over these factors.
  • Figure 4: Average success rates and $95\%$ Bootstrap Confidence Intervals for agents trained with PPRL + Aux on selected environments with varying numbers of object models. Although there is a general trend that more object models reduce success rates, the effect is not always strong, showing that the encoder can generalize well. Success on OpenCabinetDrawer is approximately the same for $5$, $10$, or $15$ object models and only begins to decrease at $25$.
  • Figure 5: Average success rates and $95\%$ Bootstrap Confidence Intervals for agents trained with and without color reconstruction loss, on all environments with color point cloud observations. When training without color reconstruction loss, color is still observed, but only the signal from RL encourages the agent to condition on color features. In particular, the hard Cabinet tasks benefit from explicitly reconstructing the color.
  • ...and 9 more figures
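
The auxiliary loss described in the Figure 1 caption combines a Chamfer distance on point positions with a mean squared error on colors. The following is a minimal sketch of one common variant (mean of squared nearest-neighbor distances in both directions), not necessarily the exact formulation used in the paper. The weighting parameter `color_weight` and the assumption that predicted and target colors are paired by index are illustrative choices.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def chamfer_distance(pred, target):
    """Symmetric Chamfer distance between two point sets.

    Each point is matched to its nearest neighbor in the other set,
    so the result does not depend on the order of the points.
    """
    def one_sided(src, dst):
        return sum(min(sq_dist(p, q) for q in dst) for p in src) / len(src)

    return one_sided(pred, target) + one_sided(target, pred)


def color_mse(pred_colors, target_colors):
    """Mean squared error over all color channels (colors paired by index)."""
    total = sum(sq_dist(p, q) for p, q in zip(pred_colors, target_colors))
    return total / (len(pred_colors) * len(pred_colors[0]))


def reconstruction_loss(pred_xyz, target_xyz, pred_rgb, target_rgb,
                        color_weight=1.0):
    """Position term plus weighted color term (the weight is an assumption)."""
    return (chamfer_distance(pred_xyz, target_xyz)
            + color_weight * color_mse(pred_rgb, target_rgb))


if __name__ == "__main__":
    xyz_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    xyz_b = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0)]  # same set, different order
    rgb_a = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
    print(chamfer_distance(xyz_a, xyz_b))
    print(reconstruction_loss(xyz_a, xyz_a, rgb_a, rgb_a))
```

The Chamfer distance suits point positions because a reconstructed patch has no canonical point order. The color term here is simpler and needs a pairing between predicted and target points.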