Act3D: 3D Feature Field Transformers for Multi-Task Robotic Manipulation

Theophile Gervet, Zhou Xian, Nikolaos Gkanatsios, Katerina Fragkiadaki

TL;DR

Act3D introduces a language-conditioned transformer that reasons directly in a continuous 3D feature field for multi-task 6-DoF robotic manipulation. It lifts 2D pretrained features into 3D, then iteratively samples 3D points with coarse-to-fine relative attention to build high-resolution action maps, achieving state-of-the-art results on RLBench with reduced compute. Thorough ablations validate the importance of relative 3D attention, 2D feature pretraining, and weight-tied coarse-to-fine stages, and real-world tests demonstrate practical transfer from a single RGB-D camera. Overall, Act3D advances spatially equivariant manipulation by combining 3D feature fields, multi-view perception, and language conditioning to improve generalization and efficiency.

Abstract

3D perceptual representations are well suited for robot manipulation as they easily encode occlusions and simplify spatial reasoning. Many manipulation tasks require high spatial precision in end-effector pose prediction, which typically demands high-resolution 3D feature grids that are computationally expensive to process. As a result, most manipulation policies operate directly in 2D, forgoing 3D inductive biases. In this paper, we introduce Act3D, a manipulation policy transformer that represents the robot's workspace using a 3D feature field with adaptive resolutions dependent on the task at hand. The model lifts 2D pre-trained features to 3D using sensed depth, and attends to them to compute features for sampled 3D points. It samples 3D point grids in a coarse-to-fine manner, featurizes them using relative-position attention, and selects where to focus the next round of point sampling. In this way, it efficiently computes 3D action maps of high spatial resolution. Act3D sets a new state of the art on RLBench, an established manipulation benchmark, where it achieves a 10% absolute improvement over the previous SOTA 2D multi-view policy on 74 RLBench tasks and a 22% absolute improvement with 3x less compute over the previous SOTA 3D policy. We quantify the importance of relative spatial attention, large-scale vision-language pre-trained 2D backbones, and weight tying across coarse-to-fine attentions in ablative experiments. Code and videos are available on our project website: https://act3d.github.io/.
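
The step of lifting 2D features to 3D using sensed depth can be illustrated with a minimal sketch. It assumes a standard pinhole camera model; the function names and the intrinsics parameters (fx, fy, cx, cy) are hypothetical and not taken from the Act3D code. The result stays in the camera frame, whereas a multi-view setup would additionally apply each camera's extrinsics, which this sketch omits.

```python
from typing import Any, List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def unproject_pixel(u: float, v: float, depth: float,
                    fx: float, fy: float, cx: float, cy: float) -> Point3D:
    """Back-project pixel (u, v) with metric depth into the camera frame.

    Assumes a pinhole camera with focal lengths (fx, fy) and
    principal point (cx, cy), all expressed in pixels.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def lift_features(depth_map: Sequence[Sequence[float]],
                  feature_map: Sequence[Sequence[Any]],
                  fx: float, fy: float, cx: float, cy: float
                  ) -> List[Tuple[Point3D, Any]]:
    """Pair every pixel's 2D feature with its back-projected 3D position.

    depth_map[v][u] is the depth at row v, column u; feature_map has the
    same layout. Pixels with non-positive depth are treated as invalid.
    """
    tokens = []
    for v, row in enumerate(depth_map):
        for u, d in enumerate(row):
            if d <= 0:
                continue  # skip missing or invalid depth readings
            point = unproject_pixel(u, v, d, fx, fy, cx, cy)
            tokens.append((point, feature_map[v][u]))
    return tokens
```

Each returned pair is a feature token tagged with a 3D position, which is the kind of input that relative-position attention over sampled 3D points would consume.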

Paper Structure (33 sections, 4 equations, 8 figures, 3 tables)

Figures (8)

  • Figure 1: Act3D is a language-conditioned robot action transformer that learns 3D scene feature fields of arbitrary spatial resolution via recurrent coarse-to-fine 3D point sampling and featurization using relative-position attention. Act3D featurizes multi-view RGB images with a pre-trained 2D CLIP backbone and lifts them to 3D using sensed depth. It predicts the 3D location of the end-effector by classifying the 3D points of the robot's workspace, which preserves the spatial equivariance of the scene-to-action mapping.
  • Figure 2: Tasks. We conduct experiments on 92 simulated tasks in RLBench (James et al., 2020) (only 10 shown), and 8 real-world tasks (only 5 shown).
  • Figure 3: Single-task performance. On 74 RLBench tasks across 9 categories, Act3D reaches an 83% success rate, an absolute improvement of 10% over InstructRL (Liu et al., 2022), the prior SOTA in this setting.
  • Figure 5: Real-world setup.
  • Figure 7: PerAct (Shridhar et al., 2023) tasks. We adopt the multi-task multi-variation setting from PerAct, with 18 tasks and 249 unique variations across object placement, color, size, category, count, and shape.
  • ...and 3 more figures
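
The coarse-to-fine point sampling described in the Figure 1 caption can be sketched as a simple loop: sample a grid of 3D points, pick the best-scoring one, then resample a smaller grid around it. This is only an illustration under stated assumptions. In Act3D the scores come from a learned transformer with relative-position attention; here `score` is a hypothetical stand-in function, and the grid size, number of stages, and shrink factor are arbitrary choices rather than the paper's settings.

```python
from itertools import product
from typing import Callable, List, Tuple

Point3D = Tuple[float, float, float]


def sample_grid(center: Point3D, half_extent: float, n: int) -> List[Point3D]:
    """Return n**3 evenly spaced points in a cube around center (n >= 2)."""
    offsets = [-half_extent + 2.0 * half_extent * i / (n - 1) for i in range(n)]
    return [(center[0] + dx, center[1] + dy, center[2] + dz)
            for dx, dy, dz in product(offsets, repeat=3)]


def coarse_to_fine(score: Callable[[Point3D], float],
                   center: Point3D, half_extent: float,
                   n: int = 5, stages: int = 3,
                   shrink: float = 0.25) -> Point3D:
    """Refine a 3D location by repeated grid sampling and selection.

    score is a placeholder for the learned per-point classifier; the
    highest-scoring point becomes the center of the next, finer grid.
    """
    for _ in range(stages):
        candidates = sample_grid(center, half_extent, n)
        center = max(candidates, key=score)  # classification over points
        half_extent *= shrink                # zoom in for the next stage
    return center


if __name__ == "__main__":
    target = (0.3, -0.2, 0.1)  # hypothetical ground-truth position

    def toy_score(p: Point3D) -> float:
        # Higher is better: negative squared distance to the target.
        return -sum((a - b) ** 2 for a, b in zip(p, target))

    print(coarse_to_fine(toy_score, (0.0, 0.0, 0.0), 1.0))
```

With these toy settings, three stages evaluate 3 x 125 = 375 points, whereas a single dense grid with the same final spacing (0.03125 over a cube of side 2) would need 65 points per axis, or 274,625 points in total.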