FP3: A 3D Foundation Policy for Robotic Manipulation

Rujia Yang, Geng Chen, Chuan Wen, Yang Gao

TL;DR

FP3 introduces a 3D foundation policy for robotic manipulation by leveraging a 1.3B diffusion Transformer that fuses two-view point clouds, language, and proprioception. Pre-trained on 60k DROID trajectories, FP3 achieves data-efficient fine-tuning and strong zero-shot generalization to unseen objects and environments, outperforming 2D baselines. The model's effectiveness is demonstrated on real-robot tasks with limited demonstrations, with ablations confirming the value of 3D inputs, scale, and diverse pre-training data. Limitations include the need for larger 3D pre-training datasets and more advanced language conditioning, suggesting future work integrating 2D features and richer VLMs.

Abstract

Following their success in natural language processing and computer vision, foundation models pre-trained on large-scale multi-task datasets have also shown great potential in robotics. However, most existing robot foundation models rely solely on 2D image observations and ignore 3D geometric information, which is essential for robots to perceive and reason about the 3D world. In this paper, we introduce FP3, a first large-scale 3D foundation policy model for robotic manipulation. FP3 builds on a scalable diffusion transformer architecture and is pre-trained on 60k trajectories with point cloud observations. Thanks to this model design and the diverse pre-training data, FP3 can be efficiently fine-tuned for downstream tasks while exhibiting strong generalization capabilities. Experiments on real robots demonstrate that with only 80 demonstrations, FP3 learns a new task with a success rate above 90% in novel environments with unseen objects, significantly surpassing existing robot foundation models.

Paper Structure

This paper contains 24 sections, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Overview of 3D Foundation Policy (FP3), a 1.3B 3D point cloud-based language-visuomotor policy pre-trained on 60k episodes from the DROID dataset. FP3 supports data-efficient fine-tuning for downstream tasks, while demonstrating superior generalizability to unseen environments and novel objects.
  • Figure 2: FP3 architecture. Each camera view's point cloud observation $\mathbf{P}_t^i$ (with a history length of two) is encoded with a Uni3D ViT-L encoder. The language instruction $\ell_t$ is embedded with a frozen CLIP model. The Transformer encoder fuses the multi-modal input embeddings into latent tokens, while the Transformer decoder takes in the noised actions and uses adaLN blocks to integrate the latent tokens generated by the encoder, predicting denoised action chunks.
  • Figure 3: Task illustrations. We evaluate our model on four downstream tasks: Fold Towel, Clean Table, Stand up Cup, and Pour Water.
  • Figure 4: Visualizations of post-training environments and in-the-wild evaluations. The green boxes represent successful steps, while the red boxes represent failed ones. FP3 generalizes well to all unseen environments and new objects, while Diffusion Policy often fails to recognize the target object or misses the target position.
  • Figure 5: Generalization evaluation. We evaluate FP3 and baseline policies on a diverse set of tasks, covering different axes of generalization, including lighting, camera view, distractor, object and background. FP3 achieves outstanding performance in all generalization evaluation settings.
  • ...and 4 more figures
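
The Figure 2 caption mentions adaLN blocks as the way the decoder integrates the encoder's latent tokens. As a rough illustration of the general adaptive layer norm idea (not FP3's actual implementation), the sketch below normalizes each action token and then applies a scale and shift predicted from a conditioning vector. The function names, the tiny dimensions, and the use of a single pooled conditioning vector are all assumptions made for illustration; the summary does not say how FP3 pools or projects its latent tokens.

```python
import math


def layer_norm(token, eps=1e-6):
    """Normalize one token (a list of floats) to zero mean and unit variance."""
    mean = sum(token) / len(token)
    var = sum((v - mean) ** 2 for v in token) / len(token)
    return [(v - mean) / math.sqrt(var + eps) for v in token]


def linear(vec, weight, bias):
    """Compute vec @ weight + bias, with weight given as a list of rows."""
    return [
        sum(vec[i] * weight[i][j] for i in range(len(vec))) + bias[j]
        for j in range(len(bias))
    ]


def adaln_modulate(tokens, cond, weight, bias):
    """Adaptive layer norm: scale and shift are predicted from `cond`.

    tokens: list of action tokens, each of dimension d
    cond:   conditioning vector of dimension c (hypothetical pooled latent)
    weight: c x 2d projection, bias: 2d vector (hypothetical parameters)
    """
    d = len(tokens[0])
    scale_shift = linear(cond, weight, bias)
    scale, shift = scale_shift[:d], scale_shift[d:]
    return [
        [n * (1.0 + s) + t for n, s, t in zip(layer_norm(tok), scale, shift)]
        for tok in tokens
    ]


# Toy inputs: two 4-dimensional action tokens and a 2-dimensional condition.
tokens = [[1.0, 2.0, 3.0, 4.0], [0.5, -0.5, 1.5, -1.5]]
cond = [0.2, -0.1]

# A zero projection gives scale = 0 and shift = 0, so the block reduces to
# plain layer normalization.
zero_w = [[0.0] * 8 for _ in cond]
zero_b = [0.0] * 8
identity_out = adaln_modulate(tokens, cond, zero_w, zero_b)

# A bias that sets every shift entry to 1 moves each normalized value by 1.
shift_b = [0.0] * 4 + [1.0] * 4
shifted_out = adaln_modulate(tokens, cond, zero_w, shift_b)
```

With the zero projection the conditioning has no effect, which makes the role of the predicted scale and shift easy to see: any nonzero projection lets the conditioning vector rescale and offset every normalized action token.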