FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects

Bowen Wen; Wei Yang; Jan Kautz; Stan Birchfield

TL;DR

FoundationPose is a unified framework for 6D pose estimation and tracking of novel objects that works in both model-based and model-free settings. It combines an object-centric neural implicit field for efficient RGBD rendering, a large-scale LLM-aided synthetic data pipeline, and a transformer-based architecture for pose refinement and hierarchical pose ranking, and it generalizes to new objects without fine-tuning. The approach outperforms task-specific baselines across multiple public datasets and achieves results comparable to instance-level methods, while enabling fast tracking via repeated refinement at runtime. This makes it applicable at test time to real-world robotic and AR scenarios in which the available object knowledge (a CAD model or only a few reference images) varies from object to object.

Abstract

We present FoundationPose, a unified foundation model for 6D object pose estimation and tracking, supporting both model-based and model-free setups. Our approach can be instantly applied at test time to a novel object without fine-tuning, as long as its CAD model is given or a small number of reference images are captured. We bridge the gap between these two setups with a neural implicit representation that allows for effective novel view synthesis, keeping the downstream pose estimation modules invariant under the same unified framework. Strong generalizability is achieved via large-scale synthetic training, aided by a large language model (LLM), a novel transformer-based architecture, and a contrastive learning formulation. Extensive evaluation on multiple public datasets involving challenging scenarios and objects indicates that our unified approach outperforms existing methods specialized for each task by a large margin. In addition, it even achieves results comparable to instance-level methods despite the reduced assumptions. Project page: https://nvlabs.github.io/FoundationPose/

Paper Structure (19 sections, 10 equations, 11 figures, 6 tables)

Introduction
Related Work
Approach
Language-aided Data Generation at Scale
Neural Object Modeling
Pose Hypothesis Generation
Pose Selection
Experiments
Dataset and Setup
Metric
Pose Estimation Comparison
Pose Tracking Comparison
Analysis
Conclusion
Performance on BOP Leaderboard
...and 4 more sections

Figures (11)

Figure 1: Our unified framework enables both 6D pose estimation and tracking for novel objects, supporting the model-based and model-free setups. On each of these four tasks, it outperforms prior work specially designed for the task ($\bullet$ indicates RGB-only; $\times$ indicates RGBD, like ours). The metric for each task is explained in detail in the experimental results.
Figure 2: Overview of our framework. To reduce manual effort for large-scale training, we developed a novel synthetic data generation pipeline by leveraging recently emerging techniques and resources, including 3D model databases, large language models, and diffusion models (Sec. \ref{sec:language}). To bridge the gap between the model-free and model-based setups, we leverage an object-centric neural field (Sec. \ref{sec:nerf}) for novel view RGBD rendering for subsequent render-and-compare. For pose estimation, we first initialize global poses uniformly around the object, which are then refined by the refinement network (Sec. \ref{sec:refiner}). Finally, we forward the refined poses to the pose selection module, which predicts their scores. The pose with the best score is selected as output (Sec. \ref{sec:ranking}).
Figure 3: Top: Random texture blending proposed in FS6D \cite{he2022fs6d}. Bottom: Our LLM-aided texture augmentation yields a more realistic appearance. The leftmost column shows the original 3D assets. Text prompts are automatically generated by ChatGPT.
Figure 4: Pose ranking visualization. Our proposed hierarchical comparison leverages the global context among all pose hypotheses for a better overall trend prediction that aligns both shape and texture. The true best pose is annotated with a red circle.
Figure 5: Qualitative comparison of pose estimation on the LINEMOD dataset under the model-free setup. Images are cropped and zoomed in for better visualization.
...and 6 more figures
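
The estimation flow summarized in the Figure 2 caption (initialize global poses uniformly around the object, refine them, score them, and keep the best-scoring pose) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: it handles rotations only, it uses a Fibonacci lattice plus discretized in-plane rotations as one possible way to cover viewpoints uniformly, and `refine_fn` and `score_fn` are hypothetical stand-ins for the refinement network and the pose selection module.

```python
import numpy as np

def fibonacci_sphere(n):
    """Return n roughly uniform unit vectors on the sphere (Fibonacci lattice).
    Assumption: chosen here for simplicity as a uniform viewpoint sampler."""
    i = np.arange(n) + 0.5
    polar = np.arccos(1.0 - 2.0 * i / n)
    azimuth = np.pi * (1.0 + 5.0 ** 0.5) * i
    return np.stack([np.sin(polar) * np.cos(azimuth),
                     np.sin(polar) * np.sin(azimuth),
                     np.cos(polar)], axis=1)

def look_at_rotation(view_dir):
    """Rotation whose z-axis points from a viewpoint toward the object center."""
    z = -view_dir / np.linalg.norm(view_dir)
    up = np.array([0.0, 0.0, 1.0])
    if abs(np.dot(z, up)) > 0.999:      # avoid a degenerate cross product
        up = np.array([0.0, 1.0, 0.0])
    x = np.cross(up, z)
    x /= np.linalg.norm(x)
    y = np.cross(z, x)                  # completes a right-handed frame
    return np.stack([x, y, z], axis=1)

def in_plane_rotation(angle):
    """Rotation by `angle` radians about the viewing (z) axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def make_hypotheses(n_views, n_inplane):
    """Global rotation hypotheses: viewpoints times in-plane rotations."""
    rotations = []
    for view_dir in fibonacci_sphere(n_views):
        base = look_at_rotation(view_dir)
        for k in range(n_inplane):
            rotations.append(base @ in_plane_rotation(2.0 * np.pi * k / n_inplane))
    return np.stack(rotations)

def select_best(hypotheses, refine_fn, score_fn):
    """Refine every hypothesis, score it, and return the highest-scoring pose.
    refine_fn and score_fn are hypothetical placeholders for learned modules."""
    refined = np.stack([refine_fn(R) for R in hypotheses])
    scores = np.array([score_fn(R) for R in refined])
    return refined[int(np.argmax(scores))], scores
```

As a usage example, `select_best(make_hypotheses(8, 4), lambda R: R, lambda R: float(np.trace(R)))` runs the loop with an identity "refiner" and a toy score. In a real render-and-compare system both callables would compare a rendering of the object at the hypothesized pose against the observed RGBD crop.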