FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects

Bowen Wen, Wei Yang, Jan Kautz, Stan Birchfield

TL;DR

FoundationPose proposes a unified framework for 6D pose estimation and tracking of novel objects that works under both model-based and model-free settings. It combines an object-centric neural implicit field for efficient RGBD rendering, a large-scale LLM-augmented synthetic data pipeline, and a transformer-based refinement-plus-hierarchical ranking architecture to achieve strong generalization without fine-tuning. The approach outperforms task-specific baselines across multiple public datasets and maintains competitive performance with instance-level methods, while enabling fast tracking via repeated refinement at runtime. These capabilities offer a scalable, test-time adaptable solution for real-world robotic and AR scenarios where object knowledge varies widely between instances and categories.

Abstract

We present FoundationPose, a unified foundation model for 6D object pose estimation and tracking, supporting both model-based and model-free setups. Our approach can be instantly applied at test-time to a novel object without fine-tuning, as long as its CAD model is given, or a small number of reference images are captured. We bridge the gap between these two setups with a neural implicit representation that allows for effective novel view synthesis, keeping the downstream pose estimation modules invariant under the same unified framework. Strong generalizability is achieved via large-scale synthetic training, aided by a large language model (LLM), a novel transformer-based architecture, and a contrastive learning formulation. Extensive evaluation on multiple public datasets involving challenging scenarios and objects indicates that our unified approach outperforms existing methods specialized for each task by a large margin. In addition, it even achieves comparable results to instance-level methods despite the reduced assumptions. Project page: https://nvlabs.github.io/FoundationPose/

Paper Structure (19 sections, 10 equations, 11 figures, 6 tables)

Figures (11)

  • Figure 1: Our unified framework enables both 6D pose estimation and tracking for novel objects, supporting the model-based and model-free setups. On each of these four tasks, it outperforms prior work specially designed for the task ($\bullet$ indicates RGB-only; $\times$ indicates RGBD, like ours). The metric for each task is explained in detail in the experimental results.
  • Figure 2: Overview of our framework. To reduce manual effort for large-scale training, we developed a novel synthetic data generation pipeline that leverages recently emerging techniques and resources, including a 3D model database, large language models, and diffusion models. To bridge the gap between the model-free and model-based setups, we leverage an object-centric neural field for novel-view RGBD rendering in the subsequent render-and-compare step. For pose estimation, we first initialize global poses uniformly around the object, which are then refined by the refinement network. Finally, we forward the refined poses to the pose selection module, which predicts their scores. The pose with the best score is selected as the output.
  • Figure 3: Top: Random texture blending proposed in FS6D [He et al., 2022]. Bottom: Our LLM-aided texture augmentation yields a more realistic appearance. The leftmost column shows the original 3D assets. Text prompts are automatically generated by ChatGPT.
  • Figure 4: Pose ranking visualization. Our proposed hierarchical comparison leverages the global context among all pose hypotheses for a better overall trend prediction that aligns both shape and texture. The true best pose is annotated with a red circle.
  • Figure 5: Qualitative comparison of pose estimation on the LINEMOD dataset under the model-free setup. Images are cropped and zoomed in for better visualization.
  • ...and 6 more figures
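
The Figure 2 caption describes a hypothesize, refine, and rank loop: poses are initialized uniformly around the object, each is refined, each refined pose is scored, and the best-scoring one is returned. The sketch below illustrates only that control flow and is not the authors' implementation. Everything in it is an assumption made for illustration: the Fibonacci-lattice sampling, the (view direction, in-plane angle) pose representation, the hypothesis counts, and the `toy_refine` and `toy_score` stand-ins for the learned refinement and pose selection networks.

```python
import math


def fibonacci_sphere(n):
    """Return n roughly uniform unit vectors (an assumed sampling scheme)."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for i in range(n):
        z = 1.0 - 2.0 * (i + 0.5) / n           # evenly spaced heights in (-1, 1)
        r = math.sqrt(max(0.0, 1.0 - z * z))    # radius of the circle at height z
        theta = golden_angle * i
        points.append((r * math.cos(theta), r * math.sin(theta), z))
    return points


def init_hypotheses(n_views, n_inplane):
    """Pair every view direction with evenly spaced in-plane rotation angles."""
    step = 2.0 * math.pi / n_inplane
    return [(view, k * step)
            for view in fibonacci_sphere(n_views)
            for k in range(n_inplane)]


def estimate_pose(hypotheses, refine, score):
    """Refine each hypothesis, score the results, and keep the best one."""
    refined = [refine(h) for h in hypotheses]
    scores = [score(h) for h in refined]
    best = max(range(len(refined)), key=scores.__getitem__)
    return refined[best], scores[best]


def toy_refine(hypothesis):
    """Hypothetical stand-in for the refinement network (identity here)."""
    return hypothesis


def toy_score(hypothesis):
    """Hypothetical stand-in for the pose selection network.

    Rewards views close to the +z axis and small in-plane angles.
    """
    view, angle = hypothesis
    return view[2] - abs(angle)


# Illustrative counts only; they are not taken from the paper.
hyps = init_hypotheses(n_views=20, n_inplane=6)
best_pose, best_score = estimate_pose(hyps, toy_refine, toy_score)
```

In a real system `refine` and `score` would compare renderings of the object at each hypothesis against the observed RGBD crop. Per the TL;DR, tracking applies refinement repeatedly at runtime, which in this sketch would presumably mean calling `refine` on the previous frame's pose rather than re-running the full search.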