Object Pose Transformer: Unifying Unseen Object Pose Estimation

Weihang Li; Lorenzo Garattoni; Fabien Despinoy; Nassir Navab; Benjamin Busam

Abstract

Learning model-free object pose estimation for unseen instances remains a fundamental challenge in 3D vision. Existing methods typically fall into two disjoint paradigms: category-level approaches predict absolute poses in a canonical space but rely on predefined taxonomies, while relative pose methods estimate cross-view transformations but cannot recover single-view absolute pose. In this work, we propose Object Pose Transformer (OPT-Pose), a unified feed-forward framework that bridges these paradigms through task factorization within a single model. OPT-Pose jointly predicts depth, point maps, camera parameters, and normalized object coordinates (NOCS) from RGB inputs, enabling both category-level absolute SA(3) pose and unseen-object relative SE(3) pose. Our approach leverages contrastive object-centric latent embeddings for canonicalization without requiring semantic labels at inference time, and uses point maps as a camera-space representation to enable multi-view relative geometric reasoning. Through cross-frame feature interaction and shared object embeddings, our model leverages relative geometric consistency across views to improve absolute pose estimation, reducing ambiguity in single-view predictions. Furthermore, OPT-Pose is camera-agnostic, learning camera intrinsics on-the-fly and supporting optional depth input for metric-scale recovery, while remaining fully functional in RGB-only settings. Extensive experiments on diverse benchmarks (NOCS, HouseCat6D, Omni6DPose, Toyota-Light) demonstrate state-of-the-art performance in both absolute and relative pose estimation tasks within a single unified architecture.
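As an illustration of how predicted NOCS coordinates and camera-space points can yield an absolute pose, the sketch below fits a similarity transform between the two point sets with the standard Umeyama least-squares alignment. This is a minimal sketch under stated assumptions, not the paper's pose solver: the function name `umeyama_similarity` is hypothetical, the NOCS points are assumed to be centered at the origin, and the fit uses a single uniform scale (7-DoF), which is a simplification of the 9-DoF SA(3) pose described in the abstract.

```python
import numpy as np


def umeyama_similarity(src, dst):
    """Least-squares fit of (s, R, t) such that dst ≈ s * R @ src + t.

    src: (N, 3) canonical points (assumed: NOCS coordinates centered at 0).
    dst: (N, 3) corresponding camera-space points (e.g. from a point map).
    Hypothetical helper for illustration; uniform scale only.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    n = src.shape[0]

    # Center both point sets.
    mu_src = src.mean(axis=0)
    mu_dst = dst.mean(axis=0)
    src_c = src - mu_src
    dst_c = dst - mu_dst

    # Cross-covariance between target and source, then its SVD.
    cov = dst_c.T @ src_c / n
    U, S, Vt = np.linalg.svd(cov)

    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    D = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        D[2, 2] = -1.0
    R = U @ D @ Vt

    # Uniform scale from the source variance, then the translation.
    var_src = (src_c ** 2).sum() / n
    s = np.trace(np.diag(S) @ D) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t
```

In practice such a fit is usually wrapped in a robust estimator such as RANSAC, because predicted correspondences contain outliers; that step is omitted here for brevity.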

Paper Structure (27 sections, 8 equations, 4 figures, 5 tables)

Introduction
Related Works
Category-level, Model-free Absolute Pose Estimation.
Feed-forward Geometry Transformers.
Relative Pose Estimation and Point Cloud Registration.
Metric Scale Recovery.
Semantic Priors and Category-agnostic Canonicalization.
Taxonomy of Unseen Pose Estimation and OPT-Pose.
Method
Design and Task Factorization
Problem Formulation
Multiview Geometry and Feature Transformer
Keypoint-level Multi-view Feature Fusion
Canonical Correspondences & Absolute Poses
Relative Poses from Depth and Point Map
...and 12 more sections

Figures (4)

Figure 1: Unified unseen object pose estimation. OPT-Pose utilizes a feed-forward transformer to predict point map, depth, NOCS, and camera parameters. Existing category-level methods predict canonical absolute 9-DoF SA(3) poses (equivalent to Depth + NOCS), but require predefined category labels and calibrated cameras. Relative pose methods align unseen objects across views in 6-DoF SE(3) (equivalent to Pointmap + Depth), but do not support single-view absolute pose prediction. OPT-Pose enables the simultaneous recovery of both unseen-object relative and category-level absolute poses (right-most column) for flexible single or multi-view RGB or RGB-D input, without the need for CAD models or semantic labels.
Figure 2: OPT-Pose overview. A multiview transformer aggregates image tokens and emits predictions from light heads: camera parameters, depth, and point maps for camera-space geometry; a multi-view-keypoint-centric module fuses RGB and 3D features to discover object keypoints, predict NOCS coordinates, and build an object latent embedding. Absolute pose (SA(3)) and relative pose (SE(3)) are recovered in a single forward pass. Optional sensor depth provides metric scale, while the system remains fully functional in RGB-only mode.
Figure 3: Qualitative results for relative pose estimation. We compare against Oryon [corsetti2024open] on different object instances across different scenes. The visualization shows that OPT-Pose can estimate relative object poses across different objects and scenes.
Figure 4: Qualitative results for absolute pose estimation with RGB-D input. We showcase some difficult instances in comparison with AG-Pose [lin2024instance]. We zoom in on the difficult object categories, the shoe and the box, for better visualization.
