Beyond 'Templates': Category-Agnostic Object Pose, Size, and Shape Estimation from a Single View
Jinyu Zhang, Haitao Lin, Jiashu Hou, Xiangyang Xue, Yanwei Fu
TL;DR
This work introduces a unified category-agnostic framework that estimates an object's 6D pose, size, and dense shape from a single RGB-D image without templates, CAD models, or category labels at test time. It fuses dense 2D foundation-model features with partial 3D point clouds in a Transformer encoder augmented by a Mixture-of-Experts, and uses parallel decoders to perform pose-size regression and shape reconstruction in one forward pass. Trained solely on synthetic data from 149 categories in the SOPE dataset, the method achieves state-of-the-art results on seen categories and strong zero-shot generalization to unseen objects across four benchmarks (SOPE, ROPE, ObjaversePose, HANDAL), while running at 28 FPS. The approach advances open-set 6D understanding by delivering real-time, category-agnostic perception suitable for robotics and embodied AI, and it introduces ObjaversePose to enrich synthetic data for category-agnostic estimation.
Abstract
Estimating an object's 6D pose, size, and shape from visual input is a fundamental problem in computer vision, with critical applications in robotic grasping and manipulation. Existing methods either rely on object-specific priors such as CAD models or templates, or suffer from limited generalization across categories due to pose-shape entanglement and multi-stage pipelines. In this work, we propose a unified, category-agnostic framework that simultaneously predicts 6D pose, size, and dense shape from a single RGB-D image, without requiring templates, CAD models, or category labels at test time. Our model fuses dense 2D features from vision foundation models with partial 3D point clouds using a Transformer encoder enhanced by a Mixture-of-Experts, and employs parallel decoders for pose-size estimation and shape reconstruction, achieving real-time inference at 28 FPS. Trained solely on synthetic data from 149 categories in the SOPE dataset, our framework is evaluated on four diverse benchmarks (SOPE, ROPE, ObjaversePose, and HANDAL) that together span over 300 categories. It achieves state-of-the-art accuracy on seen categories while demonstrating strong zero-shot generalization to unseen real-world objects, establishing a new standard for open-set 6D understanding in robotics and embodied AI.
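To make the data flow described in the abstract concrete, the following is a minimal NumPy sketch of one forward pass: per-point 2D features are fused with embedded 3D coordinates, passed through a sparse Mixture-of-Experts layer, and read out by two parallel heads. It is an illustration under stated assumptions, not the authors' implementation. All function names, dimensions (16-D 2D features, 8-D point embedding, 4 experts, top-2 routing, a 12-D pose-size vector), the tanh experts, and the mean pooling are hypothetical choices; the self-attention of the Transformer encoder is omitted for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_tokens(feat_2d, points_3d, w_point):
    # One token per point: its 2D feature concatenated with a linear xyz embedding.
    return np.concatenate([feat_2d, points_3d @ w_point], axis=-1)

def moe_layer(tokens, w_gate, experts, top_k=2):
    # Sparse MoE (assumed top-k routing): each token is sent to its top_k experts.
    gate = softmax(tokens @ w_gate)                 # (N, E) routing probabilities
    top = np.argsort(-gate, axis=-1)[:, :top_k]     # (N, k) selected expert ids
    out = np.zeros_like(tokens)
    for e, w_e in enumerate(experts):
        mask = (top == e).any(axis=-1)              # tokens routed to expert e
        if mask.any():
            out[mask] += gate[mask, e:e + 1] * np.tanh(tokens[mask] @ w_e)
    # Renormalise by the gate mass of the selected experts, then add a residual.
    denom = np.take_along_axis(gate, top, axis=-1).sum(axis=-1, keepdims=True)
    return tokens + out / denom

def make_params(rng, d_feat=16, d_point=8, n_experts=4, pose_dim=12):
    # Random weights stand in for trained parameters (hypothetical sizes).
    d = d_feat + d_point
    return {
        "w_point": rng.normal(size=(3, d_point)),
        "w_gate": rng.normal(size=(d, n_experts)),
        "experts": [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(n_experts)],
        "w_pose": rng.normal(size=(d, pose_dim)),
        "w_shape": rng.normal(size=(d, 3)),
    }

def forward(feat_2d, points_3d, params):
    tokens = fuse_tokens(feat_2d, points_3d, params["w_point"])
    tokens = moe_layer(tokens, params["w_gate"], params["experts"])
    # Parallel heads share the encoded tokens and run in the same pass.
    pose_size = tokens.mean(axis=0) @ params["w_pose"]  # global pose-size vector
    shape = tokens @ params["w_shape"]                  # one 3D point per token
    return pose_size, shape

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    params = make_params(rng)
    feat_2d = rng.normal(size=(64, 16))    # stand-in for foundation-model features
    points_3d = rng.normal(size=(64, 3))   # stand-in for the partial point cloud
    pose_size, shape = forward(feat_2d, points_3d, params)
    print(pose_size.shape, shape.shape)
```

Because both heads consume the same encoded tokens, pose-size regression and shape reconstruction come out of a single pass rather than a multi-stage pipeline, which is the property the abstract credits for real-time inference.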
