Beyond 'Templates': Category-Agnostic Object Pose, Size, and Shape Estimation from a Single View

Jinyu Zhang, Haitao Lin, Jiashu Hou, Xiangyang Xue, Yanwei Fu

TL;DR

This work introduces a unified category-agnostic framework that estimates an object’s 6D pose, size, and dense shape from a single RGB-D image without test-time priors. It fuses dense 2D foundation-model features with partial 3D point clouds in a Transformer encoder augmented by a Mixture-of-Experts, and uses parallel decoders to perform pose–size regression and shape reconstruction in one forward pass. Trained solely on synthetic data from 149 SOPE categories, the method achieves state-of-the-art results on seen objects and remarkable zero-shot generalization to unseen objects across four benchmarks (SOPE, ROPE, ObjaversePose, HANDAL), while running at 28 FPS. The approach advances open-set 6D understanding by delivering real-time, category-agnostic perception suitable for robotics and embodied AI, and it introduces ObjaversePose to enrich synthetic data for category-agnostic estimation.

Abstract

Estimating an object's 6D pose, size, and shape from visual input is a fundamental problem in computer vision, with critical applications in robotic grasping and manipulation. Existing methods either rely on object-specific priors such as CAD models or templates, or suffer from limited generalization across categories due to pose-shape entanglement and multi-stage pipelines. In this work, we propose a unified, category-agnostic framework that simultaneously predicts 6D pose, size, and dense shape from a single RGB-D image, without requiring templates, CAD models, or category labels at test time. Our model fuses dense 2D features from vision foundation models with partial 3D point clouds using a Transformer encoder enhanced by a Mixture-of-Experts, and employs parallel decoders for pose-size estimation and shape reconstruction, achieving real-time inference at 28 FPS. Trained solely on synthetic data from 149 categories in the SOPE dataset, our framework is evaluated on four diverse benchmarks (SOPE, ROPE, ObjaversePose, and HANDAL) that together span over 300 categories. It achieves state-of-the-art accuracy on seen categories while demonstrating strong zero-shot generalization to unseen real-world objects, establishing a new standard for open-set 6D understanding in robotics and embodied AI.
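To make the 2D-3D fusion step concrete, the following is a minimal NumPy sketch of one simple way to pair dense image features with a partial point cloud: look up the feature vector at the pixel each 3D point came from and concatenate it with the point's coordinates. The function name `fuse_points_with_features`, the `(row, col)` pixel indexing, and the assumption that the feature map is already at crop resolution are illustrative choices, not details taken from the paper.

```python
import numpy as np


def fuse_points_with_features(feat_map, pixel_rc, points_xyz):
    """Concatenate per-point 2D features with 3D coordinates.

    feat_map:   (H, W, C) dense 2D feature map (assumed to be at crop resolution).
    pixel_rc:   (N, 2) integer (row, col) of the pixel each point was back-projected from.
    points_xyz: (N, 3) partial point cloud in the camera frame.
    Returns an (N, 3 + C) array of fused per-point features.
    """
    rows, cols = pixel_rc[:, 0], pixel_rc[:, 1]
    per_point_feat = feat_map[rows, cols]  # (N, C) gather via integer indexing
    return np.concatenate([points_xyz, per_point_feat], axis=1)


# Hypothetical toy input: a 4x4 feature map with 8 channels and 5 points.
rng = np.random.default_rng(0)
feat_map = rng.normal(size=(4, 4, 8))
pixel_rc = np.array([[0, 0], [1, 2], [3, 3], [2, 1], [0, 3]])
points_xyz = rng.normal(size=(5, 3))
fused = fuse_points_with_features(feat_map, pixel_rc, points_xyz)
```

In a full pipeline the fused array would be passed to a point-cloud network to form object tokens; that stage is omitted here.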

Paper Structure

This paper contains 19 sections, 5 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Results on diverse domain datasets using our end-to-end regression-based framework. Trained exclusively on a large synthetic dataset, our model generalizes effectively to unseen object categories across multiple real-world domains, including daily-life scenes, autonomous driving, robotic manipulation, and egocentric video data.
  • Figure 2: Framework Overview. Given a cropped RGB image and its corresponding segmented point cloud, the model first extracts dense 2D features using RADIOv2.5 [heinrich2025radiov2], which are concatenated with 3D point coordinates. A DGCNN processes the fused input to produce keypoint-aware features, forming object tokens. These tokens are passed through a Transformer encoder with a Mixture-of-Experts (MoE) module to produce a global object representation. Two parallel decoder branches predict (i) the 6D pose and size via direct regression, and (ii) the object shape in two stages: a coarse shape prediction followed by refinement using fused points. The entire pipeline is fully end-to-end and operates in real time.
  • Figure 3:
  • Figure 4: Qualitative results on ROPE. We show the input RGB image, ground-truth pose, poses from GenPose++ and ours, and a comparison between the predicted and ground-truth shapes.
  • Figure 5: Some specular and transparent objects from ROPE (top) and SOPE (bottom).
  • ...and 3 more figures
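
The Mixture-of-Experts module named in Figure 2 can be illustrated with a small NumPy sketch. This excerpt does not describe the routing scheme, so the version below assumes softmax gating with top-k expert selection and single-layer ReLU experts. The names `moe_layer`, `gate_w`, and `expert_ws` are hypothetical, and the real model may differ in every one of these choices.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def moe_layer(tokens, gate_w, expert_ws, top_k=2):
    """Route each token to its top-k experts and mix their outputs.

    tokens:    (N, D) object tokens.
    gate_w:    (D, E) gating weights (assumed linear gate).
    expert_ws: (E, D, D) one weight matrix per expert (assumed ReLU experts).
    Returns the (N, D) mixed output and the (N, top_k) chosen expert indices.
    """
    probs = softmax(tokens @ gate_w)                      # (N, E) gate probabilities
    top = np.argsort(-probs, axis=1)[:, :top_k]           # (N, k) best experts per token
    top_p = np.take_along_axis(probs, top, axis=1)        # (N, k) their probabilities
    top_p = top_p / top_p.sum(axis=1, keepdims=True)      # renormalize over chosen experts

    out = np.zeros_like(tokens)
    for slot in range(top_k):
        for e in range(expert_ws.shape[0]):
            mask = top[:, slot] == e                      # tokens whose slot-th choice is expert e
            if mask.any():
                expert_out = np.maximum(tokens[mask] @ expert_ws[e], 0.0)
                out[mask] += top_p[mask, slot:slot + 1] * expert_out
    return out, top


# Hypothetical toy input: 5 tokens of width 4 routed over 3 identical experts.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 4))
gate_w = rng.normal(size=(4, 3))
shared_w = rng.normal(size=(4, 4))
expert_ws = np.stack([shared_w, shared_w, shared_w])
mixed, chosen = moe_layer(tokens, gate_w, expert_ws, top_k=2)
```

Because the gate weights are renormalized over the selected experts, using identical experts (as in the toy input) reduces the layer to a single ReLU projection, which is a convenient sanity check for the routing logic.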