OMNI-PoseX: A Fast Vision Model for 6D Object Pose Estimation in Embodied Tasks

Michael Zhang, Wei Ying, Fangwen Chen, Shifeng Bai, Hanwen Kang

Abstract

Accurate 6D object pose estimation is a fundamental capability for embodied agents, yet it remains highly challenging in open-world environments. Many existing methods rely on closed-set assumptions or geometry-agnostic regression schemes, which limit their generalization, stability, and real-time applicability in robotic systems. We present OMNI-PoseX, a vision foundation model built on a novel network architecture that unifies open-vocabulary perception with an SO(3)-aware reflected flow matching pose predictor. The architecture decouples object-level understanding from geometry-consistent rotation inference and employs a lightweight multi-modal fusion strategy that conditions rotation-sensitive geometric features on compact semantic embeddings, enabling efficient and stable 6D pose estimation. To enhance robustness and generalization, the model is trained on large-scale 6D pose datasets, leveraging broad object diversity, viewpoint variation, and scene complexity to build a scalable open-world pose backbone. Comprehensive evaluations spanning benchmark pose estimation, ablation studies, zero-shot generalization, and system-level robotic grasping integration demonstrate the effectiveness of OMNI-PoseX: it achieves state-of-the-art pose accuracy at real-time speed while delivering geometrically consistent predictions that enable reliable grasping of diverse, previously unseen objects.
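
To make the conditioning idea concrete, the sketch below pairs a FiLM-style fusion module (modulating rotation-sensitive geometric features with a per-channel scale and shift derived from a compact semantic embedding) with a rotation head that predicts a velocity in so(3) and integrates it over a few Euler steps. This is a minimal illustration under our own assumptions, not the OMNI-PoseX implementation: the module names, feature dimensions, FiLM-style conditioning, and the eight-step integration are hypothetical, and the reflected-flow component on SO(3) is not reproduced here.

```python
# Illustrative sketch only: FiLM-style semantic conditioning plus an Euler-integrated
# flow-matching rotation head. All names and dimensions are assumptions for exposition.
import torch
import torch.nn as nn


class SemanticConditionedFusion(nn.Module):
    """Condition geometric features on a compact semantic embedding (FiLM-style)."""

    def __init__(self, geom_dim: int = 256, sem_dim: int = 64):
        super().__init__()
        # Map the semantic embedding to per-channel scale and shift.
        self.to_scale_shift = nn.Linear(sem_dim, 2 * geom_dim)

    def forward(self, geom_feat: torch.Tensor, sem_emb: torch.Tensor) -> torch.Tensor:
        scale, shift = self.to_scale_shift(sem_emb).chunk(2, dim=-1)
        return geom_feat * (1.0 + scale) + shift


class RotationFlowHead(nn.Module):
    """Predict a velocity in so(3); integrating it moves a rotation hypothesis toward the target."""

    def __init__(self, geom_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(geom_dim + 9 + 1, 256), nn.SiLU(), nn.Linear(256, 3)
        )

    def forward(self, fused: torch.Tensor, rot: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # fused: (B, geom_dim), rot: (B, 3, 3) current hypothesis, t: (B, 1) flow time in [0, 1].
        x = torch.cat([fused, rot.flatten(1), t], dim=-1)
        return self.mlp(x)  # (B, 3) axis-angle velocity


def so3_exp(v: torch.Tensor) -> torch.Tensor:
    """Exponential map from axis-angle vectors to rotation matrices (Rodrigues' formula)."""
    theta = v.norm(dim=-1, keepdim=True).clamp_min(1e-8)
    k = v / theta
    K = torch.zeros(v.shape[0], 3, 3, device=v.device)
    K[:, 0, 1], K[:, 0, 2] = -k[:, 2], k[:, 1]
    K[:, 1, 0], K[:, 1, 2] = k[:, 2], -k[:, 0]
    K[:, 2, 0], K[:, 2, 1] = -k[:, 1], k[:, 0]
    theta = theta.unsqueeze(-1)
    I = torch.eye(3, device=v.device).expand_as(K)
    return I + torch.sin(theta) * K + (1 - torch.cos(theta)) * (K @ K)


# Minimal inference loop: start from the identity and integrate the predicted velocity.
fusion, head = SemanticConditionedFusion(), RotationFlowHead()
geom, sem = torch.randn(2, 256), torch.randn(2, 64)
rot = torch.eye(3).expand(2, 3, 3).contiguous()
for step in range(8):
    t = torch.full((2, 1), step / 8)
    v = head(fusion(geom, sem), rot, t)
    rot = so3_exp(v / 8) @ rot  # left-multiply the incremental rotation
```

The decoupling claimed in the abstract is reflected in the sketch only insofar as the semantic branch enters solely through the scale and shift, while the rotation update operates on SO(3) directly; the actual predictor described by the authors may differ substantially.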

Figures (5)

  • Figure 1: OMNI-PoseX is a vision foundation model for 6D pose estimation in embodied tasks. It predicts object categories, 3D bounding boxes, and 6D poses. Trained on large-scale open-source datasets, OMNI-PoseX generalizes robustly across objects and embodied scenarios.
  • Figure C1: Network architecture of OMNI-PoseX.
  • Figure C2: Qualitative results of our model: (a) the original RGB image; (b) the mask segmentation result; (c) the OMNI-PoseX prediction result.
  • Figure D1: Samples of unseen objects in Isaac Sim.
  • Figure D2: Real-world demonstrations of OMNI-PoseX in daily manipulation tasks. Top: Category-level zero-shot grasping across unseen object instances, validating cross-category generalization under geometric and appearance variations. Bottom: Articulated object manipulation (drawer and cabinet door), requiring stable 6D pose tracking to support constrained motion and contact-consistent execution.