HIPPo: Harnessing Image-to-3D Priors for Model-free Zero-shot 6D Pose Estimation

Yibo Liu, Zhaodong Jiang, Binbin Xu, Guile Wu, Yuan Ren, Tongtong Cao, Bingbing Liu, Rui Heng Yang, Amir Rasouli, Jinjun Shan

TL;DR

HIPPo tackles model-free zero-shot 6D pose estimation for unseen objects by leveraging image-to-3D priors from diffusion models. It introduces HIPPo Dreamer, an instant image-to-3D mesh generator based on Wonder3D and MASt3R, with a scale recovery step to align the generated mesh to real objects. A measurement-guided mesh update replaces diffusion priors with online observations, while a FoundationPose-based pose refinement network estimates the 6D pose from rendered and observed RGB-D data. A mesh-update module triggers updates at key viewpoints to improve fidelity, enabling accurate pose estimation from a single first frame and robust tracking for immediate robotic use. On YCB-Video and LM-O, HIPPo outperforms state-of-the-art methods when prior reference images are scarce, while maintaining a complete, ready-to-use 3D model for robotic tasks.
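
The summary mentions a scale recovery step but does not spell it out. As a minimal sketch of what such a step could look like (not the authors' procedure), the snippet below assumes the generated mesh lives in an arbitrary normalized scale, that metric object points are obtained by back-projecting the masked depth of the first frame, and that the two point sets cover comparable surface regions. The function names `backproject`, `robust_extent`, and `estimate_scale` are hypothetical.

```python
import numpy as np

def backproject(depth, mask, K):
    """Lift masked depth pixels (in metres) to 3D points in the camera frame."""
    v, u = np.nonzero(mask & (depth > 0))      # rows (v) and columns (u) of valid pixels
    z = depth[v, u]
    x = (u - K[0, 2]) * z / K[0, 0]            # pinhole model: x = (u - cx) * z / fx
    y = (v - K[1, 2]) * z / K[1, 1]            # pinhole model: y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)

def robust_extent(points):
    """Median distance to the centroid: a size measure that tolerates a few outliers."""
    centered = points - points.mean(axis=0)
    return np.median(np.linalg.norm(centered, axis=1))

def estimate_scale(observed_pts, generated_pts):
    """Assumed heuristic: ratio of metric extent to generated-mesh extent."""
    return robust_extent(observed_pts) / robust_extent(generated_pts)
```

Multiplying the generated mesh vertices by the returned factor brings them to metric size under these assumptions. Because a depth image only sees the visible surface, the generated points should be sampled from the same viewpoint before the two extents are compared.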

Abstract

This work focuses on model-free zero-shot 6D object pose estimation for robotics applications. While existing methods can estimate the precise 6D pose of objects, they heavily rely on curated CAD models or reference images, the preparation of which is a time-consuming and labor-intensive process. Moreover, in real-world scenarios, 3D models or reference images may not be available in advance and instant robot reaction is desired. In this work, we propose a novel framework named HIPPo, which eliminates the need for curated CAD models and reference images by harnessing image-to-3D priors from Diffusion Models, enabling model-free zero-shot 6D pose estimation. Specifically, we construct HIPPo Dreamer, a rapid image-to-mesh model built on a multiview Diffusion Model and a 3D reconstruction foundation model. Our HIPPo Dreamer can generate a 3D mesh of any unseen object from a single glance in just a few seconds. Then, as more observations are acquired, we propose to continuously refine the diffusion prior mesh model by joint optimization of object geometry and appearance. This is achieved by a measurement-guided scheme that gradually replaces the plausible diffusion priors with more reliable online observations. Consequently, HIPPo can instantly estimate and track the 6D pose of a novel object and maintain a complete mesh for immediate robotic applications. Thorough experiments on various benchmarks show that HIPPo outperforms state-of-the-art methods in 6D object pose estimation when prior reference images are limited.
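
To make the idea of gradually replacing diffusion priors with online observations concrete, here is a deliberately simplified sketch. It does not reproduce the joint geometry and appearance optimization described in the abstract; it swaps in a per-vertex weighted running average, where the diffusion prior starts with a low weight and each measurement adds weight. The class `VertexAppearance` and the weights `prior_weight` and `obs_weight` are assumptions made for illustration.

```python
import numpy as np

class VertexAppearance:
    """Per-vertex colours that drift from a generated prior towards measurements."""

    def __init__(self, prior_colors, prior_weight=1.0):
        self.colors = np.asarray(prior_colors, dtype=float).copy()
        # A low initial weight marks the prior as plausible but unverified.
        self.weights = np.full(len(self.colors), prior_weight)

    def fuse(self, vertex_ids, observed_colors, obs_weight=4.0):
        """Blend new measurements into the listed vertices (ids assumed unique)."""
        ids = np.asarray(vertex_ids)
        obs = np.asarray(observed_colors, dtype=float)
        w_old = self.weights[ids][:, None]
        # Weighted average: the more often a vertex is observed, the less the prior counts.
        self.colors[ids] = (w_old * self.colors[ids] + obs_weight * obs) / (w_old + obs_weight)
        self.weights[ids] += obs_weight
```

Vertices that are never observed keep their generated colour, so the mesh stays complete, while repeatedly observed vertices converge to the measured values. The same weighting pattern could be applied to vertex positions, again only as an assumed stand-in for the optimization the paper describes.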

Paper Structure

This paper contains 15 sections, 2 equations, 12 figures, 7 tables.

Figures (12)

  • Figure 1: Compared to existing SOTA 6D pose estimation methods (FoundationPose, SAM-6D, GigaPose), HIPPo eliminates the need for a textured 3D model or reference images in advance, while also optimizing the reference 3D model online. Compared to existing object SLAM methods (Xu et al., 2019; BundleSDF), HIPPo sustains a complete 3D model from the first glance of the object, enabling immediate robotic applications.
  • Figure 2: Overview of HIPPo. Given a video consisting of RGB-D frames, Grounding DINO is first applied to segment the object based on a prompt. Next, the proposed HIPPo Dreamer, built on a multiview Diffusion Model and a 3D reconstruction foundation model, generates a 3D mesh of the object from the first detected frame in a few seconds. Then, the diffusion prior mesh is provided to the pose estimation network to estimate the 6D pose in real time. Meanwhile, the mesh optimization module monitors viewpoint changes through a predefined viewpoint sphere and triggers mesh optimization when the viewpoint varies dramatically. The module then replaces the diffusion prior with more reliable appearance and geometry from online measurements.
  • Figure 3: Comparison of the vanilla MASt3R (a)(b)(c) and our modified MASt3R (d). (a): A low threshold preserves too many background points. (b): A high threshold results in an incomplete model by masking out some foreground object points. (c): Even with careful fine-tuning, artifacts may remain around the object, affecting the judgment of its scale. (d): Our modified MASt3R generates artifact-free 3D models without requiring fine-tuning of the hyperparameter.
  • Figure 4: An illustration of the HIPPo Dreamer pipeline.
  • Figure 5: An illustration of the viewpoint sphere. Each circle on the sphere represents a viewpoint. By monitoring the current viewpoint on the sphere, key frames representing dramatic viewpoint changes are recognized, triggering a mesh update at these key frames.
  • ...and 7 more figures
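
The key-frame trigger described in the Figure 5 caption can be sketched as follows. This is an illustrative reading, not the authors' implementation: it assumes the predefined viewpoints are spread over the unit sphere with a Fibonacci lattice, that the estimated pose is given as an object-to-camera rotation `R` and translation `t`, and that a frame counts as a key frame when its nearest viewpoint has not been visited before. The names `fibonacci_sphere`, `ViewpointSphere`, and `n_views` are hypothetical.

```python
import numpy as np

def fibonacci_sphere(n):
    """Return n roughly evenly spaced unit vectors (assumed viewpoint layout)."""
    i = np.arange(n) + 0.5
    polar = np.arccos(1.0 - 2.0 * i / n)
    azimuth = np.pi * (1.0 + 5.0 ** 0.5) * i
    return np.stack([np.cos(azimuth) * np.sin(polar),
                     np.sin(azimuth) * np.sin(polar),
                     np.cos(polar)], axis=1)

class ViewpointSphere:
    """Tracks which predefined viewpoints around the object have been seen."""

    def __init__(self, n_views=64):
        self.views = fibonacci_sphere(n_views)
        self.visited = np.zeros(n_views, dtype=bool)

    def is_key_frame(self, R, t):
        """R, t map object coordinates to camera coordinates: x_cam = R @ x_obj + t."""
        cam_center = -R.T @ t                          # camera position in the object frame
        direction = cam_center / np.linalg.norm(cam_center)
        idx = int(np.argmax(self.views @ direction))   # nearest predefined viewpoint
        if self.visited[idx]:
            return False
        self.visited[idx] = True                       # first visit: request a mesh update
        return True
```

A tracking loop would call `is_key_frame` with each estimated pose and run the mesh update only when it returns `True`, so small camera motions within an already visited viewpoint do not trigger repeated updates.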