ActivePose: Active 6D Object Pose Estimation and Tracking for Robotic Manipulation

Sheng Liu, Zhe Li, Weiheng Wang, Han Sun, Heng Zhang, Hongpeng Chen, Yusen Qin, Arash Ajoudani, Yizhao Wang

TL;DR

ActivePose is an active pose estimation pipeline that combines a Vision-Language Model (VLM) with "robotic imagination" to detect and resolve pose ambiguities in real time. The authors report that it significantly outperforms classical baselines in simulation and real-world experiments.

Abstract

Accurate 6-DoF object pose estimation and tracking are critical for reliable robotic manipulation. However, zero-shot methods often fail under viewpoint-induced ambiguities, and fixed-camera setups struggle when objects move or become self-occluded. To address these challenges, we propose an active pose estimation pipeline that combines a Vision-Language Model (VLM) with "robotic imagination" to dynamically detect and resolve ambiguities in real time. In an offline stage, we render a dense set of views of the CAD model, compute the FoundationPose entropy for each view, and construct a geometry-aware prompt that includes low-entropy (unambiguous) and high-entropy (ambiguous) examples. At runtime, the system (1) queries the VLM on the live image for an ambiguity score, and (2) if ambiguity is detected, imagines a discrete set of candidate camera poses by rendering virtual views, scores each with a weighted combination of VLM ambiguity probability and FoundationPose entropy, and then moves the camera to the Next-Best-View (NBV) to obtain a disambiguated pose estimate. Furthermore, since moving objects may leave the camera's field of view, we introduce an active pose tracking module: a diffusion policy trained via imitation learning that generates camera trajectories which preserve object visibility and minimize pose ambiguity. Experiments in simulation and in the real world show that our approach significantly outperforms classical baselines.
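The view-scoring step described in the abstract can be illustrated with a minimal sketch. The abstract only states that each imagined view is scored by a weighted combination of VLM ambiguity probability and FoundationPose entropy. Everything else here is an assumption made for illustration: the softmax over hypothesis scores, the normalization of entropy to [0, 1], the single weight `w`, the choice of the lowest fused score as the NBV, and the names `Candidate`, `hypothesis_entropy`, `fused_score`, and `select_nbv`.

```python
import math
from dataclasses import dataclass


def hypothesis_entropy(scores):
    """Shannon entropy (in nats) of softmax-normalized hypothesis scores.

    Assumption: pose-hypothesis scores are turned into a distribution with
    a softmax; the paper's exact entropy definition may differ.
    """
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # subtract max for stability
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


@dataclass
class Candidate:
    name: str          # hypothetical label for an imagined camera pose
    p_amb: float       # VLM ambiguity probability in [0, 1]
    hyp_scores: list   # pose-hypothesis scores for the rendered view


def fused_score(c, w=0.5):
    """Weighted sum of VLM ambiguity and normalized entropy (lower is better)."""
    h_max = math.log(len(c.hyp_scores))  # entropy of a uniform distribution
    h_norm = hypothesis_entropy(c.hyp_scores) / h_max if h_max > 0 else 0.0
    return w * c.p_amb + (1.0 - w) * h_norm


def select_nbv(candidates, w=0.5):
    """Pick the candidate view with the lowest fused ambiguity score."""
    return min(candidates, key=lambda c: fused_score(c, w))


views = [
    Candidate("front", p_amb=0.9, hyp_scores=[1.0, 1.0, 1.0, 1.0]),
    Candidate("side", p_amb=0.1, hyp_scores=[5.0, 0.0, 0.0, 0.0]),
]
best = select_nbv(views)
```

In this toy example the "front" view has uniformly scored hypotheses (maximal entropy) and a high VLM ambiguity, so the "side" view is selected. Normalizing entropy keeps the two terms on a comparable scale, which makes the weight `w` easier to interpret.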

Paper Structure

This paper contains 16 sections, 4 equations, 6 figures, 5 tables, 2 algorithms.

Figures (6)

  • Figure 1: Dual-arm experimental setup. The left arm serves as the sensing arm and carries a wrist-mounted RGB-D camera, while the right arm serves as the manipulation arm equipped with a parallel-jaw gripper.
  • Figure 2: Active pose estimation. (a) Offline: render canonical CAD views, compute the hypothesis entropy of FoundationPose, and build a geometry-aware prompt from low-/high-entropy exemplars. (b) Online: compute VLM ambiguity $p_{\mathrm{amb}}$ for the current view and trigger disambiguation when $p_{\mathrm{amb}}>\tau$. (c) Rank feasible candidate views using rendered imagined observations and the fused score, execute the selected NBV, and repeat up to budget $L$.
  • Figure 3: Active pose tracking. The policy encodes the current observation $O_t$, denoises over $K_d$ reverse-diffusion steps to generate a horizon of continuous SE(3) poses, and executes the last $k_h$ poses in a receding-horizon loop.
  • Figure 4: CAD models of the experimental objects (Obj. 1--3 from MP6D; Obj. 4 for assembly) and an example ambiguous view.
  • Figure 5: Example of active pose estimation. An ambiguous initial view triggers NBV selection; moving to the selected viewpoint yields an unambiguous 6D pose estimate, shown in simulation (top) and real-robot trials (bottom).
  • ...and 1 more figure
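The online loop summarized in the Figure 2 caption (trigger disambiguation when $p_{\mathrm{amb}}>\tau$, move to the best-ranked feasible view, repeat up to budget $L$) can be sketched as follows. This is a hedged illustration, not the authors' implementation: the function `active_estimation_loop` and the callables `vlm_ambiguity`, `candidates_for`, and `score_view` are hypothetical stand-ins for the VLM query, the feasible-view generator, and the fused score, and the default values of `tau` and `budget` are arbitrary.

```python
def active_estimation_loop(start_view, vlm_ambiguity, candidates_for,
                           score_view, tau=0.5, budget=3):
    """Move the camera until the view is unambiguous or the budget is spent.

    vlm_ambiguity(view) -> p_amb in [0, 1]          (hypothetical VLM query)
    candidates_for(view) -> list of feasible views  (hypothetical)
    score_view(view) -> fused score, lower is better (hypothetical)
    Returns the final view and the number of camera moves executed.
    """
    view = start_view
    moves = 0
    while moves < budget and vlm_ambiguity(view) > tau:
        options = candidates_for(view)
        if not options:
            break  # no feasible candidate view; keep the current one
        view = min(options, key=score_view)  # next-best-view
        moves += 1
    return view, moves


# Toy stand-ins: a fixed ambiguity value per named viewpoint.
ambiguity = {"front": 0.9, "side": 0.6, "top": 0.1}
final_view, n_moves = active_estimation_loop(
    "front",
    vlm_ambiguity=ambiguity.get,
    candidates_for=lambda v: [x for x in ambiguity if x != v],
    score_view=ambiguity.get,
)
```

Starting from the ambiguous "front" view, the loop moves once to "top" and then stops because the ambiguity falls below the threshold. With `budget=0` it returns the starting view without moving.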