UniHOPE: A Unified Approach for Hand-Only and Hand-Object Pose Estimation

Yinqiao Wang, Hao Xu, Pheng-Ann Heng, Chi-Wing Fu

TL;DR

UniHOPE tackles monocular 3D hand-object pose estimation by unifying hand-only and hand-object scenarios in a single framework. It introduces an internal object switcher to dynamically decide whether to estimate object pose and a grasp-aware fusion module to selectively leverage object cues based on grasping status. To improve robustness under occlusion, it uses a diffusion-based generative de-occluder to create paired de-occluded hands and applies multi-level feature enhancement with self-distillation to learn occlusion-invariant representations. Extensive experiments on DexYCB, HO3D, and FreiHAND demonstrate state-of-the-art performance across hand-only, hand-object, and occlusion settings, highlighting strong generalization and practical impact for AR/VR and HCI applications.

Abstract

Estimating the 3D pose of a hand and a potential hand-held object from monocular images is a longstanding challenge. Yet existing methods are specialized, focusing on either a bare hand or a hand interacting with an object. No existing method flexibly handles both scenarios, and the performance of each degrades when it is applied to the other scenario. In this paper, we propose UniHOPE, a unified approach for general 3D hand-object pose estimation that flexibly adapts to both scenarios. Technically, we design a grasp-aware feature fusion module to integrate hand-object features, together with an object switcher that dynamically controls the hand-object pose estimation according to grasping status. Further, to improve the robustness of hand pose estimation regardless of object presence, we generate realistic de-occluded image pairs to train the model to handle object-induced hand occlusions, and formulate multi-level feature enhancement techniques for learning occlusion-invariant features. Extensive experiments on three commonly used benchmarks demonstrate UniHOPE's SOTA performance in addressing hand-only and hand-object scenarios. Code will be released at https://github.com/JoyboyWang/UniHOPE_Pytorch.
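
To make the idea of an object switcher that gates feature fusion more concrete, here is a minimal sketch in plain Python. It is not the authors' architecture: the function name `grasp_gated_fusion`, the hard 0.5 threshold, and the additive fusion rule are all assumptions chosen for illustration. It only shows the general pattern of predicting a grasping status and suppressing object features when no grasp is detected.

```python
import math


def sigmoid(x: float) -> float:
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def grasp_gated_fusion(hand_feat, obj_feat, grasp_logit, threshold=0.5):
    """Fuse hand and object features, gated by a predicted grasping status.

    Hypothetical illustration: the threshold and the additive fusion rule
    are assumptions, not details taken from the paper.
    """
    if len(hand_feat) != len(obj_feat):
        raise ValueError("hand and object features must have the same length")

    p_grasp = sigmoid(grasp_logit)
    is_grasping = p_grasp >= threshold

    # When no grasp is predicted, object features are dropped entirely,
    # so the fused feature equals the hand feature.
    gate = p_grasp if is_grasping else 0.0
    fused = [h + gate * o for h, o in zip(hand_feat, obj_feat)]
    return fused, is_grasping


if __name__ == "__main__":
    hand = [1.0, 2.0]
    obj = [10.0, 10.0]
    print(grasp_gated_fusion(hand, obj, grasp_logit=-5.0))  # hand-only case
    print(grasp_gated_fusion(hand, obj, grasp_logit=0.0))   # grasping case
```

In this sketch the same boolean could also decide whether an object pose is emitted at all, which is the role the abstract assigns to the switcher; how the real model combines the two signals is not specified here.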

Paper Structure

This paper contains 33 sections, 11 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Existing approaches (top) for 3D hand pose estimation are either Hand Pose Estimation (HPE), which predicts hand pose only, or Hand-Object Pose Estimation (HOPE), which assumes a hand-held object. Our novel UniHOPE approach (bottom) offers flexibility and robustness to handle both scenes in a unified manner.
  • Figure 2: Our UniHOPE framework. (i) We first de-occlude hand images occluded by objects to form pairs, conditioned on the depth map and hand-object mask, with adaptive selection of control strength to produce high-quality samples; (ii) to accommodate both hand-only and hand-object scenes, our object switcher dynamically controls the object output by predicting grasping status, which guides the feature fusion module to eliminate irrelevant object features; and (iii) to robustly estimate hand pose, our multi-level feature enhancement techniques utilize paired data to learn occlusion-invariant hand features.
  • Figure 3: De-occluded examples in various occlusion conditions.
  • Figure 4: Visualization of our adaptive control strength adjustment.
  • Figure 5: Qualitative comparison between our method and SOTA HPE/HOPE methods on hand-only/hand-object scenarios across different datasets. The first and second rows in each example denote the original view and another view, respectively, for better comparison.
  • ...and 1 more figure
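
Step (iii) of the Figure 2 caption describes using paired occluded and de-occluded images to learn occlusion-invariant hand features at multiple levels. One common way to express such an objective is a per-level feature consistency loss, sketched below in plain Python. The mean-squared-error form, the per-level weights, and the name `multi_level_consistency_loss` are assumptions for illustration; the paper's actual loss terms are not given in this section.

```python
def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def multi_level_consistency_loss(occluded_feats, deoccluded_feats, weights=None):
    """Weighted sum of per-level feature discrepancies for one image pair.

    occluded_feats / deoccluded_feats: one feature vector per network level,
    extracted from the original (occluded) image and its de-occluded
    counterpart. Hypothetical illustration, not the authors' formulation.
    """
    if len(occluded_feats) != len(deoccluded_feats):
        raise ValueError("both inputs must cover the same number of levels")
    if weights is None:
        weights = [1.0] * len(occluded_feats)
    if len(weights) != len(occluded_feats):
        raise ValueError("one weight is required per level")

    return sum(
        w * mse(occ, deocc)
        for w, occ, deocc in zip(weights, occluded_feats, deoccluded_feats)
    )


if __name__ == "__main__":
    occluded = [[0.0, 0.0], [1.0, 3.0]]
    deoccluded = [[1.0, 1.0], [1.0, 1.0]]
    print(multi_level_consistency_loss(occluded, deoccluded))
```

Minimizing a loss of this kind pulls the features of the occluded image toward those of its de-occluded pair, so the loss is zero when the two sets of features agree at every level.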