HOISDF: Constraining 3D Hand-Object Pose Estimation with Global Signed Distance Fields

Haozhe Qi; Chen Zhao; Mathieu Salzmann; Alexander Mathis

TL;DR

This paper tackles monocular 3D hand-object pose estimation under severe occlusion. It proposes HOISDF, which uses a global Signed Distance Field (SDF) as an implicit 3D shape representation to guide pose regression, addressing the limitations of explicit intermediate representations. The method comprises two modules: a global SDF learning module, and a field-guided pose regression module that samples informative points, augments their features with field densities, and applies cross-field attention to resolve occlusions. HOISDF achieves state-of-the-art results on DexYCB and HO3Dv2. The work demonstrates that implicit global shape information can robustly constrain hand-object poses and enables end-to-end training with real-time inference, offering practical impact for AR, robotics, and neuroscience research.

Abstract

Human hands are highly articulated and versatile at handling objects. Jointly estimating the 3D poses of a hand and the object it manipulates from a monocular camera is challenging due to frequent occlusions. Thus, existing methods often rely on intermediate 3D shape representations to increase performance. These representations are typically explicit, such as 3D point clouds or meshes, and thus provide information only in the direct surroundings of the intermediate hand pose estimate. To address this, we introduce HOISDF, a Signed Distance Field (SDF) guided hand-object pose estimation network, which jointly exploits hand and object SDFs to provide a global, implicit representation over the complete reconstruction volume. Specifically, the role of the SDFs is threefold: equip the visual encoder with implicit shape information, help to encode hand-object interactions, and guide the hand and object pose regression via SDF-based sampling and by augmenting the feature representations. We show that HOISDF achieves state-of-the-art results on hand-object pose estimation benchmarks (DexYCB and HO3Dv2). Code is available at https://github.com/amathislab/HOISDF
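
The SDF-based sampling mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: two analytic spheres stand in for the learned hand and object SDFs, and the names `sphere_sdf`, `sample_near_surface`, `make_grid`, and the `threshold` value are hypothetical. The sketch assumes the common sign convention (negative inside the surface, positive outside) and keeps only query points close to either surface, returning both distances so they could be appended to point features.

```python
import math


def sphere_sdf(point, center, radius):
    """Signed distance to a sphere: negative inside, zero on the surface, positive outside."""
    return math.dist(point, center) - radius


def make_grid(n, lo, hi):
    """Regular n x n x n grid of 3D query points covering [lo, hi]^3."""
    step = (hi - lo) / (n - 1)
    axis = [lo + i * step for i in range(n)]
    return [(x, y, z) for x in axis for y in axis for z in axis]


def sample_near_surface(points, hand_sdf, obj_sdf, threshold):
    """Keep points whose absolute distance to the hand or object surface is below threshold.

    Returns (point, d_hand, d_obj) tuples, so both signed distances remain
    available to augment each retained point's features.
    """
    kept = []
    for p in points:
        d_hand, d_obj = hand_sdf(p), obj_sdf(p)
        if min(abs(d_hand), abs(d_obj)) < threshold:
            kept.append((p, d_hand, d_obj))
    return kept


if __name__ == "__main__":
    # Stand-in "hand" and "object" shapes (assumed for illustration only).
    hand = lambda p: sphere_sdf(p, (-0.3, 0.0, 0.0), 0.4)
    obj = lambda p: sphere_sdf(p, (0.3, 0.0, 0.0), 0.4)
    grid = make_grid(9, -1.0, 1.0)
    kept = sample_near_surface(grid, hand, obj, threshold=0.05)
    print(f"kept {len(kept)} of {len(grid)} query points")
```

Because the field is defined over the whole volume, every grid point receives a distance to both surfaces, not only the points near an initial pose estimate; the threshold then selects the informative subset.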

Paper Structure (38 sections, 9 equations, 9 figures, 12 tables)

Introduction
Related Work
3D Hand-Object Pose Estimation
Distance Fields in Hand-Object Interactions
Attention-based Methods
HOISDF
Global Signed Distance Field Learning
Integrating Field Information: Field-guided Pose Regression
Field-informed Point Sampling
Field-based Point Feature Augmentation
Cross Fields Hand-Object Interaction
Feature Enhancement with Point-wise Attention
Point-wise Pose Regression
Experiments
Datasets and Evaluation Metrics
...and 23 more sections

Figures (9)

Figure 1: Conceptual advantage of the SDF-guided model over existing approaches. Our model utilizes Signed Distance Fields (SDF) to provide global and dense constraints for hand-object pose estimation. In contrast to direct lifting and coarse-to-fine methods, which struggle to refine poor initial predictions, the distance field yields global cues not limited to areas near an initial prediction.
Figure 2: Overall pipeline of HOISDF. HOISDF has two parts: a global signed distance field learning module and a field-guided pose regression module. The global signed distance field learning module regresses the hand and object signed distances as the intermediate representation and encodes 3D shape information into the image backbone through implicit field learning. The field-guided pose regression module uses global field information to filter and augment the point features and to guide hand-object interaction. The enhanced point features are then used to regress hand and object poses with point-wise attention.
Figure 3: Visualization of the intermediate query points on the DexYCB test set. The darkness of each query point reflects its predicted distance to the hand (in blue) and object (in green) surfaces. The intermediate SDF representations capture the GT 3D hand and object shapes. HOISDF effectively uses the robust global cues from the SDFs to handle various objects and hand movements as well as their mutual occlusions.
Figure 4: Qualitative comparisons between HOISDF and prior methods [lin2023harmonious, wang2023interacting] on the DexYCB test set. HOISDF effectively uses robust global cues near the hand and object to handle various objects and severe occlusions.
Figure F1: Details of hand pose regression of HOISDF.
...and 4 more figures