Generalized Hand-Object Pose Estimation with Occlusion Awareness

Hui Yang, Wei Sun, Jian Liu, Jian Xiao, Tao Xie, Hossein Rahmani, Ajmal Saeed Mian, Nicu Sebe, Gim Hee Lee

Abstract

Generalized 3D hand-object pose estimation from a single RGB image remains challenging due to the large variations in object appearances and interaction patterns, especially under heavy occlusion. We propose GenHOI, a framework for generalized hand-object pose estimation with occlusion awareness. GenHOI integrates hierarchical semantic knowledge with hand priors to enhance model generalization under challenging occlusion conditions. Specifically, we introduce a hierarchical semantic prompt that encodes object states, hand configurations, and interaction patterns via textual descriptions. This enables the model to learn abstract high-level representations of hand-object interactions for generalization to unseen objects and novel interactions while compensating for missing or ambiguous visual cues. To enable robust occlusion reasoning, we adopt a multi-modal masked modeling strategy over RGB images, predicted point clouds, and textual descriptions. Moreover, we leverage hand priors as stable spatial references to extract implicit interaction constraints. This allows reliable pose inference even under significant variations in object shapes and interaction patterns. Extensive experiments on the challenging DexYCB and HO3Dv2 benchmarks demonstrate that our method achieves state-of-the-art performance in hand-object pose estimation.

Paper Structure (19 sections, 12 equations, 7 figures, 13 tables)

Figures (7)

  • Figure 1: Illustration of task challenges and key performance comparisons. Hand-object pose estimation (left) faces the challenge of generalizing from the training set to an unseen test set. Additionally, severe hand-object occlusion makes this task more complicated. Our method shows stronger generalization (bottom: qualitative comparison with HFL-Net lin2023harmonious on unseen objects in front and back views, with red dotted circles marking focus areas) and achieves the best performance (right) on the challenging generalized DexYCB S3 split chao2021dexycb. These results confirm its effectiveness in generalized settings and under occlusion.
  • Figure 2: GenHOI framework, which consists of three key components: (1) Hierarchical Textual Semantics. Given RGB images and hierarchical templates at the object, hand, and interaction levels, we employ InstructBLIP to generate hierarchical textual descriptions. (2) Multi-Modal Mask Modeling. Given the RGB image, textual description, and corresponding hand-object point cloud, we first apply a modality-specific masking strategy to the inputs. The masked inputs are processed by the cross-modal embedding module, which fuses visual, geometric, and textual features to produce representations used for reconstruction and pose estimation. The dashed boxes indicate that masking and reconstruction are used only during training. (3) Hand Prior Guided Pose Estimation. The cross-modal features learned in the previous stage are aggregated into a robust representation. Using these fused features, we first estimate hand pose parameters, which then serve as reliable priors for object pose reasoning.
  • Figure 3: Qualitative comparison of hand-object pose estimation on unseen objects under the DexYCB S3 split chao2021dexycb. Front and back indicate the front and rear views, respectively. Red dotted circles highlight regions where other methods produce less accurate pose estimates than our method. This demonstrates the superior generalization and robustness of our method on unseen objects.
  • Figure 4: Qualitative comparison on the HO3Dv2 dataset hampali2020honnotate. The top-left example shows results on the unseen object (“019 pitcher base”), while the remaining examples correspond to objects seen during training.
  • Figure A: Qualitative comparison between our method and HOISDF qi2024hoisdf on the DexYCB S0 split chao2021dexycb.
  • ...and 2 more figures
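
The modality-specific masking step described in the Figure 2 caption can be pictured with a small sketch. This is not the authors' implementation: the function names, the masking ratios, the `[MASK]` placeholder, and the list-based token representations are all hypothetical, chosen only to illustrate masking each modality (image patches, point cloud, text) independently at its own rate during training.

```python
import random


def mask_tokens(tokens, ratio, mask_value, rng):
    """Replace a random subset of tokens with mask_value.

    Returns the masked copy and the sorted indices that were masked,
    which a reconstruction objective would use as its targets.
    """
    n_mask = int(round(len(tokens) * ratio))
    masked_idx = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    for i in masked_idx:
        masked[i] = mask_value
    return masked, masked_idx


def mask_modalities(image_patches, points, words, ratios, seed=0):
    """Apply an independent masking ratio to each modality (assumed design)."""
    rng = random.Random(seed)
    return {
        "image": mask_tokens(image_patches, ratios["image"], None, rng),
        "points": mask_tokens(points, ratios["points"], None, rng),
        "text": mask_tokens(words, ratios["text"], "[MASK]", rng),
    }


# Toy inputs; the ratios below are illustrative, not values from the paper.
example = mask_modalities(
    image_patches=[f"patch{i}" for i in range(16)],
    points=[(0.0, 0.0, float(i)) for i in range(10)],
    words="a right hand grasps the mug by its handle".split(),
    ratios={"image": 0.5, "points": 0.3, "text": 0.25},
)
```

In a real model the masked positions would be replaced by learned mask embeddings and passed to the cross-modal embedding module; as the caption notes, masking and reconstruction would be skipped at inference time.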