Affordance-Guided Diffusion Prior for 3D Hand Reconstruction

Naru Suzuki, Takehiko Ohkawa, Tatsuro Banno, Jihyun Lee, Ryosuke Furuta, Yoichi Sato

TL;DR

This work tackles 3D hand pose reconstruction under severe occlusion by using affordance-aware textual descriptions as context. It introduces an affordance-guided diffusion prior that learns a distribution of plausible hand poses conditioned on descriptions generated from a vision-language model (VLM) and summarized by an LLM, with the pose variable $x_0=\boldsymbol{\theta} \in \mathbb{R}^{15\times 3}$ and diffusion steps up to $T=1000$. Starting from initial HaMeR estimates, the model refines occluded joints by aligning with the affordance-conditioned prior while maintaining visible joints via 2D keypoint fitting. The approach yields clear improvements on HOGraspNet, outperforming regression and unconditional diffusion baselines, and offers controllable, interpretable refinements through affordance descriptions, demonstrating strong potential for robust HOI understanding and dexterous manipulation.
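The forward (noising) process over the pose variable $x_0=\boldsymbol{\theta} \in \mathbb{R}^{15\times 3}$ with $T=1000$ steps can be illustrated with a minimal sketch. The linear noise schedule, its endpoints (`beta_start`, `beta_end`), and the helper names below are assumptions for illustration in the standard DDPM style; the summary does not state which schedule the authors use.

```python
import math
import random

T = 1000  # number of diffusion steps, as stated in the text


def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Linear DDPM-style noise schedule (an assumption, not from the paper)."""
    return [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out


def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t ~ N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I)."""
    a = alpha_bars[t]
    scale, sigma = math.sqrt(a), math.sqrt(1.0 - a)
    return [[scale * v + sigma * rng.gauss(0.0, 1.0) for v in joint] for joint in x0]


rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_betas(T))
# Placeholder pose: 15 joints x 3 rotation parameters (all zeros for illustration).
theta = [[0.0, 0.0, 0.0] for _ in range(15)]
x_t = q_sample(theta, t=T - 1, alpha_bars=alpha_bars, rng=rng)
```

At the last step the signal coefficient is close to zero, so `x_t` is nearly pure Gaussian noise; a learned prior would reverse this process, conditioned on the affordance description.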

Abstract

How can we reconstruct 3D hand poses when large portions of the hand are heavily occluded by itself or by objects? Humans often resolve such ambiguities by leveraging contextual knowledge -- such as affordances, where an object's shape and function suggest how the object is typically grasped. Inspired by this observation, we propose a generative prior for hand pose refinement guided by affordance-aware textual descriptions of hand-object interactions (HOI). Our method employs a diffusion-based generative model that learns the distribution of plausible hand poses conditioned on affordance descriptions, which are inferred from a large vision-language model (VLM). This enables the refinement of occluded regions into more accurate and functionally coherent hand poses. Extensive experiments on HOGraspNet, a 3D hand-affordance dataset with severe occlusions, demonstrate that our affordance-guided refinement significantly improves hand pose estimation over both recent regression methods and diffusion-based refinement lacking contextual reasoning.

Paper Structure

This paper contains 5 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: Can affordance-aware textual descriptions benefit 3D hand reconstruction? We develop an affordance-guided diffusion prior model that refines 3D hand pose into more accurate and functionally coherent poses. Given the initial pose estimates from HaMeR pavlakos:cvpr24:HaMeR, our method achieves robust hand pose refinement under self-occlusion and object-occlusion on the HOGraspNet dataset cho:eccv24dense. The vertex error is color-coded on the hand mesh.
  • Figure 2: Affordance description generation with VLM (left) and affordance-guided diffusion prior for hand poses (right). Our proposed description generation scheme consists of two steps. The first obtains a parsed caption from the image and hand-object bounding boxes with a VLM (QwenVL2.5 bai2025qwen2.5vl). The second summarizes the parsed caption using an LLM (Mistral-7B albert:corr23:mistral7b) to obtain detailed descriptions of affordances. Our proposed model is trained to generate hand poses that align with the generated descriptions. The overall architecture is inspired by InterHandGen lee:cvpr24:interhandgen, with the counter-hand conditioning replaced by affordance descriptions.
  • Figure 3: Single-view pose refinement. We diffuse the initial 3D hand poses for $n_r$ steps and denoise them with 2D keypoint fitting, inspired by Ohkawa:iccv25:SCGen. Our refinement uses affordance descriptions as the condition and corrects occluded joints (blue), while visible joints (green) remain fixed. Occlusion labels are obtained from two criteria: self-occlusion (ray casting on the MANO mesh) and object-occlusion (SAM2 mask).
  • Figure 4: Qualitative results of our diffusion-based refinement. We show examples of refinement results from our diffusion prior, with the vertex error color-coded on the hand mesh. While the initial hand pose from HaMeR pavlakos:cvpr24:HaMeR struggles to estimate joints in occluded regions, our method refines them so that the hand can plausibly grasp the object even under occlusion.
  • Figure 5: Qualitative results of our diffusion prior. We show samples generated from partially different descriptions. We find that the diffusion-based generation responds to object size, while remaining consistent with the grasp taxonomy.
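
The refinement loop described in the Figure 3 caption (diffuse the initial pose for $n_r$ steps, then denoise while keeping visible joints fixed) can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: the `denoiser` callable is hypothetical and stands in for the affordance-conditioned model, the hard overwrite of visible joints replaces the 2D keypoint fitting mentioned in the caption, and the linear schedule and deterministic DDIM-style update are assumptions.

```python
import math
import random


def alpha_bar_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear schedule (assumed)."""
    bars, prod = [], 1.0
    for t in range(T):
        prod *= 1.0 - (beta_start + (beta_end - beta_start) * t / (T - 1))
        bars.append(prod)
    return bars


def refine(init_pose, visible, denoiser, alpha_bars, n_r, rng):
    """Diffuse init_pose for n_r steps, then denoise while pinning visible joints."""
    a = alpha_bars[n_r - 1]
    # Partial forward diffusion of the initial estimate.
    x = [[math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for v in joint]
         for joint in init_pose]
    for t in range(n_r - 1, -1, -1):
        # Hypothetical model call: predicts the clean pose given the description.
        x0_hat = denoiser(x, t)
        # Hard constraint standing in for 2D keypoint fitting (a simplification):
        # visible joints are reset to the initial estimate.
        x0_hat = [list(init_pose[i]) if visible[i] else list(x0_hat[i])
                  for i in range(len(x))]
        if t == 0:
            x = x0_hat
            break
        a_t, a_prev = alpha_bars[t], alpha_bars[t - 1]
        # Deterministic DDIM-style update (eta = 0), an assumed sampler.
        x = [[math.sqrt(a_prev) * p
              + math.sqrt(1.0 - a_prev) * (xv - math.sqrt(a_t) * p) / math.sqrt(1.0 - a_t)
              for xv, p in zip(xj, pj)]
             for xj, pj in zip(x, x0_hat)]
    return x


rng = random.Random(0)
init_pose = [[0.1 * i, 0.0, 0.0] for i in range(15)]  # toy initial estimate
visible = [i < 10 for i in range(15)]                 # toy occlusion labels
toy_denoiser = lambda x, t: [[0.5, 0.5, 0.5] for _ in x]  # placeholder for the learned prior
refined = refine(init_pose, visible, toy_denoiser, alpha_bar_schedule(), n_r=50, rng=rng)
```

With the toy denoiser, the visible joints come back unchanged and only the occluded joints move toward the prior's prediction, which mirrors the green/blue split in the figure.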