VPHO: Joint Visual-Physical Cue Learning and Aggregation for Hand-Object Pose Estimation

Jun Zhou, Chi Xu, Kaifeng Tang, Yuting Ge, Tingrui Guo, Li Cheng

TL;DR

This work tackles hand-object pose estimation from a single RGB image by jointly learning 2D visual cues and 3D physical cues, targeting both visual fidelity and physical plausibility. It introduces a Force Prediction module with local friction-cone forces and a physics network, trained under Newtonian constraints and semi-supervised pseudo force labels, to produce physically meaningful interaction signals. A two-stage Pose Aggregation pipeline combines visual-based refinement across kinematic levels with physics-based re-ranking to yield poses that are both accurate and physically plausible. Diffusion-based candidate generation provides multiple pose hypotheses, which are then filtered through the two aggregation stages to achieve state-of-the-art results on DexYCB and HO3D v2, with improved contact quality and reduced penetration. The approach is modular and compatible with end-to-end training, so it can be paired with different candidate generators, and it targets AR and robotics tasks where faithful hand-object interaction matters.
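To make the physical terms above concrete, here is a minimal sketch of a Coulomb friction-cone test and a Newtonian force and torque balance for a statically held object. It is an illustration under stated assumptions, not the paper's formulation: the function names, the convention that the local z-axis is the contact normal, the friction coefficient `mu`, and the static-equilibrium setting are all hypothetical choices made for this example.

```python
import numpy as np


def in_friction_cone(force_local, mu):
    """Check whether a contact force lies inside a Coulomb friction cone.

    Assumption (not from the paper): the force is given in a local contact
    frame whose z-axis is the contact normal pointing into the object.
    """
    force_local = np.asarray(force_local, dtype=float)
    tangential = np.linalg.norm(force_local[:2])  # magnitude of the sliding component
    normal = force_local[2]                       # pressing component along the normal
    return bool(normal >= 0.0 and tangential <= mu * normal + 1e-9)


def newtonian_residual(contact_points, forces_world, mass, com,
                       gravity=(0.0, 0.0, -9.81)):
    """Return the net force and net torque acting on a static object.

    Both are zero vectors when the contact forces exactly balance gravity.
    Torques are taken about the centre of mass `com`, so gravity adds none.
    """
    contact_points = np.asarray(contact_points, dtype=float)
    forces_world = np.asarray(forces_world, dtype=float)
    com = np.asarray(com, dtype=float)
    net_force = forces_world.sum(axis=0) + mass * np.asarray(gravity, dtype=float)
    net_torque = np.cross(contact_points - com, forces_world).sum(axis=0)
    return net_force, net_torque


# Hypothetical two-finger pinch on a 0.1 kg object centred at the origin.
points = [[0.05, 0.0, 0.0], [-0.05, 0.0, 0.0]]
forces = [[-2.0, 0.0, 0.4905], [2.0, 0.0, 0.4905]]  # squeeze inward, friction lifts
net_f, net_t = newtonian_residual(points, forces, mass=0.1, com=[0.0, 0.0, 0.0])
```

In a learning setup, the norms of `net_f` and `net_t` could serve as penalty terms, and the cone test as a feasibility check on predicted forces. How the paper actually formulates its constraints is not stated in this summary.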

Abstract

Estimating the 3D poses of hands and objects from a single RGB image is a fundamental yet challenging problem, with broad applications in augmented reality and human-computer interaction. Existing methods largely rely on visual cues alone, often producing results that violate physical constraints such as interpenetration or non-contact. Recent efforts to incorporate physics reasoning typically depend on post-optimization or non-differentiable physics engines, which compromise visual consistency and end-to-end trainability. To overcome these limitations, we propose a novel framework that jointly integrates visual and physical cues for hand-object pose estimation. This integration is achieved through two key ideas: 1) joint visual-physical cue learning: The model is trained to extract 2D visual cues and 3D physical cues, thereby enabling more comprehensive representation learning for hand-object interactions; 2) candidate pose aggregation: A novel refinement process that aggregates multiple diffusion-generated candidate poses by leveraging both visual and physical predictions, yielding a final estimate that is visually consistent and physically plausible. Extensive experiments demonstrate that our method significantly outperforms existing state-of-the-art approaches in both pose accuracy and physical plausibility.
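The candidate aggregation idea can be illustrated with a deliberately simplified sketch: shortlist candidates by a visual error, then re-rank the shortlist by a physics cost. This is not the paper's hierarchical, per-kinematic-level aggregation. The function name, the scalar scores, and the `keep` parameter are hypothetical stand-ins chosen for this example.

```python
import numpy as np


def aggregate_candidates(candidates, visual_error, physics_cost, keep=3):
    """Pick one pose from several candidates using two cues in sequence.

    Stage 1 keeps the `keep` candidates with the lowest visual error.
    Stage 2 returns the shortlisted candidate with the lowest physics cost.
    Both scores are assumed to be precomputed scalars, one per candidate.
    """
    candidates = np.asarray(candidates, dtype=float)
    visual_error = np.asarray(visual_error, dtype=float)
    physics_cost = np.asarray(physics_cost, dtype=float)
    keep = min(keep, len(candidates))
    shortlist = np.argsort(visual_error)[:keep]            # best visual fits first
    best = shortlist[np.argmin(physics_cost[shortlist])]   # most plausible of those
    return int(best), candidates[best]


# Four hypothetical 2-D "poses" with made-up scores.
poses = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
best_index, best_pose = aggregate_candidates(
    poses,
    visual_error=[0.20, 0.10, 0.90, 0.15],
    physics_cost=[0.50, 0.80, 0.00, 0.30],
    keep=3,
)
```

In this toy example the candidate with the lowest physics cost (index 2) is dropped in the first stage because it fits the image poorly, so the second stage chooses among visually consistent poses only. Applying the visual cue first is one way to keep a physically plausible but visually wrong pose from winning.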

Paper Structure

This paper contains 30 sections, 23 equations, 9 figures, 10 tables.

Figures (9)

  • Figure 1: Visual comparison between a state-of-the-art visual-only method, HFL, and a method that incorporates physical cues, DeepSimHO. HFL yields visually aligned yet physically implausible results (red circles), while DeepSimHO improves physical plausibility at the cost of visual alignment (blue circles).
  • Figure 2: The framework of our approach, consisting of the following four modules: feature extraction, force prediction, candidate generation and pose aggregation.
  • Figure 3: Friction cone and force representations in (a) local and (b) global coordinate frames.
  • Figure 4: The architecture of our physics network.
  • Figure 5: (a) The levels of hand pose parameters; (b) The visual-based aggregation hierarchically aggregates hand joints from lower to higher levels.
  • ...and 4 more figures