VPHO: Joint Visual-Physical Cue Learning and Aggregation for Hand-Object Pose Estimation

Jun Zhou; Chi Xu; Kaifeng Tang; Yuting Ge; Tingrui Guo; Li Cheng

TL;DR

This work tackles hand-object pose estimation from a single RGB image by jointly learning 2D visual cues and 3D physical cues, targeting both visual fidelity and physical plausibility. It introduces a Force Prediction module that combines local friction-cone forces with a physics network; the module is trained under Newtonian constraints with semi-supervised pseudo force labels so that the predicted interaction signals are physically meaningful. A diffusion-based generator produces multiple candidate poses, which a two-stage Pose Aggregation pipeline then filters: visual-based refinement across kinematic levels, followed by physics-based re-ranking. The method reports state-of-the-art results on DexYCB and HO3D v2, with improved contact quality and reduced penetration. The approach is modular and compatible with end-to-end training, so it can be paired with different candidate generators, which is relevant for AR and robotics tasks that depend on faithful hand-object interaction.
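To make the two physical notions mentioned above concrete, here is a minimal NumPy sketch of a Coulomb friction-cone test and a Newtonian equilibrium residual for a statically held object. It is an illustration only, not the paper's Force Prediction module: the function names `in_friction_cone` and `newton_residual`, the static-grasp assumption, the friction coefficient, and the unweighted sum of force and torque norms are all assumptions made for this example.

```python
import numpy as np

def in_friction_cone(force, normal, mu):
    """Check whether a contact force lies inside the Coulomb friction cone.

    `normal` points from the contact into the object; `mu` is an assumed
    friction coefficient (hypothetical value, not taken from the paper).
    """
    normal = normal / np.linalg.norm(normal)
    f_n = float(np.dot(force, normal))   # normal component (scalar)
    f_t = force - f_n * normal           # tangential component (vector)
    # The force must push into the object, and friction bounds the tangent part.
    return bool(f_n >= 0.0 and np.linalg.norm(f_t) <= mu * f_n + 1e-9)

def newton_residual(forces, points, com, mass,
                    gravity=np.array([0.0, 0.0, -9.81])):
    """Net-force norm plus net-torque norm for an object assumed to be static.

    A value near zero means the contact forces balance gravity. Summing the
    two norms without weights mixes units and is done only for illustration.
    """
    net_force = forces.sum(axis=0) + mass * gravity
    net_torque = np.cross(points - com, forces).sum(axis=0)
    return float(np.linalg.norm(net_force) + np.linalg.norm(net_torque))

# Two-finger pinch on a 0.2 kg object centred at the origin.
points = np.array([[-0.05, 0.0, 0.0], [0.05, 0.0, 0.0]])
normals = np.array([[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]])
forces = np.array([[2.0, 0.0, 0.981], [-2.0, 0.0, 0.981]])
com = np.zeros(3)

residual = newton_residual(forces, points, com, mass=0.2)
all_valid = all(in_friction_cone(f, n, mu=0.5) for f, n in zip(forces, normals))
```

In this toy grasp the squeeze forces cancel and friction carries the weight, so the residual is numerically zero and both forces sit inside a cone with `mu=0.5`; lowering `mu` to 0.3 would make the same forces infeasible.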

Abstract

Estimating the 3D poses of hands and objects from a single RGB image is a fundamental yet challenging problem, with broad applications in augmented reality and human-computer interaction. Existing methods largely rely on visual cues alone and often produce physically implausible results, such as interpenetration or missing contact between hand and object. Recent efforts to incorporate physics reasoning typically depend on post-optimization or non-differentiable physics engines, which compromise visual consistency and end-to-end trainability. To overcome these limitations, we propose a novel framework that jointly integrates visual and physical cues for hand-object pose estimation. This integration is achieved through two key ideas: 1) joint visual-physical cue learning: the model is trained to extract 2D visual cues and 3D physical cues, thereby enabling more comprehensive representation learning for hand-object interactions; 2) candidate pose aggregation: a novel refinement process aggregates multiple diffusion-generated candidate poses by leveraging both visual and physical predictions, yielding a final estimate that is visually consistent and physically plausible. Extensive experiments demonstrate that our method significantly outperforms existing state-of-the-art approaches in both pose accuracy and physical plausibility.
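The candidate aggregation idea can be pictured with a deliberately simplified two-stage selection: shortlist candidates by a visual error, then re-rank the shortlist by a physical residual. This is a sketch under stated assumptions, not the authors' pipeline, which refines poses across kinematic levels rather than only selecting one. The function name `aggregate_candidates`, the `keep` parameter, and both score arrays are hypothetical.

```python
import numpy as np

def aggregate_candidates(candidates, visual_error, physics_residual, keep=3):
    """Pick one candidate pose using a visual shortlist and a physics re-rank.

    candidates:        (K, D) array of pose parameter vectors
    visual_error:      (K,) lower is better (e.g. a reprojection error)
    physics_residual:  (K,) lower is better (e.g. a force-balance residual)
    keep:              shortlist size for stage 1 (assumed hyperparameter)
    """
    candidates = np.asarray(candidates)
    visual_error = np.asarray(visual_error, dtype=float)
    physics_residual = np.asarray(physics_residual, dtype=float)

    # Stage 1: keep the `keep` candidates that agree best with the image.
    shortlist = np.argsort(visual_error)[:keep]

    # Stage 2: among those, choose the most physically plausible one.
    best = shortlist[np.argmin(physics_residual[shortlist])]
    return int(best), candidates[best]

# Five toy candidates with three pose parameters each.
candidates = np.arange(15, dtype=float).reshape(5, 3)
visual_error = [0.9, 0.2, 0.3, 0.25, 0.8]
physics_residual = [0.0, 0.7, 0.1, 0.4, 0.05]

best_index, best_pose = aggregate_candidates(
    candidates, visual_error, physics_residual, keep=3
)
```

Candidate 0 has the lowest physics residual but poor visual agreement, so it never reaches the second stage; the ordering of the two stages is what keeps the final pose visually consistent before physical plausibility is considered.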
