VPHO: Joint Visual-Physical Cue Learning and Aggregation for Hand-Object Pose Estimation
Jun Zhou, Chi Xu, Kaifeng Tang, Yuting Ge, Tingrui Guo, Li Cheng
TL;DR
This work tackles hand-object pose estimation from a single RGB image by jointly learning visual cues and 3D physical cues, targeting both visual fidelity and physical plausibility. It introduces a Force Prediction module with local friction-cone forces and a physics network, trained with Newtonian constraints and semi-supervised pseudo force labels, to produce physically meaningful interaction signals. A two-stage Pose Aggregation pipeline combines visual-based refinement across kinematic levels with physics-based re-ranking to yield poses that are both accurate and physically plausible. Diffusion-based candidate generation provides multiple pose hypotheses, which the two aggregation stages then filter. The method reports state-of-the-art results on DexYCB and HO3D v2, with improved contact quality and reduced penetration. The approach is modular and compatible with end-to-end training, so it can be integrated with different candidate generators, which is relevant to AR and robotics tasks where faithful hand-object interaction matters.
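To make the physical cues mentioned above more concrete, the following is a minimal sketch of the two textbook conditions they refer to: a Coulomb friction-cone test for a single contact force, and a Newtonian static-equilibrium residual for an object held at rest. It is not the paper's Force Prediction module; the function names, the friction coefficient `mu`, the gravity direction, and the static-grasp assumption are all illustrative choices.

```python
import numpy as np

def in_friction_cone(force, normal, mu):
    """Return True if a contact force lies inside the Coulomb friction cone.

    `normal` points from the contact into the object; `mu` is an assumed
    friction coefficient (hypothetical value, not taken from the paper).
    """
    normal = np.asarray(normal, dtype=float)
    force = np.asarray(force, dtype=float)
    normal = normal / np.linalg.norm(normal)
    f_n = float(np.dot(force, normal))           # normal component
    f_t = np.linalg.norm(force - f_n * normal)   # tangential magnitude
    return bool(f_n >= 0.0 and f_t <= mu * f_n)

def equilibrium_residual(forces, points, center, mass, gravity=(0.0, -9.81, 0.0)):
    """Norms of the net force and net torque on an object assumed to be static.

    Both values are zero when the contact forces exactly balance gravity.
    """
    forces = np.asarray(forces, dtype=float)
    points = np.asarray(points, dtype=float)
    center = np.asarray(center, dtype=float)
    net_force = forces.sum(axis=0) + mass * np.asarray(gravity, dtype=float)
    net_torque = np.cross(points - center, forces).sum(axis=0)
    return float(np.linalg.norm(net_force)), float(np.linalg.norm(net_torque))

# Two symmetric contacts that together carry a 0.1 kg object.
forces = [[0.0, 0.4905, 0.0], [0.0, 0.4905, 0.0]]
points = [[0.05, 0.0, 0.0], [-0.05, 0.0, 0.0]]
force_res, torque_res = equilibrium_residual(forces, points, [0.0, 0.0, 0.0], mass=0.1)
```

In a learning setting, residuals like these could serve as penalty terms or as scores for ranking pose candidates, which is the role the summary attributes to the physical predictions.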
Abstract
Estimating the 3D poses of hands and objects from a single RGB image is a fundamental yet challenging problem, with broad applications in augmented reality and human-computer interaction. Existing methods largely rely on visual cues alone, often producing results that violate physical constraints, such as interpenetration or missing contact. Recent efforts to incorporate physics reasoning typically depend on post-optimization or non-differentiable physics engines, which compromise visual consistency and end-to-end trainability. To overcome these limitations, we propose a novel framework that jointly integrates visual and physical cues for hand-object pose estimation. This integration is achieved through two key ideas: 1) joint visual-physical cue learning: the model is trained to extract 2D visual cues and 3D physical cues, thereby enabling more comprehensive representation learning for hand-object interactions; 2) candidate pose aggregation: a novel refinement process aggregates multiple diffusion-generated candidate poses by leveraging both visual and physical predictions, yielding a final estimate that is visually consistent and physically plausible. Extensive experiments demonstrate that our method significantly outperforms existing state-of-the-art approaches in both pose accuracy and physical plausibility.
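The candidate pose aggregation idea can be illustrated with a deliberately simplified sketch: shortlist the candidates that agree best with the visual evidence, then choose among the shortlist using a physics score. The actual method refines poses across kinematic levels rather than simply selecting one, so the function below, its `keep` parameter, and the scalar error inputs are assumptions made only for illustration.

```python
import numpy as np

def aggregate_candidates(poses, visual_err, physics_err, keep=3):
    """Pick one pose from a set of candidates in two stages.

    Stage 1 keeps the `keep` candidates with the lowest visual error.
    Stage 2 re-ranks that shortlist by physics error and returns the best.
    All inputs are hypothetical per-candidate scores (lower is better).
    """
    poses = np.asarray(poses, dtype=float)
    visual_err = np.asarray(visual_err, dtype=float)
    physics_err = np.asarray(physics_err, dtype=float)

    shortlist = np.argsort(visual_err)[:keep]               # stage 1: visual filter
    best = shortlist[np.argmin(physics_err[shortlist])]     # stage 2: physics re-rank
    return int(best), poses[best]

# Four toy candidates, each a 2-D stand-in for a full pose vector.
poses = [[0.0, 0.0], [0.1, 0.0], [0.5, 0.5], [0.1, 0.1]]
visual_err = [0.2, 0.1, 0.9, 0.3]
physics_err = [0.5, 0.8, 0.0, 0.1]
best_index, best_pose = aggregate_candidates(poses, visual_err, physics_err, keep=3)
```

Candidate 2 has the lowest physics error but is discarded in the first stage because it fits the image poorly, which shows why the two cues are applied in sequence rather than using either one alone.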
