HandOS: 3D Hand Reconstruction in One Stage

Xingyu Chen; Zhuheng Song; Xiaoke Jiang; Yaoqing Hu; Junzhi Yu; Lei Zhang

TL;DR

HandOS introduces an end-to-end, one-stage framework for 3D hand mesh reconstruction. It builds on a frozen detector and couples 2D keypoint estimation with 3D mesh reasoning through an interactive 2D-3D decoder. Key innovations include instance-to-joint query expansion, 2D-to-3D query lifting with a learnable lifting matrix, and hierarchical attention that jointly models 2D joints, 3D vertices, and camera translation, removing the need for explicit left-right classification. The approach achieves state-of-the-art results on major benchmarks (e.g., FreiHand, HO3Dv3, DexYCB, HInt) and is robust to imperfect detections and diverse scenes, with complementary 2D and 3D predictions. In practice, it reduces computation, mitigates error propagation, and unifies hand detection, pose estimation, and mesh reconstruction in a single pass.
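The idea of lifting 2D joint queries to 3D queries with a lifting matrix can be illustrated with a small sketch. This is not the authors' implementation: the function name `lift_queries`, the toy sizes (3 joints, 4 vertices, 2 feature channels), and the fixed matrix values are all assumptions made for illustration. In a trained model the matrix entries would be learnable parameters rather than hand-picked numbers.

```python
def lift_queries(joint_queries, lifting_matrix):
    """Lift J joint queries (each a list of C floats) to V vertex queries.

    lifting_matrix has V rows and J columns (hypothetical layout); each
    output query is a weighted sum of the joint queries.
    """
    num_joints = len(joint_queries)
    dim = len(joint_queries[0])
    lifted = []
    for row in lifting_matrix:
        if len(row) != num_joints:
            raise ValueError("each lifting-matrix row needs one weight per joint")
        lifted.append(
            [sum(w * q[c] for w, q in zip(row, joint_queries)) for c in range(dim)]
        )
    return lifted


# Toy example: 3 joint queries with 2 channels each (assumed sizes).
joint_queries = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

# 4 x 3 matrix: the first three vertex queries copy a joint query,
# the fourth averages the first two joints.
lifting_matrix = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.0],
]

vertex_queries = lift_queries(joint_queries, lifting_matrix)
```

Each lifted query is a linear combination of joint queries, so the 3D representation starts from 2D joint semantics, which matches the description above at a high level only.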

Abstract

Existing approaches to hand reconstruction predominantly adhere to a multi-stage framework encompassing detection, left-right classification, and pose estimation. This paradigm induces redundant computation and cumulative errors. In this work, we propose HandOS, an end-to-end framework for 3D hand reconstruction. Our central motivation lies in leveraging a frozen detector as the foundation while incorporating auxiliary modules for 2D and 3D keypoint estimation. In this manner, we integrate pose estimation capacity into the detection framework while obviating the need for the left-right category as a prerequisite. Specifically, we propose an interactive 2D-3D decoder, where 2D joint semantics are derived from detection cues and the 3D representation is lifted from that of the 2D joints. Furthermore, hierarchical attention is designed to enable the concurrent modeling of 2D joints, 3D vertices, and camera translation. Consequently, we achieve an end-to-end integration of hand detection, 2D pose estimation, and 3D mesh reconstruction within a one-stage framework, overcoming the multi-stage drawbacks above. Meanwhile, HandOS reaches state-of-the-art performance on public benchmarks, e.g., 5.0 PA-MPJPE on FreiHand and 64.6% PCK@0.05 on HInt-Ego4D. Project page: idea-research.github.io/HandOSweb.
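For readers unfamiliar with the PCK@0.05 figure quoted above, the following is a minimal sketch of how a PCK (percentage of correct keypoints) score is commonly computed. It assumes that the threshold is normalized by the hand bounding-box size and that every keypoint is annotated; the abstract does not state the exact protocol, and the function name `pck` and the sample coordinates are hypothetical.

```python
import math


def pck(pred, gt, box_size, threshold=0.05):
    """Fraction of keypoints whose 2D error is within threshold * box_size.

    pred and gt are equal-length lists of (x, y) pixel coordinates.
    Normalizing by box_size is an assumption of this sketch.
    """
    if len(pred) != len(gt) or not gt:
        raise ValueError("pred and gt must be non-empty and the same length")
    limit = threshold * box_size
    correct = sum(1 for p, g in zip(pred, gt) if math.dist(p, g) <= limit)
    return correct / len(gt)


# Toy example: four keypoints in a 100-pixel hand box, so the limit is 5 pixels.
gt_points = [(10, 10), (20, 20), (30, 30), (40, 40)]
pred_points = [(10, 11), (20, 20), (30, 40), (43, 43)]

score = pck(pred_points, gt_points, box_size=100)
```

In this toy case three of the four predictions fall within the 5-pixel limit, so the score is 0.75; a benchmark number such as 64.6% would be the same ratio aggregated over a whole dataset.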
