
HandOS: 3D Hand Reconstruction in One Stage

Xingyu Chen, Zhuheng Song, Xiaoke Jiang, Yaoqing Hu, Junzhi Yu, Lei Zhang

TL;DR

HandOS introduces a fully end-to-end, one-stage framework for 3D hand mesh reconstruction. It keeps a detector frozen and couples 2D keypoint estimation with 3D mesh reasoning through an interactive 2D-3D decoder. Key innovations include instance-to-joint query expansion, 2D-to-3D query lifting with a learnable lifting matrix, and hierarchical attention that models 2D joints, 3D vertices, and camera translation simultaneously, removing the need for explicit left-right classification. The approach achieves state-of-the-art results across major benchmarks (e.g., FreiHand, HO3Dv3, DexYCB, HInt) and is robust to imperfect detections and diverse scenes, with 2D and 3D predictions that complement each other. In practice, the single-pass design reduces computation, mitigates error propagation, and unifies hand detection, pose estimation, and mesh reconstruction.
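As a rough illustration of what "query lifting with a learnable lifting matrix" could look like, the sketch below maps a set of 2D joint queries to 3D vertex queries with a single matrix product. The joint count (21), the vertex count (778, the MANO mesh size), the feature width, and the names `lift_queries` and `lifting_matrix` are assumptions made for this example and are not taken from the paper; in a trained model the matrix would be a learned parameter rather than a random array.

```python
import numpy as np

NUM_JOINTS = 21   # assumption: common hand-joint count
NUM_VERTS = 778   # assumption: MANO mesh vertex count
DIM = 256         # assumption: query feature width


def lift_queries(joint_queries, lifting_matrix):
    """Map (J, C) joint queries to (V, C) vertex queries.

    Each vertex query is a weighted combination of all joint queries,
    with the weights given by one row of the lifting matrix.
    """
    return lifting_matrix @ joint_queries


rng = np.random.default_rng(0)
joint_queries = rng.normal(size=(NUM_JOINTS, DIM))
# Stand-in for a learnable parameter of shape (V, J).
lifting_matrix = rng.normal(scale=0.02, size=(NUM_VERTS, NUM_JOINTS))
vertex_queries = lift_queries(joint_queries, lifting_matrix)
print(vertex_queries.shape)
```

Under this reading, the lifting step adds only `NUM_VERTS * NUM_JOINTS` parameters and lets every 3D query inherit semantics from the 2D joint queries.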

Abstract

Existing approaches to hand reconstruction predominantly adhere to a multi-stage framework, encompassing detection, left-right classification, and pose estimation. This paradigm induces redundant computation and cumulative errors. In this work, we propose HandOS, an end-to-end framework for 3D hand reconstruction. Our central motivation lies in leveraging a frozen detector as the foundation while incorporating auxiliary modules for 2D and 3D keypoint estimation. In this manner, we integrate pose estimation capacity into the detection framework while obviating the need for the left-right category as a prerequisite. Specifically, we propose an interactive 2D-3D decoder, where 2D joint semantics are derived from detection cues while the 3D representation is lifted from that of the 2D joints. Furthermore, hierarchical attention is designed to enable the concurrent modeling of 2D joints, 3D vertices, and camera translation. Consequently, we achieve an end-to-end integration of hand detection, 2D pose estimation, and 3D mesh reconstruction within a one-stage framework, overcoming the multi-stage drawbacks above. HandOS also reaches state-of-the-art performance on public benchmarks, e.g., 5.0 PA-MPJPE on FreiHand and 64.6% PCK@0.05 on HInt-Ego4D. Project page: idea-research.github.io/HandOSweb.
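For readers unfamiliar with the two metrics quoted above, here is a minimal NumPy sketch of how they are commonly computed. PA-MPJPE is the mean per-joint position error after a Procrustes (similarity) alignment of the prediction to the ground truth, and is usually reported in millimetres. PCK@0.05 is the fraction of 2D keypoints whose error falls below 0.05 times a reference size. The function names and the `ref_size` normalization (for example a hand box size) are assumptions for illustration; this is not the benchmarks' official evaluation code.

```python
import numpy as np


def procrustes_align(pred, gt):
    """Similarity-align pred (N, 3) to gt (N, 3): scale, rotation, translation."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    # SVD of the 3x3 cross-covariance yields the optimal rotation (Kabsch/Umeyama).
    u, s, vt = np.linalg.svd(p.T @ g)
    # Flip the last axis if needed so the result is a rotation, not a reflection.
    d = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    fix = np.array([1.0, 1.0, d])
    rot = vt.T @ np.diag(fix) @ u.T
    scale = (s * fix).sum() / (p ** 2).sum()
    return scale * p @ rot.T + mu_g


def pa_mpjpe(pred, gt):
    """Mean per-joint error after Procrustes alignment (same unit as the input)."""
    aligned = procrustes_align(pred, gt)
    return float(np.linalg.norm(aligned - gt, axis=1).mean())


def pck(pred_2d, gt_2d, ref_size, thresh=0.05):
    """Fraction of keypoints with 2D error below thresh * ref_size (assumed normalizer)."""
    err = np.linalg.norm(pred_2d - gt_2d, axis=-1)
    return float((err < thresh * ref_size).mean())
```

Because the alignment removes global scale, rotation, and translation, PA-MPJPE measures articulated pose and shape accuracy only; it says nothing about where the hand sits in camera space.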

Paper Structure

This paper contains 52 sections, 12 equations, 15 figures, 10 tables.

Figures (15)

  • Figure 1: Overview of HandOS framework. Left: overall architecture. Right: interactive 2D-3D decoder. With off-the-shelf features, bounding boxes, and category scores from a frozen detector, the interactive 2D-3D decoder, including query filtering, expansion, lifting, and interactive layers, can understand hand pose and shape via estimating keypoints in both 2D and 3D spaces. Each query $\mathbf Q$ is associated with a reference box, which is not depicted in the figure for conciseness.
  • Figure 2: Decoding layers. (a) Canonical 2D layer, popularly employed by previous works. (b) Interactive layer, where hierarchical attention is designed to effectively model 2D and 3D queries.
  • Figure 3: Normal vectors serve as left-right indicator. When applying right-hand faces to left or right vertices, the directions of the normal vectors are opposed, as illustrated by the purple lines.
  • Figure 4: Visualization of HO3Dv3 with the actual detection box. We argue that the ground-truth (GT) box (red) is ill-suited for downstream tasks.
  • Figure 5: Visual comparison. We are adept at handling long-tail textures, crowded hands, and unseen styles. Red arrows indicate errors.
  • ...and 10 more figures
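The observation in the Figure 3 caption can be checked on a toy mesh. The sketch below uses a tetrahedron as a stand-in for a hand mesh whose faces are wound so that normals point outward (the "right-hand faces"). Mirroring the vertices across the x-axis plays the role of the left hand. Reusing the same face list on the mirrored vertices flips every normal inward, and the sign of the signed volume turns that flip into a scalar indicator. The tetrahedron, the helper names, and the signed-volume test are illustrative assumptions, not the paper's procedure.

```python
# Toy "right-hand" mesh: a tetrahedron whose faces are wound so normals point outward.
VERTS = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
FACES = [(0, 2, 1), (0, 1, 3), (0, 3, 2), (1, 2, 3)]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def face_normal(verts, face):
    """Unnormalized normal of one triangle, following its winding order."""
    a, b, c = (verts[i] for i in face)
    return cross(sub(b, a), sub(c, a))


def signed_volume(verts, faces):
    """Positive when the face winding is outward for these vertices."""
    return sum(dot(verts[i], cross(verts[j], verts[k])) for i, j, k in faces) / 6.0


def mirror_x(verts):
    """Reflect across the x = 0 plane (stand-in for a right-to-left hand flip)."""
    return [(-x, y, z) for x, y, z in verts]


left_like = mirror_x(VERTS)
print(face_normal(VERTS, FACES[3]), face_normal(left_like, FACES[3]))
print(signed_volume(VERTS, FACES), signed_volume(left_like, FACES))
```

On the mirrored vertices the normal of the last face is the negated mirror image of the original one, so the same face topology yields opposite orientation for the two sides, and the signed volume changes sign accordingly.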