AnyHand: A Large-Scale Synthetic Dataset for RGB(-D) Hand Pose Estimation

Chen Si, Yulin Liu, Bo Ai, Jianwen Xie, Rolandos Alexandros Potamias, Chuanxia Zheng, Hao Su

Abstract

We present AnyHand, a large-scale synthetic dataset designed to advance the state of the art in 3D hand pose estimation from both RGB-only and RGB-D inputs. Recent work on foundation-style approaches has shown that increasing the quantity and diversity of training data markedly improves performance and robustness in hand pose estimation; however, existing real-world datasets for this task are limited in coverage, and prior synthetic datasets rarely provide occlusions, arm details, and aligned depth together at scale. To address this bottleneck, AnyHand contains 2.5M single-hand and 4.1M hand-object interaction RGB-D images with rich geometric annotations. In the RGB-only setting, extending the original training sets of existing baselines with AnyHand yields significant gains on multiple benchmarks (FreiHAND and HO-3D), even with the architecture and training scheme kept fixed. Moreover, the model trained with AnyHand generalizes more strongly to the out-of-domain HO-Cap dataset without any fine-tuning. We also contribute a lightweight depth fusion module that integrates easily into existing RGB-based models. Trained with AnyHand, the resulting RGB-D model achieves superior performance on the HO-3D benchmark, demonstrating the benefits of depth integration and the effectiveness of our synthetic data.
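The abstract does not describe the depth fusion module's internals, so the following is only a minimal, hypothetical PyTorch sketch of one way a lightweight fusion block could drop into an existing RGB-based model, as the abstract suggests. The class name `DepthFusionModule`, the layer widths, and the residual late-fusion design are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class DepthFusionModule(nn.Module):
    """Hypothetical sketch: encode an aligned depth map and fuse it into
    RGB backbone features via a residual 1x1 projection. All shapes and
    layer choices are assumptions, not taken from the paper."""

    def __init__(self, rgb_channels: int = 256, depth_channels: int = 64):
        super().__init__()
        # Small CNN encoder for the single-channel depth map.
        self.depth_encoder = nn.Sequential(
            nn.Conv2d(1, depth_channels, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(depth_channels, depth_channels, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        # Project concatenated RGB+depth features back to the RGB width,
        # so the block can be inserted without changing the host model.
        self.fuse = nn.Conv2d(rgb_channels + depth_channels, rgb_channels, kernel_size=1)

    def forward(self, rgb_feat: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        d = self.depth_encoder(depth)
        # Resize depth features to match the RGB feature map resolution.
        d = nn.functional.interpolate(d, size=rgb_feat.shape[-2:],
                                      mode="bilinear", align_corners=False)
        # Residual fusion keeps the RGB pathway intact.
        return rgb_feat + self.fuse(torch.cat([rgb_feat, d], dim=1))

# Toy usage: fuse a 256x256 depth map into 16x16 backbone features.
fusion = DepthFusionModule(rgb_channels=256)
rgb_feat = torch.randn(2, 256, 16, 16)   # features from an RGB backbone
depth = torch.randn(2, 1, 256, 256)      # aligned depth map
out = fusion(rgb_feat, depth)            # -> (2, 256, 16, 16)
```

The residual form is one natural choice for an add-on module, since it leaves the pretrained RGB pathway unchanged at initialization; the paper may use a different fusion strategy.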

Paper Structure

This paper contains 21 sections, 9 figures, and 10 tables.

Figures (9)

  • Figure 1: We propose AnyHand as a large-scale synthetic RGB-D dataset that substantially expands coverage of hand poses, hand-object interactions, occlusions, and viewpoint variations in the wild. When used to co-train state-of-the-art models such as HaMeR [pavlakos2024hamer] and WiLoR [potamias2024wilor], it yields consistent improvements and supports robust 3D hand pose reconstruction across diverse real-world scenes. Predicted hand meshes from WiLoR co-trained with AnyHand are shown in pink.
  • Figure 2: Qualitative visualization of controllable variations. We showcase representative samples from our generator by varying one factor at a time: skin tones (top), single-hand poses from DPoser-Hand [lu2025dposerx], hand textures from Handy [potamias2023handy], and forearm appearance from SMPLitex [casas2023smplitex]. These examples demonstrate the diversity of appearance and context that AnyHand leverages to better match in-the-wild conditions.
  • Figure 3: Qualitative visualization. Examples of AnyHand-Single (left) and AnyHand-Interact (right), with both HDR environment-map backgrounds (top) and real indoor scenes (bottom). Beyond diverse hand/arm appearance and poses, the samples also vary the grasped objects and grasp configurations, producing a wide range of object-induced occlusions and self-occlusions under varying viewpoints.
  • Figure 4: Qualitative visualization. WiLoR w/ AnyHand vs. WiLoR on FreiHAND [zimmermann2019freihand] and the AnyHand test set. Left to right: input, GT, WiLoR w/ AnyHand, WiLoR. Adding synthetic data improves fine-grained pose estimation, particularly fingertip bending and finger joint angles (boxed regions), yielding meshes that better match the image evidence.
  • Figure 5: Scaling of HaMeR co-training with AnyHand. We retrain HaMeR [pavlakos2024hamer] while keeping its original real-data training set fixed, and vary the number of additional AnyHand samples used. We report PA-MPJPE and PA-MPVPE on FreiHAND [zimmermann2019freihand], HO-3D v2 [hampali2020ho3d], and HO-Cap [wang2024hocap]. Co-training with synthetic data consistently reduces error, with diminishing returns beyond ~2–4M samples. (A minimal data-mixing sketch illustrating this co-training setup follows the figure list.)
  • ...and 4 more figures
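The co-training protocol described for Figure 5 (a fixed real training set plus a varying number of AnyHand samples) can be mimicked with standard PyTorch data utilities. The sketch below is an illustrative assumption about that setup, not the authors' training code; dataset sizes, image resolutions, and the 21-keypoint annotation shape are toy stand-ins.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, Subset, TensorDataset

# Toy stand-ins for the fixed real training set and the AnyHand synthetic
# set: (image, 21 3D hand keypoints) pairs at a small resolution.
real_set = TensorDataset(torch.randn(200, 3, 64, 64), torch.randn(200, 21, 3))
synthetic_set = TensorDataset(torch.randn(1000, 3, 64, 64), torch.randn(1000, 21, 3))

def cotrain_loader(num_synthetic: int, batch_size: int = 64) -> DataLoader:
    """Keep the real set fixed and vary the amount of synthetic data,
    mirroring the scaling protocol described for Figure 5."""
    # Randomly subsample the requested number of synthetic examples.
    indices = torch.randperm(len(synthetic_set))[:num_synthetic].tolist()
    mixed = ConcatDataset([real_set, Subset(synthetic_set, indices)])
    # Shuffling interleaves real and synthetic samples within each batch.
    return DataLoader(mixed, batch_size=batch_size, shuffle=True)

loader = cotrain_loader(num_synthetic=400)
images, joints = next(iter(loader))  # one mixed real + synthetic batch
```

Sweeping `num_synthetic` over increasing values would reproduce the shape of the scaling study: each setting retrains the same model with the same real data, so any change in benchmark error is attributable to the added synthetic samples.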