WildGHand: Learning Anti-Perturbation Gaussian Hand Avatars from Monocular In-the-Wild Videos

Hanhui Li, Xuan Huang, Wanquan Liu, Yuhao Cheng, Long Chen, Yiqiang Yan, Xiaodan Liang, Chenqiang Gao

TL;DR

WildGHand is an optimization-based framework that enables self-adaptive 3D Gaussian splatting on in-the-wild videos to produce high-fidelity hand avatars. It achieves state-of-the-art performance and substantially improves over its base model across multiple metrics.

Abstract

Despite recent progress in 3D hand reconstruction from monocular videos, most existing methods rely on data captured in well-controlled environments and therefore degrade in real-world settings with severe perturbations, such as hand-object interactions, extreme poses, illumination changes, and motion blur. To tackle these issues, we introduce WildGHand, an optimization-based framework that enables self-adaptive 3D Gaussian splatting on in-the-wild videos and produces high-fidelity hand avatars. WildGHand incorporates two key components: (i) a dynamic perturbation disentanglement module that explicitly represents perturbations as time-varying biases on 3D Gaussian attributes during optimization, and (ii) a perturbation-aware optimization strategy that generates per-frame anisotropic weighted masks to guide optimization. Together, these components allow the framework to identify and suppress perturbations across both spatial and temporal dimensions. We further curate a dataset of monocular hand videos captured under diverse perturbations to benchmark in-the-wild hand avatar reconstruction. Extensive experiments on this dataset and two public datasets demonstrate that WildGHand achieves state-of-the-art performance and substantially improves over its base model across multiple metrics (e.g., up to a $15.8\%$ relative gain in PSNR and a $23.1\%$ relative reduction in LPIPS). Our implementation and dataset are available at https://github.com/XuanHuang0/WildGHand.
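
To make component (i) concrete, here is a minimal sketch of the idea of modeling perturbations as time-varying biases on Gaussian attributes that are applied during optimization and dropped at inference. It is not the authors' implementation: the function names, the flat list layout of the attributes, and the choice to scale each per-frame bias by a scalar temporal weight are all assumptions made for illustration.

```python
from typing import Dict, List


def apply_bias(canonical: List[float], bias: List[float], temporal_weight: float) -> List[float]:
    """Add a per-frame bias, scaled by a temporal weight, to canonical attributes.

    Assumption: the bias is additive and the temporal weight is one scalar per frame.
    """
    return [c + temporal_weight * b for c, b in zip(canonical, bias)]


def attributes_for_frame(
    canonical: List[float],
    per_frame_bias: Dict[int, List[float]],
    temporal_weights: Dict[int, float],
    frame: int,
    training: bool,
) -> List[float]:
    """Return the attributes used to render a given frame.

    During optimization the perturbation bias for that frame is added, so the
    canonical attributes do not have to absorb the perturbation. At inference
    the bias is removed and only the canonical attributes are used.
    """
    if not training:
        return list(canonical)
    return apply_bias(canonical, per_frame_bias[frame], temporal_weights[frame])


# Hypothetical two-value attribute vector and one perturbed frame.
canonical = [1.0, 2.0]
per_frame_bias = {0: [0.5, -0.5]}
temporal_weights = {0: 0.5}

train_attrs = attributes_for_frame(canonical, per_frame_bias, temporal_weights, 0, training=True)
infer_attrs = attributes_for_frame(canonical, per_frame_bias, temporal_weights, 0, training=False)
```

In this sketch, a frame with a strong perturbation would receive a larger temporal weight, so more of its appearance is explained by the bias rather than by the canonical avatar.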

Paper Structure (19 sections, 9 equations, 6 figures, 10 tables)

Figures (6)

  • Figure 1: We present WildGHand, a novel Gaussian splatting framework for generating realistic hand avatars from short monocular videos exhibiting challenging perturbations, including hand-object interactions, complex poses, illumination variations, and motion blur.
  • Figure 2: The proposed WildGHand framework. Given a monocular video affected by perturbations, WildGHand introduces two key components for robust estimation of 3D Gaussians: a lightweight dynamic perturbation disentanglement (DPD) module and a perturbation-aware optimization (PAO) strategy. The DPD module represents potential perturbations as biases on Gaussian attributes, whose optimization is guided by the weighted masks predicted by the PAO strategy. During inference, the optimized biases are removed to render perturbation-free images.
  • Figure 3: Illustration of the proposed PAO and DPD modules. Left: Our perturbation-aware optimization (PAO) strategy segments the hand regions and leverages reconstruction error to generate weighted masks that guide the optimization of 3DGS. Right: The temporal weights estimated by our dynamic perturbation disentanglement (DPD) module reflect the strengths of perturbations. Partial perturbations (e.g., occlusions) tend to have smaller weights (labeled in green), while holistic perturbations (e.g., motion blur) have larger weights (labeled in red).
  • Figure 4: Visual examples of our HWP dataset, which covers diverse challenging scenes like motion blur, hand-object interactions, illumination variations, and complex poses (from top left to bottom right).
  • Figure 5: Qualitative comparisons between our proposed WildGHand model and state-of-the-art methods on interacting-hand videos (top) and single-hand videos (bottom).
  • ...and 1 more figure
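
The weighted-mask idea described in the Figure 3 caption (a hand segmentation combined with reconstruction error) can be illustrated with a small sketch. This is an assumption-laden toy, not the paper's PAO strategy: the exponential down-weighting, the temperature `tau`, and the function names are hypothetical, and the sketch does not model the anisotropic aspect mentioned in the abstract.

```python
import math
from typing import List

Image = List[List[float]]


def weighted_mask(hand_mask: Image, rendered: Image, target: Image, tau: float = 0.1) -> Image:
    """Build a per-pixel weight from a hand mask and the reconstruction error.

    Assumption: pixels inside the hand mask with a large error are likely
    perturbed (e.g., occluded), so they get a weight close to 0.
    """
    mask = []
    for m_row, r_row, t_row in zip(hand_mask, rendered, target):
        row = []
        for m, r, t in zip(m_row, r_row, t_row):
            err = abs(r - t)
            row.append(m * math.exp(-err / tau))
        mask.append(row)
    return mask


def masked_l1(rendered: Image, target: Image, mask: Image) -> float:
    """Weighted mean absolute error; returns 0.0 if all weights are zero."""
    num = 0.0
    den = 0.0
    for w_row, r_row, t_row in zip(mask, rendered, target):
        for w, r, t in zip(w_row, r_row, t_row):
            num += w * abs(r - t)
            den += w
    return num / den if den > 0 else 0.0


# Hypothetical 2x2 grayscale frame: one hand pixel is corrupted, one pixel is background.
hand_mask = [[1.0, 1.0], [0.0, 1.0]]
rendered = [[0.5, 0.5], [0.5, 0.5]]
target = [[0.5, 0.9], [0.0, 0.5]]

weights = weighted_mask(hand_mask, rendered, target)
loss = masked_l1(rendered, target, weights)
```

With these toy values, the corrupted hand pixel contributes far less to the loss than it would under a plain hand-mask average, which is the behavior the caption attributes to the weighted masks.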