GHOST: Fast Category-agnostic Hand-Object Interaction Reconstruction from RGB Videos using Gaussian Splatting

Ahmed Tawfik Aboukhadra, Marcel Rogge, Nadia Robertini, Abdalla Arafa, Jameel Malik, Ahmed Elhayek, Didier Stricker

Abstract

Understanding realistic hand-object interactions from monocular RGB videos is essential for AR/VR, robotics, and embodied AI. Existing methods rely on category-specific templates or heavy computation, yet still produce physically inconsistent hand-object alignment in 3D. We introduce GHOST (Gaussian Hand-Object Splatting), a fast, category-agnostic framework for reconstructing dynamic hand-object interactions using 2D Gaussian Splatting. GHOST represents both hands and objects as dense, view-consistent Gaussian discs and introduces three key innovations: (1) a geometric-prior retrieval and consistency loss that completes occluded object regions, (2) a grasp-aware alignment that refines hand translations and object scale to ensure realistic contact, and (3) a hand-aware background loss that prevents penalizing hand-occluded object regions. GHOST achieves complete, physically consistent, and animatable reconstructions from a single RGB video while running an order of magnitude faster than prior category-agnostic methods. Extensive experiments on ARCTIC, HO3D, and in-the-wild datasets demonstrate state-of-the-art accuracy in 3D reconstruction and 2D rendering quality, establishing GHOST as an efficient and robust solution for realistic hand-object interaction modeling. Code is available at https://github.com/ATAboukhadra/GHOST.
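
The abstract does not give the exact form of the hand-aware background loss, so the following is only a minimal sketch of one plausible reading: penalize rendered object opacity outside the object mask, but skip pixels covered by a hand, since the object may legitimately continue behind the hand there. The function name `hand_aware_background_loss`, the mask layout, and the use of a plain mean are all assumptions made for illustration and are not taken from the paper or its code.

```python
def hand_aware_background_loss(alpha, object_mask, hand_mask):
    """Mean rendered object opacity over pixels that are neither object nor hand.

    Hypothetical helper, not the authors' implementation.

    alpha:       H x W nested list of rendered object opacities in [0, 1]
    object_mask: H x W nested list of 0/1, 1 where the object is visible
    hand_mask:   H x W nested list of 0/1, 1 where a hand covers the pixel
    """
    total = 0.0
    count = 0
    for a_row, o_row, h_row in zip(alpha, object_mask, hand_mask):
        for a, o, h in zip(a_row, o_row, h_row):
            # Only true background pixels contribute: not object, not hand.
            if o == 0 and h == 0:
                total += a
                count += 1
    # Guard against frames with no background pixels at all.
    return total / count if count else 0.0


# Toy 2 x 3 frame: the middle column is covered by the hand.
alpha = [[0.9, 0.8, 0.1],
         [0.0, 0.7, 0.2]]
object_mask = [[1, 0, 0],
               [0, 0, 0]]
hand_mask = [[0, 1, 0],
             [0, 1, 0]]
no_hand = [[0, 0, 0],
           [0, 0, 0]]

hand_aware = hand_aware_background_loss(alpha, object_mask, hand_mask)
naive = hand_aware_background_loss(alpha, object_mask, no_hand)
```

In this toy frame, the naive variant (empty hand mask) also penalizes the high opacities under the hand, which would push the optimizer to erase occluded object surface. The hand-aware variant ignores those pixels and therefore yields a smaller penalty.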

Paper Structure (36 sections, 13 equations, 15 figures, 3 tables)

Figures (15)

  • Figure 1: Our method reconstructs complete 3D hand–object interactions from a single monocular RGB video—recovering full object surfaces and realistic hand contact even under severe occlusions—while enabling fast, accurate, category-agnostic reconstruction and novel-view rendering.
  • Figure 2: (Top) Overview of our pipeline, which consists of three stages. In preprocessing, we extract hand meshes, camera poses, and object information (i.e., mask, point cloud, and geometric prior). During hand–object alignment, the object's scale and the hand translations are optimized using grasp-aware and temporal reasoning. In the Gaussian Splatting stage, hands and objects are jointly reconstructed with occlusion-aware losses. (Bottom) Retrieval and alignment pipeline used to obtain the object's geometric prior.
  • Figure 3: Qualitative comparison showing (a) the effect of our novel background loss $\mathcal{L}_{bkg,h}$ on object reconstruction quality. (b) Top: Gaussian centers on the canonical hand; middle: deformed hand mesh with aligned Gaussian centers after $\mathcal{T}_{aff}$; bottom: final animatable Gaussian hand after training.
  • Figure 4: Qualitative results of GHOST on ARCTIC (Fan et al., 2023), HO3D (Hampali et al., 2020), and in-the-wild examples. Left: aligned 3D hand meshes with the object's geometric prior obtained during hand–object (HO) alignment. Right: photorealistic Gaussian Splatting renderings from the original and novel viewpoints. GHOST produces consistent hand–object alignment in 3D and maintains realistic appearance even under view changes, enabling physically plausible interaction reconstruction and high-fidelity rendering across viewpoints.
  • Figure 5: Qualitative comparison demonstrating the effect of the geometric loss $\mathcal{L}_{geo}$ on the quality of reconstructed object point clouds derived from Gaussian centers.
  • ...and 10 more figures