BIGS: Bimanual Category-agnostic Interaction Reconstruction from Monocular Videos via 3D Gaussian Splatting

Jeongwan On, Kyeonghwan Gwak, Gunyoung Kang, Junuk Cha, Soohyun Hwang, Hyein Hwang, Seungryul Baek

TL;DR

BIGS reconstructs bimanual hand-object interactions with an unknown object from a monocular video by building 3D Gaussians for the two hands and the object. It introduces a two-stage optimization with a shared hand Gaussian that accumulates hand information and object Gaussians guided by a diffusion prior (SDS with textual inversion) to recover occluded surfaces, plus an interacting-subjects step that aligns the hands and the object. The method combines MANO hand priors, TriplaneNet features, and the diffusion prior to render novel viewpoints, and reports state-of-the-art accuracy on $MPJPE$, $CD_o$, $F10$, and rendering metrics ($PSNR$, $SSIM$, $LPIPS$) on the challenging ARCTIC and HO3Dv3 datasets. This work advances category-agnostic, monocular HOI reconstruction, handling severe occlusions and enabling fast, view-consistent rendering for downstream applications.
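For orientation, the SDS term mentioned above is usually written as a gradient rather than an explicit loss. The form below is the commonly cited DreamFusion-style formulation, not necessarily the exact weighting or conditioning used in BIGS. As assumptions for illustration, $\theta$ stands for the object Gaussian parameters, $x = g(\theta)$ for the rendered image, $x_t$ for its noised version at timestep $t$, $y$ for the conditioning (for example, an embedding obtained by textual inversion), $\hat{\epsilon}_\phi$ for the frozen diffusion model's noise prediction, and $w(t)$ for a timestep-dependent weight:

$$
\nabla_\theta \mathcal{L}_{\mathrm{SDS}} = \mathbb{E}_{t,\epsilon}\left[ w(t)\,\bigl(\hat{\epsilon}_\phi(x_t; y, t) - \epsilon\bigr)\,\frac{\partial x}{\partial \theta} \right]
$$

Intuitively, the rendered view is pushed toward images the diffusion model considers likely under $y$, which is how unobserved object parts can receive a training signal.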

Abstract

Reconstructing hand-object interaction (HOI) in 3D is a fundamental problem with numerous applications. Despite recent advances, there is not yet a comprehensive pipeline for bimanual class-agnostic interaction reconstruction from a monocular RGB video, where two hands and an unknown object interact with each other. Previous works tackled restricted hand-object interaction cases, where object templates are known in advance or only one hand is involved in the interaction. Bimanual interaction reconstruction exhibits severe occlusions introduced by complex interactions between two hands and an object. To solve this, we introduce BIGS (Bimanual Interaction 3D Gaussian Splatting), a method that reconstructs 3D Gaussians of hands and an unknown object from a monocular video. To robustly obtain object Gaussians despite severe occlusions, we leverage the prior knowledge of a pre-trained diffusion model with a score distillation sampling (SDS) loss to reconstruct unseen object parts. For hand Gaussians, we exploit the 3D priors of a hand model (i.e., MANO) and share a single Gaussian for two hands to effectively accumulate hand 3D information, given limited views. To further account for the 3D alignment between hands and objects, we include an interacting-subjects optimization step during Gaussian optimization. Our method achieves state-of-the-art accuracy on two challenging datasets in terms of 3D hand pose estimation (MPJPE), 3D object reconstruction (CDh, CDo, F10), and rendering quality (PSNR, SSIM, LPIPS).
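To make the reported metrics concrete, here is a minimal sketch of three of them in plain Python using their textbook definitions. The function names are hypothetical, and the paper's exact evaluation protocol (root or Procrustes alignment for MPJPE, units, squared versus unsquared Chamfer distance, image value range for PSNR) is not specified here, so those choices are assumptions.

```python
import math


def mpjpe(pred_joints, gt_joints):
    """Mean per-joint position error: average Euclidean distance
    between matched predicted and ground-truth 3D joints."""
    if len(pred_joints) != len(gt_joints) or not pred_joints:
        raise ValueError("joint lists must be non-empty and of equal length")
    total = sum(math.dist(p, g) for p, g in zip(pred_joints, gt_joints))
    return total / len(pred_joints)


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance using squared nearest-neighbour
    distances (assumed variant), computed by brute force in O(|A||B|)."""
    if not points_a or not points_b:
        raise ValueError("point sets must be non-empty")

    def one_way(src, dst):
        # For each source point, squared distance to its nearest target point.
        return sum(min(math.dist(s, d) ** 2 for d in dst) for s in src) / len(src)

    return one_way(points_a, points_b) + one_way(points_b, points_a)


def psnr(image, reference, max_val=1.0):
    """Peak signal-to-noise ratio in dB for two equally sized images
    given as flat sequences of pixel values in [0, max_val]."""
    if len(image) != len(reference) or not image:
        raise ValueError("images must be non-empty and of equal size")
    mse = sum((x - y) ** 2 for x, y in zip(image, reference)) / len(image)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)
```

For example, `mpjpe(pred, gt)` takes two equal-length lists of `(x, y, z)` tuples and returns the error in the same unit as the inputs. Lower is better for MPJPE and Chamfer distance, higher is better for PSNR.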

Paper Structure

This paper contains 13 sections, 9 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Our approach reconstructs 3D Gaussians of bimanual category-agnostic interactions from a monocular video, where the two hands interact with an unknown object. Even with limited observations, our method reliably builds the 3D Gaussians in this scenario. Once the 3D Gaussians are built, they can be used to render new videos with novel hand poses, object poses, and camera viewpoints.
  • Figure 2: Overview of the BIGS pipeline. Initial hand meshes $\mathcal{M}_H$ and initial object meshes $\mathcal{M}_O$ are reconstructed in a pre-processing step (see the supplemental material for details). Afterwards, the 'single-subject optimization' step optimizes 3D Gaussians for hands $\mathcal{G}_H$ and objects $\mathcal{G}_O$ in the canonical space: TriplaneNet $\{\mathcal{T}^H,\mathcal{T}^O\}$, MLPs $\{f^H_D, f^H_A, f^O_A, f^H_G, f^O_G\}$ and learnable parameters $\mathbf{P}^t$ are updated using Eq. \ref{eq:single_optim}. Subsequently, the 'interacting-subjects optimization' step further reflects contacts between hands and objects, and refines the initial hand Gaussians $\mathcal{G}_H$ using Eq. \ref{eq:joint_optim}.
  • Figure 3: Qualitative examples of 3D meshes (Rows 1, 2) and 2D rendered images (Rows 3, 4), obtained from 'HOLD' \cite{fan2024hold} and 'Ours' on the ARCTIC \cite{fan2023arctic} and HO3D \cite{hampali2020honnotate} datasets. For each example, the first row shows results in the original viewpoint and the second row shows results in the novel viewpoint. In the 3D mesh examples, 'Ours' exhibits better alignment between hands and objects in 3D space in both the camera and novel viewpoints, while the outputs of 'HOLD' \cite{fan2024hold} are incomplete even in the camera viewpoint and their 3D locations become completely wrong in the novel viewpoints. In the rendered images, 'Ours' exhibits cleaner rendering quality in both the camera and novel viewpoints, while the outputs of 'HOLD' \cite{fan2024hold} suffer from noise.