
HOGSA: Bimanual Hand-Object Interaction Understanding with 3D Gaussian Splatting Based Data Augmentation

Wentian Qu, Jiahe Li, Jian Cheng, Jian Shi, Chenyu Meng, Cuixia Ma, Hongan Wang, Xiaoming Deng, Yinda Zhang

TL;DR

This work tackles data scarcity in bimanual hand-object interaction by introducing HOGSA, a 3D Gaussian Splatting (3DGS)-based data augmentation framework. It combines mesh-based 3DGS modeling, a Pose Optimization Module that diversifies hand-object poses, and a Super-Resolution Module that produces photorealistic renderings, enabling end-to-end data generation. The approach expands the Arctic and H2O datasets to roughly 1.7M and 0.7M augmented images, respectively, and consistently improves baseline performance on pose, contact, and interaction-field tasks. The authors report faster augmentation, higher realism, and fewer occlusion-related errors, enabling more robust learning for robotics and VR applications.
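The TL;DR names mesh-based 3DGS modeling but does not spell out how the Gaussians are tied to the hand and object meshes. One common way to do this, assumed here purely for illustration, is to bind each Gaussian to a mesh triangle with fixed barycentric weights, so that reposing the mesh moves the Gaussians with it. The sketch computes only Gaussian centers; the helper name `gaussian_centers` is hypothetical, and per-Gaussian rotation, scale, and offsets are omitted.

```python
import numpy as np


def gaussian_centers(vertices, faces, face_ids, barycentric):
    """Return Gaussian centers bound to mesh triangles (hypothetical helper).

    vertices:    (V, 3) posed mesh vertex positions, e.g. a hand or object mesh
    faces:       (F, 3) vertex indices of each triangle
    face_ids:    (N,)   index of the triangle each Gaussian is attached to
    barycentric: (N, 3) fixed barycentric weights per Gaussian; rows sum to 1
    """
    corners = vertices[faces[face_ids]]  # (N, 3, 3): the three corners of each Gaussian's triangle
    # Weighted sum of the corners: a point on the triangle that follows the mesh as it deforms.
    return np.einsum("nk,nkd->nd", barycentric, corners)


# Usage: one triangle carrying two Gaussians (at its centroid and at its first corner).
vertices = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
faces = np.array([[0, 1, 2]])
face_ids = np.array([0, 0])
barycentric = np.array([[1 / 3, 1 / 3, 1 / 3], [1.0, 0.0, 0.0]])

rest = gaussian_centers(vertices, faces, face_ids, barycentric)
# Translate the mesh as a stand-in for a new pose; the Gaussians move with it.
moved = gaussian_centers(vertices + np.array([0.0, 0.0, 2.0]), faces, face_ids, barycentric)
```

Because the barycentric weights stay fixed, any new hand or object pose only requires recomputing the mesh vertices; the Gaussian appearance parameters learned from the original images can be reused unchanged.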

Abstract

Understanding bimanual hand-object interaction plays an important role in robotics and virtual reality. However, severe occlusions between the hands and the object, together with high-degree-of-freedom motions, make it challenging to collect and annotate a high-quality, large-scale dataset, which limits further improvement of bimanual hand-object interaction baselines. In this work, we propose a new 3D Gaussian Splatting (3DGS)-based data augmentation framework for bimanual hand-object interaction that augments existing datasets into large-scale photorealistic data with diverse hand-object poses and viewpoints. First, we use mesh-based 3DGS to model the objects and hands, and we design a super-resolution module to address the rendering blur caused by the multi-resolution input images. Second, we extend a single-hand grasping pose optimization module to bimanual hand-object interaction, generating diverse bimanual poses that significantly expand the pose distribution of the dataset. Third, we analyze how different aspects of the proposed data augmentation affect the understanding of bimanual hand-object interaction. We apply our data augmentation to two benchmarks, H2O and Arctic, and verify that it improves the performance of the baselines.
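The abstract's second step extends single-hand grasp optimization to two hands. The toy sketch below illustrates what optimizing both hands jointly against one object can look like; it is not the paper's module. It assumes the object is a sphere described by its signed distance function, treats each hand as a rigid set of contact points moved only by a translation (a real system would optimize full hand pose parameters), and uses hypothetical energy terms: a contact term that pulls points onto the surface, a penetration penalty, and a hand-hand collision term. Gradients are taken by central finite differences for simplicity; all function names are illustrative.

```python
import numpy as np


def sphere_sdf(points, center, radius):
    """Signed distance to a sphere: negative inside, positive outside."""
    return np.linalg.norm(points - center, axis=-1) - radius


def bimanual_energy(offsets, hand_points, center, radius, margin=0.05, w_pen=10.0, w_hh=10.0):
    """Toy energy for two rigid hands (hypothetical terms, not the paper's loss).

    offsets:     (2, 3) translation applied to the left (0) and right (1) hand
    hand_points: (2, P, 3) contact points of each hand in its starting position
    """
    pts = hand_points + offsets[:, None, :]
    d = sphere_sdf(pts, center, radius)                # (2, P) signed distances
    contact = np.sum(d ** 2)                           # pull contact points onto the surface
    penetration = np.sum(np.clip(-d, 0.0, None) ** 2)  # extra penalty for points inside the object
    # Pairwise left/right distances; penalize pairs closer than `margin` (hand-hand collision).
    dist = np.linalg.norm(pts[0][:, None, :] - pts[1][None, :, :], axis=-1)
    hand_hand = np.sum(np.clip(margin - dist, 0.0, None) ** 2)
    return contact + w_pen * penetration + w_hh * hand_hand


def optimize_offsets(offsets, hand_points, center, radius, lr=0.05, steps=200, eps=1e-6):
    """Gradient descent on both hands' offsets jointly, using central finite differences."""
    x = np.array(offsets, dtype=float)
    for _ in range(steps):
        grad = np.zeros_like(x)
        for idx in np.ndindex(*x.shape):
            step = np.zeros_like(x)
            step[idx] = eps
            e_plus = bimanual_energy(x + step, hand_points, center, radius)
            e_minus = bimanual_energy(x - step, hand_points, center, radius)
            grad[idx] = (e_plus - e_minus) / (2 * eps)
        x -= lr * grad
    return x


# Usage: a unit sphere with one hand on each side, both starting away from the surface.
center, radius = np.zeros(3), 1.0
left = np.array([[-2.0, 0.1, 0.0], [-2.0, -0.1, 0.0], [-2.0, 0.0, 0.1]])
hand_points = np.stack([left, -left])  # right hand mirrored through the origin
init = np.zeros((2, 3))
best = optimize_offsets(init, hand_points, center, radius)
```

Both offsets live in a single energy because the hand-hand term couples them: once the hands come close, one hand's placement constrains the other, which is what separates the bimanual setting from optimizing each hand on its own.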

Paper Structure

This paper contains 23 sections, 9 equations, 13 figures, and 4 tables.

Figures (13)

  • Figure 1: We propose a new 3DGS-based data augmentation framework for bimanual hand-object interaction that augments existing datasets with diverse hand-object poses and viewpoints. Our method improves the performance of the baselines and yields more accurate pose and contact estimates.
  • Figure 2: Overview of our data augmentation framework for bimanual hand-object interaction. From the original dataset, we first build mesh-based 3DGS models and feed the original poses into the pose optimization module to increase interaction diversity. The novel poses and the 3DGS models are combined to render low-quality images, which are then fed into the super-resolution module to further enhance realism. Together, these modules automatically build an expanded dataset that is used to fine-tune interaction-understanding baselines and improve their performance.
  • Figure 3: Examples from HOGSA, which contain diverse interaction poses while preserving image realism.
  • Figure 4: The augmented data used to train the baseline. Compared with the original data, our images remain realistic while covering more diverse poses.
  • Figure 5: Qualitative results of the baseline with HOGSA data augmentation. After fine-tuning, the model covers a wider range of interaction poses and estimates pose and contact areas more accurately.
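Figure 2 states that the novel poses and the 3DGS models are combined to render an image. A minimal sketch of that combination for a single pixel is shown below, assuming the hand and object Gaussians are simply concatenated into one set and rendered with the standard 3DGS front-to-back compositing rule C = Σ_i c_i α_i Π_{j<i} (1 − α_j), with Gaussians sorted by depth. The function `composite_pixel` and the toy colors, opacities, and depths are hypothetical; the real rasterizer works per tile on the GPU and includes details such as early termination that are omitted here.

```python
import numpy as np


def composite_pixel(colors, alphas, depths):
    """Front-to-back alpha compositing for one pixel, following the 3DGS blending rule.

    colors: (N, 3) RGB of the Gaussians overlapping the pixel
    alphas: (N,)   opacity of each Gaussian at this pixel (after its 2D falloff)
    depths: (N,)   camera-space depth used to order the Gaussians
    Returns the composited RGB and the remaining transmittance.
    """
    rgb = np.zeros(3)
    transmittance = 1.0
    for i in np.argsort(depths):  # nearest Gaussian first
        rgb += transmittance * alphas[i] * colors[i]
        transmittance *= 1.0 - alphas[i]
    return rgb, transmittance


# Hand and object Gaussians are merged into one set before sorting, so occlusion
# between them is resolved per pixel by depth.
hand = {"colors": np.array([[1.0, 0.0, 0.0]]), "alphas": np.array([0.5]), "depths": np.array([1.0])}
obj = {"colors": np.array([[0.0, 0.0, 1.0]]), "alphas": np.array([1.0]), "depths": np.array([2.0])}
merged = {key: np.concatenate([hand[key], obj[key]]) for key in hand}
rgb, T = composite_pixel(merged["colors"], merged["alphas"], merged["depths"])
```

Here the half-transparent red hand Gaussian sits in front of an opaque blue object Gaussian, so the pixel becomes an even red-blue mix; moving the hand behind the object would leave only blue. Merging before sorting is what lets a new bimanual pose produce correct hand-object occlusions without any per-scene retraining.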