ManiVideo: Generating Hand-Object Manipulation Video with Dexterous and Generalizable Grasping

Youxin Pang, Ruizhi Shao, Jiajun Zhang, Hanzhang Tu, Yun Liu, Boyao Zhou, Hongwen Zhang, Yebin Liu

TL;DR

ManiVideo tackles the challenge of generating realistic hand-object manipulation videos with dexterous and generalizable grasping. It introduces a multi-layer occlusion (MLO) representation that learns 3D occlusion relationships from occlusion-free normal maps $H$ and occlusion confidence maps $D$; the representation is embedded into a diffusion-based generator to enforce 3D-consistent hand-object interaction (HOI) dynamics. To address the scarcity of HOI video data and generalize across diverse objects, the method draws on object priors from Objaverse and uses a two-stage training strategy that combines HOI videos with object-only data, so that object appearance and geometry generalize robustly. Experiments on DexYCB and Objaverse-derived data show state-of-the-art fidelity, 3D consistency, and temporal stability, and fine-tuning on human-centric datasets extends the method to human-centric HOI video generation.

Abstract

In this paper, we introduce ManiVideo, a novel method for generating consistent and temporally coherent bimanual hand-object manipulation videos from given motion sequences of hands and objects. The core idea of ManiVideo is the construction of a multi-layer occlusion (MLO) representation that learns 3D occlusion relationships from occlusion-free normal maps and occlusion confidence maps. By embedding the MLO structure into the UNet in two forms, the model enhances the 3D consistency of dexterous hand-object manipulation. To further achieve the generalizable grasping of objects, we integrate Objaverse, a large-scale 3D object dataset, to address the scarcity of video data, thereby facilitating the learning of extensive object consistency. Additionally, we propose an innovative training strategy that effectively integrates multiple datasets, supporting downstream tasks such as human-centric hand-object manipulation video generation. Through extensive experiments, we demonstrate that our approach not only achieves video generation with plausible hand-object interaction and generalizable objects, but also outperforms existing SOTA methods.
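
As a rough illustration of what an occlusion confidence map could look like, the sketch below derives one soft map per layer from independently rendered depth layers. This is not the paper's definition of $D$, which this summary does not give. It assumes that each layer (for example, a finger or the object) is rendered on its own without occlusion, that empty pixels carry infinite depth, and that confidence decays exponentially with the depth gap behind the frontmost surface. The function name `occlusion_confidence` and the parameter `tau` are hypothetical.

```python
import numpy as np


def occlusion_confidence(depths, tau=0.02):
    """Soft per-layer visibility from independently rendered depth layers.

    ASSUMPTION: illustrative only, not the formula used by ManiVideo.

    depths: array of shape (K, H, W); depths[k] is the depth of layer k
            rendered alone, with np.inf where the layer covers no pixel.
    tau:    hypothetical softness scale, in the same units as depth.
    Returns an array of shape (K, H, W) with values in [0, 1].
    """
    depths = np.asarray(depths, dtype=np.float64)
    covered = np.isfinite(depths)          # where each layer has a surface
    front = depths.min(axis=0)             # frontmost depth per pixel

    # Replace infinities before subtracting to avoid inf - inf = nan.
    safe_depths = np.where(covered, depths, 0.0)
    safe_front = np.where(np.isfinite(front), front, 0.0)

    # Depth gap behind the frontmost surface; zero where the layer is absent.
    gap = np.where(covered, safe_depths - safe_front[None], 0.0)

    # Frontmost layer gets 1; layers further behind decay toward 0.
    return np.where(covered, np.exp(-gap / tau), 0.0)
```

Under these assumptions, the frontmost layer at a pixel receives confidence 1, a layer lying `tau` behind it receives about 0.37, and a layer that does not cover the pixel receives 0. A soft map of this kind, rather than a hard visibility mask, keeps a signal for surfaces that are only slightly occluded, which is one plausible reason to pair such maps with occlusion-free normals.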

Paper Structure

This paper contains 16 sections, 5 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: The overall framework of ManiVideo. Given raw hand-object signals, we first transform them into a multi-layer occlusion (MLO) representation and an object representation. The MLO structure, which comprises occlusion-free normal maps $H$ and occlusion confidence maps $D$, is designed to enforce the 3D consistency of HOI. The object representation contains appearance and geometry information, ensuring the dynamic consistency of objects. We then inject the MLO representation and the object representation into the denoising UNet and AppearanceNet.
  • Figure 2: Qualitative comparison of different methods on the DexYCB dataset (Chao et al., 2021). Our method performs best in cases of hand-object mutual occlusion and finger self-occlusion.
  • Figure 3: Qualitative comparison of different methods on videos we collected. Our approach achieves the best results.
  • Figure 4: Ablation study of the multi-layer occlusion (MLO) representation. Without the MLO structure, basic 2D conditions fail to ensure accurate structure and occlusion relationships between objects and fingers. Incomplete embedding (w/o MLO*) diminishes the effectiveness of the MLO representation.
  • Figure 5: Ablation study of object augmentation training. Utilizing Objaverse helps the model learn dynamic consistency from large object datasets.
  • ...and 2 more figures