GigaHands: A Massive Annotated Dataset of Bimanual Hand Activities

Rao Fu, Dingxi Zhang, Alex Jiang, Wanjia Fu, Austin Funk, Daniel Ritchie, Srinath Sridhar

TL;DR

GigaHands tackles the scarcity of large-scale, richly annotated 3D bimanual hand activity data by introducing a markerless, multi-view dataset that captures 34 hours of activity from 56 subjects across 417 objects. It provides 183 million frames, 14k motion clips, and 84k text annotations, paired with fully automatic 3D hand and object estimation and dense camera views enabling dynamic radiance field reconstruction. The paper demonstrates the dataset's value through improvements in text-driven hand motion synthesis, motion captioning, and 3D scene reconstruction, powered by an instruct-to-annotate data collection pipeline. Overall, GigaHands advances the scalability and diversity of hand-action data, with significant implications for AI, robotics, and interactive 3D understanding.

Abstract

Understanding bimanual human hand activities is a critical problem in AI and robotics. We cannot build large models of bimanual activities because existing datasets lack the scale, coverage of diverse hand activities, and detailed annotations. We introduce GigaHands, a massive annotated dataset capturing 34 hours of bimanual hand activities from 56 subjects and 417 objects, totaling 14k motion clips derived from 183 million frames paired with 84k text annotations. Our markerless capture setup and data acquisition protocol enable fully automatic 3D hand and object estimation while minimizing the effort required for text annotation. The scale and diversity of GigaHands enable broad applications, including text-driven action synthesis, hand motion captioning, and dynamic radiance field reconstruction. Our website is available at https://ivl.cs.brown.edu/research/gigahands.html.

Paper Structure

This paper contains 29 sections, 8 figures, 4 tables.

Figures (8)

  • Figure 1: GigaHands is a massive dataset of human bimanual activities with paired text annotations. Each column above shows an activity sequence from the dataset. The dataset covers diverse 3D hand activities, including hand-object interactions (blue) across object scales, gestures (orange), and self-interactions (red). Each clip is paired with descriptive text and 51 camera views, enabling radiance field reconstruction. The left half of the bottom row shows other annotations in the dataset, including hand shape and object shape and pose. The right half shows novel views from dynamic radiance field fitting.
  • Figure 2: Dataset Diversity. The left and middle figures illustrate the diversity of pose and motion variations in GigaHands, visualized using t-SNE embeddings. Some points along the convex hull are highlighted with their corresponding text instructions, showcasing unique motions captured in our dataset. The right figure compares the verb sets among different datasets using an UpSet visualization [lex2014upset]. Each column represents the number of verbs exclusive to a specific subset of datasets, indicated by the connected dots below the column. The rows indicate the total verb count in each dataset. GigaHands contains more verbs and more exclusive verbs than the other datasets.
  • Figure 3: Diverse Objects and Frequent Hand Contact Regions. GigaHands provides objects (left) spanning diverse scenarios, including cooking, office work, crafting, entertainment, and housework. The diverse activities result in contact regions (right) spanning both the front and back of both hands.
  • Figure 4: Instruct-to-Annotate Pipeline. The instruction elicitation process (left yellow block) creates atomic action-level instruction scripts in a temporally smooth order, structured within scenes. This is achieved by parsing action datasets, grouping verbs into a pool, structuring scenarios, and generating scene scripts. During filming, subjects act according to these scripts, producing recorded motion sequences. Annotators then process these sequences (right blue block) by segmenting them into clips and annotating unscripted motions.
  • Figure 5: Generated motions from models trained on different datasets. Text highlighted in green, orange, and blue comes from the OakInk2, TACO, and GigaHands datasets, respectively. In the bottom two rows, hand meshes highlighted in these colors are generated by models trained on the corresponding datasets. The model trained on GigaHands can generate diverse motions from a single text (right four columns) and accurate motions from text drawn from other datasets (left two columns). Darker colors indicate later frames in the sequence.
  • ...and 3 more figures
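
The column and row counts described in the Figure 2 caption can be illustrated with a minimal sketch. It assumes each dataset is reduced to a set of verbs; the dataset names and verbs below are hypothetical toy values, not taken from GigaHands or any other dataset, and the function name is illustrative only. Each "column" count is the number of verbs that appear in exactly one combination of datasets, and each "row" count is a dataset's total verb count.

```python
from collections import Counter


def exclusive_intersections(sets_by_name):
    """Count items that belong to exactly each combination of named sets.

    Returns a Counter keyed by frozenset of set names. An item is counted
    under the combination of all sets that contain it, and nowhere else.
    """
    membership = {}
    for name, items in sets_by_name.items():
        for item in items:
            membership.setdefault(item, set()).add(name)
    return Counter(frozenset(names) for names in membership.values())


# Hypothetical toy verb sets (assumed for illustration only).
verbs = {
    "dataset_a": {"cut", "pour", "stir", "fold"},
    "dataset_b": {"cut", "pour", "press"},
    "dataset_c": {"cut", "wave"},
}

# "Columns": verbs exclusive to each combination of datasets.
column_counts = exclusive_intersections(verbs)

# "Rows": total verb count per dataset.
row_totals = {name: len(items) for name, items in verbs.items()}

for combo, count in sorted(column_counts.items(), key=lambda kv: -kv[1]):
    print(sorted(combo), count)
print(row_totals)
```

Because every verb is counted under exactly one combination, the column counts sum to the size of the union of all verb sets, which is a quick sanity check when reading such a plot.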