TACO: Benchmarking Generalizable Bimanual Tool-ACtion-Object Understanding

Yun Liu, Haolin Yang, Xu Si, Ling Liu, Zipeng Li, Yuxiang Zhang, Yebin Liu, Li Yi

TL;DR

We construct TACO, an extensive bimanual hand-object-interaction dataset spanning a large variety of tool-action-object compositions for daily human activities, and benchmark three generalizable hand-object-interaction tasks: compositional action recognition, generalizable hand-object motion forecasting, and cooperative grasp synthesis.

Abstract

Humans commonly work with multiple objects in daily life and can intuitively transfer manipulation skills to novel objects by understanding object functional regularities. However, existing technical approaches for analyzing and synthesizing hand-object manipulation are mostly limited to handling a single hand and object due to the lack of data support. To address this, we construct TACO, an extensive bimanual hand-object-interaction dataset spanning a large variety of tool-action-object compositions for daily human activities. TACO contains 2.5K motion sequences paired with third-person and egocentric views, precise hand-object 3D meshes, and action labels. To rapidly expand the data scale, we present a fully automatic data acquisition pipeline combining multi-view sensing with an optical motion capture system. With the vast research fields provided by TACO, we benchmark three generalizable hand-object-interaction tasks: compositional action recognition, generalizable hand-object motion forecasting, and cooperative grasp synthesis. Extensive experiments reveal new insights, challenges, and opportunities for advancing the studies of generalizable hand-object motion analysis and synthesis. Our data and code are available at https://taco2024.github.io.

Paper Structure (30 sections, 13 equations, 20 figures, 7 tables)

This paper contains 30 sections, 13 equations, 20 figures, 7 tables.

Figures (20)

  • Figure 1: Data capturing system and camera views.
  • Figure 2: Automatic data annotation pipeline. The input consists of color frames from allocentric views, pre-scanned object models, and 3D positions of markers attached to object surfaces. We first separately localize 3D hand keypoints and obtain object poses, and then conduct contact-aware optimization to recover MANO hand meshes. We finally segment hands and objects from images and automatically inpaint markers to acquire realistic object appearances.
  • Figure 3: An example of marker removal. Three sub-figures respectively show the captured image patch, the automatically-computed marker mask, and the inpainted image patch.
  • Figure 4: Examples of interaction triplets in TACO. The left, middle, and right columns exemplify categories of tool, action, and target object, respectively. Triplets in our dataset are represented by connected paths from tools to target objects.
  • Figure 5: Qualitative contact optimization evaluation. From left to right: original RGB image, optimization without attraction loss, optimization without penetration loss, our method with attraction loss and penetration loss.
  • ...and 15 more figures