OAKINK2: A Dataset of Bimanual Hands-Object Manipulation in Complex Task Completion

Xinyu Zhan, Lixin Yang, Yifei Zhao, Kangrui Mao, Hanlin Xu, Zenan Lin, Kailin Li, Cewu Lu

TL;DR

The paper explores a task-oriented framework for Complex Task Completion (CTC), which aims to generate a sequence of bimanual manipulations that achieves a task objective, and employs Large Language Models to decompose complex task objectives into sequences of Primitive Tasks.

Abstract

We present OAKINK2, a dataset of bimanual object manipulation tasks for complex daily activities. To organize complex tasks into a structured representation, OAKINK2 introduces three levels of abstraction for manipulation tasks: Affordance, Primitive Task, and Complex Task. OAKINK2 adopts an object-centric perspective for decoding complex tasks, treating them as a sequence of object affordance fulfillments. The first level, Affordance, outlines the functionalities that objects in the scene can afford. The second level, Primitive Task, describes the minimal interaction units through which humans interact with an object to achieve its affordance. The third level, Complex Task, illustrates how Primitive Tasks are composed and how they depend on one another. The OAKINK2 dataset provides multi-view image streams and precise pose annotations for the human body, hands, and various interacting objects. This extensive collection supports applications such as interaction reconstruction and motion synthesis. Based on the three-level abstraction of OAKINK2, we explore a task-oriented framework for Complex Task Completion (CTC). CTC aims to generate a sequence of bimanual manipulations that achieves the task objectives. Within the CTC framework, we employ Large Language Models (LLMs) to decompose complex task objectives into sequences of Primitive Tasks, and we develop a Motion Fulfillment Model that generates bimanual hand motion for each Primitive Task. The OAKINK2 dataset and models are available at https://oakink.net/v2.
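
The three levels of abstraction can be pictured as a small data model. The following is a minimal sketch, not the dataset's actual schema or API: the class names, field names, and the Primitive Task names (`cut_fruit`, `pour_tea`, `heat_bowl`) are hypothetical and chosen only for illustration. It assumes that the dependencies among Primitive Tasks form a directed acyclic graph, so that a valid execution order can be obtained with a topological sort.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass(frozen=True)
class Affordance:
    """Level 1: a functionality that an object can afford."""
    object_name: str
    function: str


@dataclass(frozen=True)
class PrimitiveTask:
    """Level 2: a minimal interaction unit that fulfills one affordance."""
    name: str
    affordance: Affordance


@dataclass
class ComplexTask:
    """Level 3: Primitive Tasks plus the dependencies between them."""
    goal: str
    primitives: dict[str, PrimitiveTask] = field(default_factory=dict)
    # Maps a Primitive Task name to the names it depends on (hypothetical layout).
    dependencies: dict[str, set[str]] = field(default_factory=dict)

    def add(self, task: PrimitiveTask, after: tuple[str, ...] = ()) -> None:
        """Register a Primitive Task that must run after the given tasks."""
        self.primitives[task.name] = task
        self.dependencies[task.name] = set(after)

    def execution_order(self) -> list[str]:
        """Return one ordering that respects every dependency."""
        return list(TopologicalSorter(self.dependencies).static_order())


# Illustrative objects and task names only; they are not taken from the dataset.
task = ComplexTask(goal="Prepare a bowl of hot sweet fruit tea.")
task.add(PrimitiveTask("cut_fruit", Affordance("knife", "cut")))
task.add(PrimitiveTask("pour_tea", Affordance("tea bottle", "pour")))
task.add(
    PrimitiveTask("heat_bowl", Affordance("microwave oven", "heat")),
    after=("cut_fruit", "pour_tea"),
)
order = task.execution_order()
print(order)
```

In this sketch the two independent Primitive Tasks may appear in either order, while the task that depends on both always comes last. A graph rather than a fixed list is used because several execution paths can satisfy the same Complex Task.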

Paper Structure (33 sections, 3 equations, 17 figures, 6 tables)

This paper contains 33 sections, 3 equations, 17 figures, 6 tables.

Figures (17)

  • Figure 1: An overview of the data and content of our proposed OakInk2 dataset. The OakInk2 dataset focuses on bimanual object manipulation tasks for complex daily activities. 1) The top row shows the data collection process, including the task setup (top-left panel), human demonstration (top-center), and annotation (top-right). 2) The second row shows the three levels of abstraction that OakInk2 constructs for complex tasks: Affordance, Primitive Task, and Complex Task. The OakInk2 dataset provides allocentric and egocentric videos of the human manipulation process, as well as the corresponding 3D pose annotations and task specifications.
  • Figure 2: Illustration of the complex task acquisition process. This figure uses the Complex Task 'Prepare a bowl of hot sweet fruit tea.' to demonstrate the process. Initially, the annotators analyze the affordances of four essential objects (a gripper, a knife, a tea bottle, and a microwave oven) and design the corresponding Primitives. For instance, to prepare fruit slices, the Primitive cut, associated with the knife blade, is required. Following this, an expert arranges the scene for the Complex Task, and then the subject, using the designed Primitives, plans the execution path of the Complex Task. Later, these execution paths are structured into Primitive Dependency Graphs.
  • Figure 3: Capture platform. $12$ MoCap cameras are circled in blue and $4$ RGB cameras in red.
  • Figure 4: Commentary of the task execution. The left column shows the current state of the scene. The center column shows the narrative dialog retrieved from experts. The right column shows the upcoming Primitive Task to be executed.
  • Figure 5: Architecture of MF-MDM. Random noise $\boldsymbol{x}_T$ is sampled first; then, at each step iterating from $T$ to $1$, MF-MDM G predicts the clean sample $\hat{\boldsymbol{x}}_0$ and diffuses it back to $\boldsymbol{x}_{t-1}$. Once the generated sample $\boldsymbol{x}_0$ is obtained, it is refined by MF-MDM R for better interaction details.
  • ...and 12 more figures
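
The sampling loop described in the Figure 5 caption can be sketched as follows. This is a simplified illustration and not the paper's implementation: the `generator` and `refiner` callables are hypothetical stand-ins for MF-MDM G and MF-MDM R, the linear noise schedule is an assumption, and the "diffuse back" step is assumed to re-noise the predicted clean sample with the forward marginal $\boldsymbol{x}_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\,\hat{\boldsymbol{x}}_0 + \sqrt{1-\bar{\alpha}_{t-1}}\,\boldsymbol{\epsilon}$. Task conditioning is omitted.

```python
import math
import random


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta) for an assumed linear schedule.

    Index 0 is 1.0, meaning "no noise", so that x_0 equals the last prediction.
    """
    alpha_bars = [1.0]
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        alpha_bars.append(alpha_bars[-1] * (1.0 - beta))
    return alpha_bars


def sample(generator, refiner, dim, num_steps, rng):
    """Predict the clean sample at every step, then diffuse it back one step."""
    alpha_bars = make_alpha_bars(num_steps)
    x_t = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # x_T: pure noise
    for t in range(num_steps, 0, -1):
        x0_hat = generator(x_t, t)  # stand-in for MF-MDM G
        a_prev = alpha_bars[t - 1]
        # Re-noise the prediction to the noise level of step t - 1.
        x_t = [
            math.sqrt(a_prev) * v + math.sqrt(1.0 - a_prev) * rng.gauss(0.0, 1.0)
            for v in x0_hat
        ]
    return refiner(x_t)  # stand-in for MF-MDM R


# Toy stand-ins: the "generator" always predicts 0.5 and the "refiner" is the identity.
result = sample(
    generator=lambda x, t: [0.5] * len(x),
    refiner=lambda x: x,
    dim=3,
    num_steps=10,
    rng=random.Random(0),
)
print(result)
```

Because the noise level at step 0 is zero in this sketch, the final iteration returns the generator's last prediction unchanged, and only then is the refiner applied.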