Multi-task real-robot data with gaze attention for dual-arm fine manipulation

Heecheol Kim, Yoshiyuki Ohmura, Yasuo Kuniyoshi

TL;DR

The paper introduces a 224k-episode, dual-arm fine manipulation dataset with gaze-based visual attention and dual-action labels, addressing the gap in scalable, multi-task datasets for dual-arm manipulation. It proposes Dual-Action and Attention (DAA), a Transformer-based, language-conditioned policy that uses foveated vision and gaze to robustly learn global-reaching trajectories and precise local interactions. Real-robot experiments show the multi-task DAA model achieves strong generalization to new objects and robustness to lighting and background variations, outperforming task-specific baselines. The work provides a publicly available dataset and demonstrates the value of gaze-guided attention and dual-action for fine-grained dual-arm manipulation, while outlining directions for semantic reasoning and broader robot applicability.

Abstract

In the field of robotic manipulation, deep imitation learning is recognized as a promising approach for acquiring manipulation skills. Additionally, learning from diverse robot datasets is considered a viable method to achieve versatility and adaptability. In such research, by learning various tasks, robots achieved generality across multiple objects. However, such multi-task robot datasets have mainly focused on single-arm tasks that are relatively imprecise, not addressing the fine-grained object manipulation that robots are expected to perform in the real world. This paper introduces a dataset of diverse object manipulations that includes dual-arm tasks and/or tasks requiring fine manipulation. To this end, we have generated a dataset with 224k episodes (150 hours, 1,104 language instructions) which includes dual-arm fine tasks such as bowl-moving, pencil-case opening, or banana-peeling, and this data is publicly available. Additionally, this dataset includes visual attention signals, dual-action labels (a signal that separates actions into a robust reaching trajectory and precise interaction with objects), and language instructions to achieve robust and precise object manipulation. We applied the dataset to our Dual-Action and Attention (DAA), a model designed for fine-grained dual-arm manipulation tasks and robust against covariate shifts. The model was tested with over 7k total trials in real robot manipulation tasks, demonstrating its capability in fine manipulation.

Paper Structure

This paper contains 18 sections, 11 figures, and 5 tables.

Figures (11)

  • Figure 1: Dataset and outline of Dual-Action and Attention.
  • Figure 2: Visual examples of tasks in the dataset.
  • Figure 3: Robot and teleoperation framework with an eye tracker.
  • Figure 4: Brief explanation of dual-action and gaze-based attention.
  • Figure 5: Neural network architectures. (a) The policy predictor outputs global-action and local-action through a Transformer encoder-decoder structure. It processes inputs such as high-resolution foveated vision attended by the gaze predictor, robot arm states, gaze coordinates, and language instructions. (b) The gaze predictor generates output by applying cross-attention to visual embeddings of global vision and language embeddings. The final gaze position is sampled based on the probability distribution of this output.
  • ...and 6 more figures
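
The Figure 5 caption says that the gaze predictor outputs a probability distribution from which the final gaze position is sampled, and that the policy predictor consumes high-resolution foveated vision around that gaze. The sketch below illustrates only that sample-then-crop idea in plain Python. It is not the authors' implementation: the function names (`softmax_2d`, `sample_gaze`, `gaze_to_pixel`, `foveated_crop`), the 4x4 score grid, the 16x16 toy image, and the crop size are all assumptions made for illustration.

```python
import math
import random


def softmax_2d(logits):
    """Turn a 2-D grid of scores into a probability map that sums to 1."""
    peak = max(max(row) for row in logits)  # subtract the max for numerical stability
    exps = [[math.exp(v - peak) for v in row] for row in logits]
    total = sum(sum(row) for row in exps)
    return [[v / total for v in row] for row in exps]


def sample_gaze(prob_map, rng):
    """Sample one (row, col) grid cell with probability given by prob_map."""
    width = len(prob_map[0])
    flat = [p for row in prob_map for p in row]
    index = rng.choices(range(len(flat)), weights=flat, k=1)[0]
    return divmod(index, width)


def gaze_to_pixel(cell, grid_shape, image_shape):
    """Map a coarse grid cell to the pixel at the centre of that cell."""
    row = int((cell[0] + 0.5) * image_shape[0] / grid_shape[0])
    col = int((cell[1] + 0.5) * image_shape[1] / grid_shape[1])
    return row, col


def foveated_crop(image, center, size):
    """Cut a size x size patch around center, shifted to stay inside the image."""
    height, width = len(image), len(image[0])
    top = min(max(center[0] - size // 2, 0), height - size)
    left = min(max(center[1] - size // 2, 0), width - size)
    return [row[left:left + size] for row in image[top:top + size]]


# Toy usage (all values hypothetical): a 16x16 "image" and a 4x4 score grid
# whose scores are peaked at grid cell (1, 2).
rng = random.Random(0)
image = [[r * 16 + c for c in range(16)] for r in range(16)]
logits = [[0.0] * 4 for _ in range(4)]
logits[1][2] = 8.0
probs = softmax_2d(logits)
cell = sample_gaze(probs, rng)
center = gaze_to_pixel(cell, (4, 4), (16, 16))
patch = foveated_crop(image, center, 4)
```

Sampling from the distribution, rather than always taking its peak, mirrors the caption's wording. The clamping in `foveated_crop` is one simple way to handle a gaze point near the image border and is a design choice of this sketch, not something the source specifies.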