Multi-task real-robot data with gaze attention for dual-arm fine manipulation
Heecheol Kim, Yoshiyuki Ohmura, Yasuo Kuniyoshi
TL;DR
The paper introduces a 224k-episode, dual-arm fine manipulation dataset with gaze-based visual attention and dual-action labels, addressing the gap in scalable, multi-task datasets for dual-arm manipulation. It proposes Dual-Action and Attention (DAA), a Transformer-based, language-conditioned policy that uses foveated vision and gaze to robustly learn global-reaching trajectories and precise local interactions. Real-robot experiments show the multi-task DAA model achieves strong generalization to new objects and robustness to lighting and background variations, outperforming task-specific baselines. The work provides a publicly available dataset and demonstrates the value of gaze-guided attention and dual-action for fine-grained dual-arm manipulation, while outlining directions for semantic reasoning and broader robot applicability.
Abstract
In robotic manipulation, deep imitation learning is recognized as a promising approach for acquiring manipulation skills, and learning from diverse robot datasets is considered a viable route to versatility and adaptability. In such research, robots that learned a variety of tasks achieved generality across multiple objects. However, these multi-task robot datasets have mainly focused on relatively imprecise single-arm tasks and do not address the fine-grained object manipulation that robots are expected to perform in the real world. This paper introduces a dataset of diverse object manipulations that includes dual-arm tasks and/or tasks requiring fine manipulation. To this end, we have generated a dataset of 224k episodes (150 hours, 1,104 language instructions) that includes dual-arm fine tasks such as bowl-moving, pencil-case opening, and banana-peeling; the data is publicly available. To support robust and precise object manipulation, the dataset also includes visual attention signals, language instructions, and dual-action labels, which separate actions into a robust reaching trajectory and a precise interaction with objects. We applied the dataset to our Dual-Action and Attention (DAA) model, which is designed for fine-grained dual-arm manipulation tasks and is robust against covariate shifts. The model was tested in over 7k total trials of real-robot manipulation tasks, demonstrating its capability in fine manipulation.
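
To make the role of the visual attention signal concrete, the following is a minimal sketch of how a gaze coordinate could be used to extract a high-resolution foveated patch alongside a low-resolution global view. It is an illustration under stated assumptions, not the paper's implementation: the function names `foveated_crop` and `peripheral_view`, the 64-pixel patch size, and the stride of 4 are all hypothetical choices.

```python
import numpy as np


def foveated_crop(image, gaze_xy, patch=64):
    """Return a patch x patch crop centered on the gaze point.

    The crop window is shifted as needed so it stays inside the image.
    `patch` is a hypothetical size chosen for illustration only.
    """
    h, w = image.shape[:2]
    if patch > h or patch > w:
        raise ValueError("patch must fit inside the image")
    gx, gy = gaze_xy
    half = patch // 2
    # Clamp the top-left corner so the window never leaves the image.
    x0 = int(min(max(round(gx) - half, 0), w - patch))
    y0 = int(min(max(round(gy) - half, 0), h - patch))
    return image[y0:y0 + patch, x0:x0 + patch]


def peripheral_view(image, stride=4):
    """Cheap low-resolution global view via strided subsampling (an assumption)."""
    return image[::stride, ::stride]


if __name__ == "__main__":
    frame = np.zeros((256, 256, 3), dtype=np.uint8)
    fovea = foveated_crop(frame, gaze_xy=(128, 128))
    periphery = peripheral_view(frame)
    print(fovea.shape, periphery.shape)
```

In a setup like this, the low-resolution view would plausibly serve coarse reaching, while the foveated patch would supply the detail needed for precise interaction, mirroring the split that the dual-action labels describe.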
