Transformer-based deep imitation learning for dual-arm robot manipulation

Heecheol Kim, Yoshiyuki Ohmura, Yasuo Kuniyoshi

TL;DR

This work tackles the challenge of distractions in deep imitation learning for dual-arm robot manipulation by introducing a Transformer-based self-attention mechanism that dynamically weighs gaze, left-arm, right-arm, and vision inputs. By fusing a gaze-prediction module with multi-sensory state representations and a Transformer encoder, the approach selectively attends to task-relevant inputs, yielding robust end-to-end policies. Real-robot experiments across uncoordinated, goal-coordinated, and bimanual tasks show performance gains over baselines lacking attention, and analysis of the attention weights confirms input-dependent focusing behavior. The study suggests that Transformer-based sensorimotor attention can be extended to more complex multi-arm systems and potentially enhanced with additional modalities such as tactile feedback.

Abstract

Deep imitation learning is promising for solving dexterous manipulation tasks because it requires neither an environment model nor pre-programmed robot behavior. However, its application to dual-arm manipulation tasks remains challenging. In a dual-arm manipulation setup, the additional robot manipulators increase the number of state dimensions, which causes distractions and degrades the performance of the neural networks. We address this issue using a self-attention mechanism that computes dependencies between elements in a sequential input and focuses on the important elements. A Transformer, a variant of the self-attention architecture, is applied to deep imitation learning to solve dual-arm manipulation tasks in the real world. The proposed method has been tested on dual-arm manipulation tasks using a real robot. The experimental results demonstrate that the Transformer-based deep imitation learning architecture can attend to the important features among the sensory inputs, thereby reducing distractions and improving manipulation performance compared with the baseline architecture without the self-attention mechanism. Data from this and related works are available at: https://sites.google.com/view/multi-task-fine.
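
To make the self-attention idea in the abstract concrete, the following is a minimal sketch of scaled dot-product self-attention over four input elements. It is not the authors' architecture: the token names, the 4-dimensional embeddings, and the use of identity projections in place of learned query, key, and value matrices are all assumptions made for illustration, and a real Transformer encoder would add learned projections, multiple heads, and feed-forward layers.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with identity Q/K/V projections.

    tokens: list of equal-length feature vectors, one per input element.
    Returns (outputs, weights); weights[i][j] is how strongly
    element i attends to element j, and each row sums to 1.
    """
    d = len(tokens[0])
    scale = math.sqrt(d)
    outputs, weights = [], []
    for q in tokens:
        # Similarity of this element (query) to every element (keys).
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in tokens]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted sum of all elements (values).
        outputs.append(
            [sum(w[j] * tokens[j][c] for j in range(len(tokens))) for c in range(d)]
        )
    return outputs, weights


# Hypothetical embeddings, chosen only for illustration.
names = ["gaze", "left_arm", "right_arm", "vision"]
tokens = [
    [1.0, 0.0, 0.5, 0.0],  # gaze
    [0.9, 0.1, 0.4, 0.0],  # left arm
    [0.0, 1.0, 0.0, 0.2],  # right arm
    [0.8, 0.0, 0.6, 0.1],  # vision
]

outputs, weights = self_attention(tokens)
for name, row in zip(names, weights):
    print(name, [round(w, 3) for w in row])
```

In this toy input the right-arm embedding is nearly orthogonal to the gaze embedding, so the gaze element assigns it the smallest attention weight. This mirrors, in a very reduced form, how attention can down-weight an input that is irrelevant to the current step of a task.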

Paper Structure

This paper contains 14 sections, 7 equations, 14 figures, 2 tables.

Figures (14)

  • Figure 1: Proposed Transformer-based deep imitation learning architecture for the dual-arm manipulation.
  • Figure 2: Neural network architectures.
  • Figure 3: Example of the proposed method on Pick. The robot first picked the toy apple ($\sim10.0$s), lifted it up ($15.0$s), and picked up the orange ($25.0$s).
  • Figure 4: Example of the proposed method on BoxPush. The robot placed both of its arms behind the box ($\sim6.0$s), pushed it with the right arm ($7.5$s), and moved it with both arms to the goal position ($\sim20.0$s).
  • Figure 5: Example of the proposed method on ChangeHands. The robot first grasped the toy banana with its left hand ($\sim10.0$s), stood it up ($15.0$s), regrasped it with the right hand ($20.0$s $\sim$ $25.0$s), and finally flipped it ($30$s).
  • ...and 9 more figures