Robust Imitation Learning for Mobile Manipulator Focusing on Task-Related Viewpoints and Regions

Yutaro Ishida, Yuki Noguchi, Takayuki Kanai, Kazuhiro Shintani, Hiroshi Bito

TL;DR

This paper proposes a robust imitation learning method for mobile manipulators that, when observing multiple viewpoints, focuses on task-related viewpoints and their spatial regions, yielding optimal viewpoints and visual embeddings that are robust to occlusion and domain shift.

Abstract

We study how to generalize the visuomotor policy of a mobile manipulator from the perspective of visual observations. A mobile manipulator is prone to occlusion by its own body when only a single viewpoint is employed, and to significant domain shift when deployed in diverse situations. However, to the best of the authors' knowledge, no study has addressed occlusion and domain shift simultaneously and proposed a robust policy. In this paper, we propose a robust imitation learning method for mobile manipulators that focuses on task-related viewpoints and their spatial regions when observing multiple viewpoints. The multiple-viewpoint policy includes an attention mechanism, which is learned with an augmented dataset and yields optimal viewpoints and visual embeddings that are robust to occlusion and domain shift. A comparison of our results for different tasks and environments with those of previous studies revealed that our proposed method improves the success rate by up to 29.3 points. We also conduct ablation studies using our proposed method. Learning task-related viewpoints from the multiple-viewpoint dataset provides greater robustness to occlusion than using a uniquely defined viewpoint. Focusing on task-related regions contributes to up to a 33.3-point improvement in the success rate against domain shift.

Paper Structure (16 sections, 12 figures, 9 tables)

Figures (12)

  • Figure 1: Example of occlusion of internal viewpoints on a mobile manipulator (MM). Left: the first-person viewpoint is occluded by the body of the MM in the pick task. Right: the in-hand viewpoint is occluded by the grasped object in the place task.
  • Figure 2: Example of domain shift in visual observations. Left: environments used for training the policy. Middle: distractor objects cause a minor change. Right: unknown furniture causes a major change.
  • Figure 3: Attention mechanism for multiple viewpoints and their spatial regions. By weighting the features with spatial attention, the image encoders extract information on task-related viewpoints and their spatial regions from multiple visual observations. Since the spatial attention is a learnable parameter, our method can learn task-related viewpoints from the dataset instead of having them uniquely defined by hand.
  • Figure 4: Processing steps of the fast, low-computational-resource augmentation using fractal textures. By detecting and tracking task-related regions, non-task-related regions are augmented with fractal textures. The augmentation facilitates learning an attention mechanism that focuses strongly on task-related regions, which change little, rather than on non-task-related regions, which are changed substantially by the fractal textures.
  • Figure 5: Overview of the pick-bottle-from-shelf task. Figures are arranged in time-step order from left to right. Left: the MM started with $o_{h}$ and $o_{f}$ facing the bottle placed on the shelf. Middle: the MM moved the mobile base and arm to reach the bottle. Right: the MM picked up the bottle from the shelf.
  • ...and 7 more figures
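
The two ideas described in the captions of Figures 3 and 4 can be illustrated with a minimal sketch. It is not the authors' implementation. The function names (`attend`, `augment`), the joint softmax over all viewpoints and positions, and the use of fixed logits in place of learned parameters are all assumptions made for illustration. The `texture` argument is a stand-in for the fractal texture, and the task-related mask is assumed to be given rather than produced by detection and tracking.

```python
import math


def softmax(values):
    """Numerically stable softmax over a flat list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attend(features, logits):
    """Pool features from several viewpoints using spatial attention.

    features[v][p] is the feature vector at spatial position p of viewpoint v.
    logits[v][p] is the attention logit for that position. In a trained policy
    these would be learned; here they are fixed inputs (assumption).
    The softmax is taken jointly over all viewpoints and positions (assumption).
    Returns the pooled embedding and the attention mass of each viewpoint.
    """
    weights = softmax([l for view in logits for l in view])
    dim = len(features[0][0])
    embedding = [0.0] * dim
    view_mass = []
    i = 0
    for view in features:
        mass = 0.0
        for vec in view:
            w = weights[i]
            i += 1
            mass += w
            for d in range(dim):
                embedding[d] += w * vec[d]
        view_mass.append(mass)
    return embedding, view_mass


def augment(image, task_mask, texture):
    """Keep task-related pixels; replace every other pixel with the texture."""
    return [
        [pix if keep else tex for pix, keep, tex in zip(row, mask_row, tex_row)]
        for row, mask_row, tex_row in zip(image, task_mask, texture)
    ]


# Two viewpoints with two spatial positions each; viewpoint 1 gets higher logits.
features = [[[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]]]
logits = [[0.0, 0.0], [5.0, 5.0]]
embedding, view_mass = attend(features, logits)
print(view_mass)  # most of the attention mass falls on viewpoint 1

# Only the masked (task-related) pixels survive; the rest take texture values.
print(augment([[1, 2], [3, 4]], [[True, False], [False, True]], [[9, 9], [9, 9]]))
```

Summing the weights per viewpoint gives a simple way to read off which viewpoint the policy relies on. The augmentation leaves the task-related pixels untouched, so only the remaining regions vary between augmented samples.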