Adaptive Visual Imitation Learning for Robotic Assisted Feeding Across Varied Bowl Configurations and Food Types

Rui Liu, Amisha Bhaskar, Pratap Tokekar

TL;DR

This work tackles robust, adaptive robotic feeding by learning a visuomotor policy that generalizes across diverse bowl configurations and food types. AVIL integrates a spatial attention module with vision and proprioception embeddings to map RGB observations and joint states to multi-step joint actions, trained via behavior cloning on demonstrations. The approach achieves up to 2.5x improvement over a handcrafted baseline, demonstrates zero-shot generalization from data collected in a single bowl, and remains robust in the presence of distractors. The results indicate that combining spatially attentive perception with imitation learning can enable practical, versatile robot-assisted feeding in real-world environments.

Abstract

We introduce a visual imitation network with a spatial attention module for robotic assisted feeding (RAF). The goal is to acquire (i.e., scoop) food items from a bowl, a task in which robust and adaptive food manipulation is particularly challenging. To address this, we propose a framework that integrates visual perception with imitation learning so that the robot can handle diverse scenarios during scooping. Our approach, named AVIL (adaptive visual imitation learning), is adaptable and robust across bowl configurations that differ in material, size, and position, and across food types including granular, semi-solid, and liquid, even in the presence of distractors. We validate the approach through experiments on a real robot and compare its performance with a baseline. The results show improvement over the baseline across all scenarios, by up to 2.5 times on a success metric. Notably, our model, trained solely on data from a transparent glass bowl containing granular cereals, generalizes zero-shot to other bowl configurations with different types of food.
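
To make the data flow in the TL;DR concrete, here is a minimal sketch of a policy that pools image features with spatial attention, concatenates them with a proprioception embedding, and predicts a multi-step joint action trained with a behavior cloning loss. It is not the authors' network: the scoring rule for the attention map, the layer sizes, the horizon of 4 steps, the mean-squared-error loss, and every name (`spatial_attention`, `policy`, `bc_loss`, `W_p`, `W_out`) are assumptions chosen for illustration. A random array stands in for the output of an image encoder, and the 6 joints match a 6-DoF arm such as the UR5e.

```python
import numpy as np

rng = np.random.default_rng(0)

def spatial_attention(feat):
    """feat: (C, H, W) feature map -> attention map (H, W), pooled vector (C,)."""
    C, H, W = feat.shape
    # Assumed scoring rule: one score per location, the channel-wise mean.
    scores = feat.mean(axis=0).reshape(-1)
    scores = scores - scores.max()                  # numerical stability
    attn = np.exp(scores) / np.exp(scores).sum()    # softmax over H*W locations
    pooled = feat.reshape(C, -1) @ attn             # attention-weighted feature average
    return attn.reshape(H, W), pooled

def policy(feat, joints, params, horizon, dof):
    """Map image features and joint state to a (horizon, dof) block of joint actions."""
    _, pooled = spatial_attention(feat)
    proprio = np.tanh(params["W_p"] @ joints)       # proprioception embedding
    z = np.concatenate([pooled, proprio])           # fused vision + proprioception
    return (params["W_out"] @ z).reshape(horizon, dof)

def bc_loss(pred, demo):
    """Behavior cloning objective, assumed here to be mean squared error."""
    return float(np.mean((pred - demo) ** 2))

# Hypothetical sizes: 8 feature channels on a 4x4 grid, 6 joints, 4 future steps.
C, H, W, DOF, HORIZON, P_DIM = 8, 4, 4, 6, 4, 16
params = {
    "W_p": rng.normal(scale=0.1, size=(P_DIM, DOF)),
    "W_out": rng.normal(scale=0.1, size=(HORIZON * DOF, C + P_DIM)),
}
feat = rng.normal(size=(C, H, W))          # stand-in for encoder output on an RGB frame
joints = rng.normal(size=DOF)              # stand-in for the current joint state
demo = rng.normal(size=(HORIZON, DOF))     # stand-in for demonstrated joint actions

actions = policy(feat, joints, params, HORIZON, DOF)
loss = bc_loss(actions, demo)
```

In a real training loop the parameters would be updated by gradient descent on `loss` over many demonstration pairs. The attention map returned by `spatial_attention` is the kind of quantity that can be visualized alongside the input image, as in the qualitative results of Figure 3.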

Paper Structure (27 sections, 2 equations, 11 figures)

Figures (11)

  • Figure 1: Learning pipeline diagram of our approach (AVIL) for spoon scooping in RAF.
  • Figure 2: Proposed visual imitation network.
  • Figure 3: Qualitative results of images with various bowl configurations, positions, food types, and with distractors on the table along with their corresponding spatial attention maps.
  • Figure 4: The experimental setup, which includes a UR5e robot arm, a custom-designed spoon attachment, and a stationary RealSense camera. P1, P2, P3 denote different bowl positions on the table.
  • Figure 5: Different bowl configurations exhibit variations in material, size, and color. TG denotes transparent glass, PS denotes plastic small, PM denotes plastic medium, and PL denotes plastic large.
  • ...and 6 more figures