Skills Made to Order: Efficient Acquisition of Robot Cooking Skills Guided by Multiple Forms of Internet Data

Mrinal Verghese, Christopher Atkeson

TL;DR

The results show that LLMs are surprisingly capable template selectors despite their lack of visual information, optical flow encoding significantly outperforms video encoders trained with an order of magnitude more data, and important synergies exist between various forms of internet data for template selection.

Abstract

This study explores the utility of various internet data sources to select among a set of template robot behaviors to perform skills. Learning contact-rich skills involving tool use from internet data sources has typically been challenging due to the lack of physical information such as contact existence, location, areas, and force in these data. Prior works have generally used internet data and foundation models trained on these data to generate low-level robot behavior. We hypothesize that these data and models may be better suited to selecting among a set of basic robot behaviors to perform these contact-rich skills. We explore three methods of template selection: querying large language models, comparing video of robot execution to retrieved human video using features from a pretrained video encoder common in prior work, and performing the same comparison using features from an optical flow encoder trained on internet data. Our results show that LLMs are surprisingly capable template selectors despite their lack of visual information, optical flow encoding significantly outperforms video encoders trained with an order of magnitude more data, and important synergies exist between various forms of internet data for template selection. By exploiting these synergies, we create a template selector using multiple forms of internet data that achieves a 79% success rate on a set of 16 different cooking skills involving tool use.
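The combined selection idea in the abstract (a language model narrows the set of templates, and video features then rank the candidates) might be sketched as follows. This is a minimal illustration, not the authors' implementation: the cosine-similarity scoring, the averaging over human videos, and every name here (`select_template`, `llm_candidates`, the template names, the toy feature vectors) are assumptions. The feature vectors stand in for the output of an optical flow or video encoder.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero feature vector as uninformative
    return dot / (norm_a * norm_b)


def select_template(
    llm_candidates: List[str],
    robot_features: Dict[str, Sequence[float]],
    human_features: List[Sequence[float]],
) -> str:
    """Pick the candidate whose execution video looks most like the human videos.

    llm_candidates: template names shortlisted by a language model (assumed given).
    robot_features: one feature vector per executed template (hypothetical encoder output).
    human_features: feature vectors of retrieved human videos of the same skill.
    """
    if not llm_candidates or not human_features:
        raise ValueError("need at least one candidate and one human video")
    best_name = llm_candidates[0]
    best_score = -math.inf
    for name in llm_candidates:
        # Assumed scoring rule: mean similarity to all retrieved human videos.
        score = sum(
            cosine_similarity(robot_features[name], h) for h in human_features
        ) / len(human_features)
        if score > best_score:
            best_name, best_score = name, score
    return best_name


# Toy usage with made-up 2-D features and hypothetical template names.
robot_features = {"saw": [1.0, 0.0], "press": [0.0, 1.0], "sweep": [0.7, 0.7]}
human_features = [[0.9, 0.1], [1.0, 0.2]]
chosen = select_template(["saw", "press"], robot_features, human_features)
```

In this toy case the shortlist excludes `sweep` before any video comparison happens, which mirrors the idea of using the language model to restrict which templates are executed at all.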

Paper Structure (19 sections, 1 equation, 4 figures, 1 table)


Figures (4)

  • Figure 1: Visualization of 4 different cooking skills: cutting a bell pepper, peeling a carrot, scraping a cutting board, and stirring a pan. Each of these skills was performed by a template selected by our best-performing approach, combining both language and learned optic flow features from video.
  • Figure 2: Overview of three different approaches for using internet data and models to select templates. Template selection via LLM is shown in red. The LLM is also used to select a set of candidate templates, which are then executed; the video of each execution is compared to retrieved videos of humans performing the skill using pretrained video encoders (green) or a learned optic flow encoder (blue).
  • Figure 3: Progress of the wiping (left) and scraping (right) skill execution for various template selection methods. The X-axis is the normalized episode length. The Y-axis is task progress. Task progress is determined for wiping as the percentage of the recipient object that the cloth has wiped in pixel space. Task progress is determined for scraping as the amount of orange pepper at the edge of the cutting board in pixel space.
  • Figure 4: Quality scores from human evaluators for each type of skill. Human evaluators were asked to rate how well the template chosen by each approach performed the skill. The approaches tested were an LLM-based template selector, template selection by comparing executed templates to human video using features from a pretrained video encoder (LaViLa), the same comparison using learned optic flow features, and a combination of LLM and flow.
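
The pixel-space progress measure described for wiping in Figure 3 could be computed along the following lines. This is a hedged sketch only: the source does not say how the masks are obtained or whether coverage is accumulated over frames, so the set-of-pixels representation, the cumulative rule, and the function name `wiping_progress` are all assumptions. The result is a fraction between 0 and 1 rather than a percentage.

```python
from typing import Iterable, List, Set, Tuple

Pixel = Tuple[int, int]  # (row, column) in image coordinates


def wiping_progress(
    object_mask: Set[Pixel], cloth_masks: Iterable[Set[Pixel]]
) -> List[float]:
    """Fraction of the recipient object's pixels touched by the cloth so far.

    object_mask: pixels belonging to the object being wiped (assumed static).
    cloth_masks: per-frame pixels covered by the cloth, in time order.
    """
    if not object_mask:
        raise ValueError("object_mask must contain at least one pixel")
    wiped: Set[Pixel] = set()
    progress: List[float] = []
    for cloth in cloth_masks:
        # Only cloth pixels that overlap the object count as wiped area.
        wiped |= cloth & object_mask
        progress.append(len(wiped) / len(object_mask))
    return progress


# Toy usage: a 2x2 object and three frames of cloth positions.
obj = {(0, 0), (0, 1), (1, 0), (1, 1)}
frames = [{(0, 0)}, {(0, 1), (5, 5)}, {(0, 0)}]
curve = wiping_progress(obj, frames)  # one progress value per frame
```

Because coverage is accumulated, the curve never decreases, so revisiting an already wiped region adds nothing.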