Scaling Robot Policy Learning via Zero-Shot Labeling with Foundation Models

Nils Blank, Moritz Reuss, Marcel Rühle, Ömer Erdinç Yağmurlu, Fabian Wenzel, Oier Mees, Rudolf Lioutikov

TL;DR

Evaluations show that NILS can autonomously annotate diverse robot demonstrations of unlabeled and unstructured datasets while alleviating several shortcomings of crowdsourced human annotations, such as low data quality and diversity.

Abstract

A central challenge in developing robots that can relate human language to their perception and actions is the scarcity of natural language annotations in diverse robot datasets. Moreover, robot policies that follow natural language instructions are typically trained on either templated language or expensive human-labeled instructions, which hinders their scalability. To this end, we introduce NILS: Natural language Instruction Labeling for Scalability. NILS automatically labels uncurated, long-horizon robot data at scale in a zero-shot manner without any human intervention. NILS combines pretrained vision-language foundation models to detect objects in a scene, detect object-centric changes, segment tasks from large datasets of unlabeled interaction data, and ultimately label behavior datasets. Evaluations on BridgeV2, Fractal, and a kitchen play dataset show that NILS can autonomously annotate diverse robot demonstrations of unlabeled and unstructured datasets while alleviating several shortcomings of crowdsourced human annotations, such as low data quality and diversity. We use NILS to label over 115k trajectories obtained from over 430 hours of robot data. We open-source our auto-labeling code and generated annotations on our website: http://robottasklabeling.github.io.

Paper Structure

This paper contains 42 sections, 1 equation, 20 figures, 10 tables.

Figures (20)

  • Figure 1: A framework to label long-horizon robot demonstrations without human annotations or model training from RGB videos. NILS leverages an ensemble of frozen pretrained models to segment and annotate uncurated, long-horizon demonstrations. The resulting labeled and segmented dataset can be used to train language-conditioned policies without human annotation.
  • Figure 2: Overview of the proposed NILS framework for labeling long-horizon robot play sequences in a zero-shot manner using an ensemble of pretrained expert models. NILS consists of three stages. In Stage 1, all relevant objects in the video are detected. In Stage 2, object-centric changes are detected and collected. In Stage 3, the object change information is used to detect keystates, and an LLM is prompted to generate a language label for the task.
  • Figure 3: Overview of the environments used in our experiments. From left to right: toy kitchen setup, two scenes from the BridgeV2 dataset [walke2024bridgedata], one example task from the Simpler Eval [li24simpler], and one scene from Fractal [brohan2022rt].
  • Figure 4: Keystate accuracy for different frame distance tolerances on Kitchen Play and BridgeV2. We report the precision and recall of our method at two different keystate thresholds. NILS generates relevant keystates on both datasets and surpasses both baselines.
  • Figure 5: Static object detection refinement. (a) shows the initial noisy labels. The boxes are then filtered by removing statistical outliers (b) and by obtaining the highest confidence cluster (c). The final averaged box is visible in (d).
  • ...and 15 more figures
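
The Figure 4 caption reports keystate precision and recall under a frame distance tolerance. The exact matching protocol is not given here, so the following is only a minimal sketch of one plausible reading: a predicted keystate counts as correct if it lies within `tolerance` frames of a ground-truth keystate that has not already been matched. The function name `keystate_precision_recall`, the greedy one-to-one matching, and the example frame indices are all assumptions made for illustration, not the authors' evaluation code.

```python
from typing import List, Sequence, Tuple


def keystate_precision_recall(
    predicted: Sequence[int],
    ground_truth: Sequence[int],
    tolerance: int,
) -> Tuple[float, float]:
    """Precision and recall of predicted keystate frames (hypothetical helper).

    Assumption: each ground-truth keystate can be matched by at most one
    prediction, and predictions are matched greedily in temporal order to
    the closest unmatched ground-truth frame within `tolerance` frames.
    """
    unmatched: List[int] = sorted(ground_truth)
    true_positives = 0

    for frame in sorted(predicted):
        # Ground-truth frames that are still free and close enough.
        candidates = [gt for gt in unmatched if abs(gt - frame) <= tolerance]
        if candidates:
            closest = min(candidates, key=lambda gt: abs(gt - frame))
            unmatched.remove(closest)
            true_positives += 1

    # Guard against empty inputs to avoid division by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall


if __name__ == "__main__":
    # Made-up frame indices, used only to exercise the function.
    predicted_frames = [10, 52, 90, 200]
    annotated_frames = [12, 50, 120]
    for tol in (5, 30):
        p, r = keystate_precision_recall(predicted_frames, annotated_frames, tol)
        print(f"tolerance={tol}: precision={p:.2f}, recall={r:.2f}")
```

With the made-up frames above, a tolerance of 5 frames matches two of the four predictions (precision 0.5, recall 2/3), while a tolerance of 30 frames also admits the prediction at frame 90, which is why curves of this kind rise as the tolerance grows. Greedy matching is simple but not guaranteed to be optimal; an assignment-based matching would be an alternative design choice.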