LeLaN: Learning A Language-Conditioned Navigation Policy from In-the-Wild Videos

Noriaki Hirose, Catherine Glossop, Ajay Sridhar, Dhruv Shah, Oier Mees, Sergey Levine

TL;DR

LeLaN (Learning Language-conditioned Navigation policy) is a novel approach that consumes unlabeled, action-free egocentric data to learn scalable, language-conditioned object navigation. It outperforms state-of-the-art robot navigation methods while running inference at 4 times their speed on edge compute.

Abstract

The world is filled with a wide variety of objects. For robots to be useful, they need the ability to find arbitrary objects described by people. In this paper, we present LeLaN (Learning Language-conditioned Navigation policy), a novel approach that consumes unlabeled, action-free egocentric data to learn scalable, language-conditioned object navigation. Our framework, LeLaN, leverages the semantic knowledge of large vision-language models, as well as robotic foundation models, to label in-the-wild data from a variety of indoor and outdoor environments. We label over 130 hours of data collected in real-world indoor and outdoor environments, including robot observations, YouTube video tours, and human walking data. Extensive experiments with over 1000 real-world trials show that our approach enables training a policy from unlabeled, action-free videos that outperforms state-of-the-art robot navigation methods while running inference at 4 times their speed on edge compute. We open-source our models and datasets, and provide supplementary videos on our project page (https://learning-language-navigation.github.io/).

Paper Structure (29 sections, 4 equations, 22 figures, 5 tables)

Figures (22)

  • Figure 1: LeLaN leverages foundation models to label in-the-wild video data with language instructions for object navigation. We train a state-of-the-art robot policy on this data for solving challenging zero-shot language-conditioned object navigation tasks across a variety of indoor and outdoor environments.
  • Figure 2: Data Annotation. To generate diverse language instructions for object navigation, we pass generic egocentric image observations through a series of large pre-trained model filters that extract object bounding boxes and masks. We then use a VLM to describe the object(s) in the bounding boxes, and use an LLM to produce several diverse object navigation labels.
  • Figure 3: Visualization of LeLaN performance. We conduct each experiment with 5 different prompts (right side) to visualize the robustness of LeLaN against noisy prompts. Our policy can navigate the robot toward the target object along very similar trajectories, showing its performance is highly reproducible.
  • Figure 4: Overview of cross-embodiment evaluation. We conduct three types of experiments to evaluate how well our policy generalizes: [a] quadruped robot with PCB-mounted fisheye camera, [b] PCB-mounted fisheye camera, [c] canonical camera, and [d] spherical camera at a higher pose.
  • Figure 5: Data Ablation. We ablate each dataset in the training data mixture in turn, while keeping all of the other datasets in the mixture.
  • ...and 17 more figures
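
The annotation flow described in the Figure 2 caption (detect objects, describe each crop with a VLM, expand the description into several instructions with an LLM) can be sketched roughly as follows. This is only an illustrative outline, not the authors' implementation: the names `Detection`, `LabeledSample`, `annotate_frame`, the `min_score` threshold, and the three `toy_*` stand-ins are all hypothetical, and the segmentation masks mentioned in the caption are omitted for brevity. Real detectors, VLMs, and LLMs would be plugged in as the three callables.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple

BBox = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels


@dataclass(frozen=True)
class Detection:
    bbox: BBox
    score: float  # detector confidence in [0, 1] (assumed convention)


@dataclass(frozen=True)
class LabeledSample:
    frame_id: str
    bbox: BBox
    instructions: List[str]


def annotate_frame(
    frame_id: str,
    image: Any,
    detect: Callable[[Any], Sequence[Detection]],
    describe: Callable[[Any, BBox], str],
    paraphrase: Callable[[str, int], Sequence[str]],
    min_score: float = 0.5,  # hypothetical filter threshold
    n_prompts: int = 3,
) -> List[LabeledSample]:
    """Turn one egocentric frame into language-labeled object targets."""
    samples: List[LabeledSample] = []
    for det in detect(image):
        if det.score < min_score:
            continue  # drop low-confidence boxes before captioning
        caption = describe(image, det.bbox)        # VLM stand-in
        prompts = paraphrase(caption, n_prompts)   # LLM stand-in
        samples.append(LabeledSample(frame_id, det.bbox, list(prompts)))
    return samples


# Toy stand-ins so the sketch runs without any model; all hypothetical.
def toy_detect(image: Any) -> List[Detection]:
    return [Detection((10, 20, 50, 80), 0.9), Detection((0, 0, 5, 5), 0.2)]


def toy_describe(image: Any, bbox: BBox) -> str:
    return "red chair"


def toy_paraphrase(caption: str, n: int) -> List[str]:
    templates = ["go to the {}", "navigate to the {}", "find the {}"]
    return [t.format(caption) for t in templates[:n]]


if __name__ == "__main__":
    for sample in annotate_frame(
        "frame_0000", None, toy_detect, toy_describe, toy_paraphrase
    ):
        print(sample.frame_id, sample.bbox, sample.instructions)
```

Passing the three model stages in as plain callables keeps the sketch independent of any particular detector, VLM, or LLM, so each stage can be swapped without touching the loop.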