LeLaN: Learning A Language-Conditioned Navigation Policy from In-the-Wild Videos

Noriaki Hirose; Catherine Glossop; Ajay Sridhar; Dhruv Shah; Oier Mees; Sergey Levine

TL;DR

LeLaN (Learning Language-conditioned Navigation policy) is a novel approach that consumes unlabeled, action-free egocentric data to learn scalable, language-conditioned object navigation. It outperforms state-of-the-art robot navigation methods while running inference at 4 times their speed on edge compute.

Abstract

The world is filled with a wide variety of objects. For robots to be useful, they need the ability to find arbitrary objects described by people. In this paper, we present LeLaN (Learning Language-conditioned Navigation policy), a novel approach that consumes unlabeled, action-free egocentric data to learn scalable, language-conditioned object navigation. Our framework, LeLaN, leverages the semantic knowledge of large vision-language models, as well as robotic foundation models, to label in-the-wild data from a variety of indoor and outdoor environments. We label over 130 hours of data collected in real-world indoor and outdoor environments, including robot observations, YouTube video tours, and human walking data. Extensive experiments with over 1000 real-world trials show that our approach enables training a policy from unlabeled, action-free videos that outperforms state-of-the-art robot navigation methods, while being capable of inference at 4 times their speed on edge compute. We open-source our models and datasets, and provide supplementary videos on our project page (https://learning-language-navigation.github.io/).

Paper Structure (29 sections, 4 equations, 22 figures, 5 tables)

Introduction
Related Works
Language-Conditioned Navigation from In-the-Wild Videos
Labeling In-the-Wild Videos with Foundation Models
Policy Architecture & Training
Training Data
Experiments
Evaluation on Diverse Language Instructions
Capability Analysis on Challenging Settings
Cross Embodiment Analysis
Data Ablations
Conclusion
Collision Avoidance with Supervision from Robotic Foundation Model
Breakdown of quantitative experiments
Model Ablations
...and 14 more sections

Figures (22)

Figure 1: LeLaN leverages foundation models to label in-the-wild video data with language instructions for object navigation. We train a state-of-the-art robot policy on this data for solving challenging zero-shot language-conditioned object navigation tasks across a variety of indoor and outdoor environments.
Figure 2: Data Annotation. To generate diverse language instructions for object navigation, we pass generic egocentric image observations through a series of large pre-trained model filters that extract object bounding boxes and masks. We then use a VLM to describe the object(s) in the bounding boxes, and use an LLM to produce several diverse object navigation labels.
Figure 3: Visualization of LeLaN performance. We conduct each experiment with 5 different prompts (right side) to visualize the robustness of LeLaN against noisy prompts. Our policy can navigate the robot toward the target object along very similar trajectories, showing its performance is highly reproducible.
Figure 4: Overview of cross-embodiment evaluation. We conduct three types of experiments to evaluate how well our policy generalizes: [a] a quadruped robot with a PCB-mounted fisheye camera, [b] a PCB-mounted fisheye camera, [c] a canonical camera, and [d] a spherical camera at a higher pose.
Figure 5: Data Ablation. We ablate each dataset in the training data mixture in turn, while keeping all of the other datasets in the mixture.
...and 17 more figures
