CityWalker: Learning Embodied Urban Navigation from Web-Scale Videos

Xinhao Liu; Jintong Li; Yicheng Jiang; Niranjan Sujay; Zhicheng Yang; Juexiao Zhang; John Abanes; Jing Zhang; Chen Feng

TL;DR

CityWalker tackles embodied urban navigation in dynamic, map-free environments by leveraging thousands of hours of web-scale in-the-wild city walking and driving videos to learn imitation policies. Action supervision is obtained via visual odometry labels, enabling scalable learning without manual annotation, and the model uses a transformer-based pipeline with a frozen vision backbone and coordinate embeddings. Experiments show that data scale yields substantial gains, cross-domain data improves robustness, and CityWalker outperforms state-of-the-art baselines in offline and real-world tests. These results demonstrate the viability of leveraging abundant online urban videos to develop robust, scalable embodied navigation policies for robots operating in real cities.

Abstract

Navigating dynamic urban environments presents significant challenges for embodied agents, requiring advanced spatial reasoning and adherence to common-sense norms. Despite progress, existing visual navigation methods struggle in map-free or off-street settings, limiting the deployment of autonomous agents like last-mile delivery robots. To overcome these obstacles, we propose a scalable, data-driven approach for human-like urban navigation by training agents on thousands of hours of in-the-wild city walking and driving videos sourced from the web. We introduce a simple and scalable data processing pipeline that extracts action supervision from these videos, enabling large-scale imitation learning without costly annotations. Our model learns sophisticated navigation policies to handle diverse challenges and critical scenarios. Experimental results show that training on large-scale, diverse datasets significantly enhances navigation performance, surpassing current methods. This work shows the potential of using abundant online video data to develop robust navigation policies for embodied agents in dynamic urban settings. The project homepage is at https://ai4ce.github.io/CityWalker/.
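
As an illustration of how action supervision might be derived from visual odometry output, the following minimal sketch converts a camera trajectory into future waypoints expressed in the walker's current frame. It assumes a planar (2D) trajectory with a known heading per frame; the function name `ego_waypoints`, the array layout, and the `horizon` parameter are hypothetical and are not taken from the paper's pipeline.

```python
import numpy as np


def ego_waypoints(xy, yaw, t, horizon):
    """Return the next `horizon` positions expressed in the frame of pose t.

    xy:  (N, 2) array of world-frame positions (assumed to come from visual odometry)
    yaw: (N,) array of headings in radians, measured from the world x-axis
    """
    # World-frame offsets from the current position to each future position
    future = xy[t + 1 : t + 1 + horizon] - xy[t]
    c, s = np.cos(yaw[t]), np.sin(yaw[t])
    # Rotation by -yaw, so the ego x-axis points along the current heading
    rot = np.array([[c, s], [-s, c]])
    return future @ rot.T


# Hypothetical trajectory: walking along the world +y axis while facing +y
xy = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, 2.0]])
yaw = np.full(3, np.pi / 2)
labels = ego_waypoints(xy, yaw, t=0, horizon=2)
```

In this toy trajectory both waypoints lie straight ahead, so their ego-frame coordinates fall along the forward axis. Labels of this form could serve as regression targets for an imitation policy, although the representation actually used by the authors is not specified in this section.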
