CityWalker: Learning Embodied Urban Navigation from Web-Scale Videos

Xinhao Liu, Jintong Li, Yicheng Jiang, Niranjan Sujay, Zhicheng Yang, Juexiao Zhang, John Abanes, Jing Zhang, Chen Feng

TL;DR

CityWalker tackles embodied urban navigation in dynamic, map-free environments by leveraging thousands of hours of web-scale in-the-wild city walking and driving videos to learn imitation policies. Action supervision is obtained via visual odometry labels, enabling scalable learning without manual annotation, and the model uses a transformer-based pipeline with a frozen vision backbone and coordinate embeddings. Experiments show that data scale yields substantial gains, cross-domain data improves robustness, and CityWalker outperforms state-of-the-art baselines in offline and real-world tests. These results demonstrate the viability of leveraging abundant online urban videos to develop robust, scalable embodied navigation policies for robots operating in real cities.

Abstract

Navigating dynamic urban environments presents significant challenges for embodied agents, requiring advanced spatial reasoning and adherence to common-sense norms. Despite progress, existing visual navigation methods struggle in map-free or off-street settings, limiting the deployment of autonomous agents like last-mile delivery robots. To overcome these obstacles, we propose a scalable, data-driven approach for human-like urban navigation by training agents on thousands of hours of in-the-wild city walking and driving videos sourced from the web. We introduce a simple and scalable data processing pipeline that extracts action supervision from these videos, enabling large-scale imitation learning without costly annotations. Our model learns sophisticated navigation policies to handle diverse challenges and critical scenarios. Experimental results show that training on large-scale, diverse datasets significantly enhances navigation performance, surpassing current methods. This work shows the potential of using abundant online video data to develop robust navigation policies for embodied agents in dynamic urban settings. Project homepage is at https://ai4ce.github.io/CityWalker/.

Paper Structure

This paper contains 16 sections, 4 equations, 10 figures, 8 tables.

Figures (10)

  • Figure 1: Embodied Urban Navigation. Navigating urban spaces is challenging for mobile agents, especially off-street ones. The differently colored pins along the route highlight critical scenarios unique to complex and dynamic urban landscapes. Thumbnails on the right, marked with the corresponding colored pins, show real-world observations of these challenging cases. Our CityWalker model is trained with over 2000 hours of city walking videos and fine-tuned with a small amount of expert data to address these challenges effectively.
  • Figure 2: Overall Illustration of CityWalker. Our training pipeline starts with internet-sourced videos, using visual odometry to obtain relative poses between frames. At each time step, the model receives past observations, the past trajectory, and the target location as input. These are encoded via a frozen image encoder and a trainable coordinate encoder. A transformer processes these inputs to generate future tokens. An action head and an arrival head decode these tokens into action and arrival status predictions. During training, tokens encoded from the actual future frames guide the transformer to hallucinate future tokens.
  • Figure 3: Evaluation Metrics. Left: The orientation error is defined as the angle between each predicted and ground truth action pair, labeled $\theta_n$ in the figure. The action angle $\varphi_{\text{action}}$ and target angle $\varphi_{\text{target}}$ are defined with respect to the positive y-axis. Right: Both the green and red trajectories are predicted actions. The green trajectory is the preferred one, having a large L2 distance but a small AOE. The red trajectory is the opposite, having a small L2 distance but a large AOE.
  • Figure 4: Data Sample and Visual Odometry (VO) Result. Our internet-sourced training data includes both walking and driving videos. These videos cover various scenarios in the urban environment. The VO tool gives trajectories that are noisy globally but trustworthy as local relative poses within a short time period.
  • Figure 5: Qualitative Results. The left image shows the current observations of two samples. The right plots display the input trajectory, ground truth actions, and predicted actions in the current coordinate system with the agent at the origin.
  • ...and 5 more figures
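
The orientation error described in the Figure 3 caption can be illustrated with a minimal sketch. It assumes each action is a 2D displacement vector (x, y) in the agent's current coordinate system, and it reads AOE as the mean of the per-step orientation errors $\theta_n$; the caption does not spell out either detail. The function names are hypothetical and not taken from the CityWalker code.

```python
import math


def orientation_error(pred, gt):
    """Angle in radians between a predicted and a ground-truth 2D action vector."""
    px, py = pred
    gx, gy = gt
    norm = math.hypot(px, py) * math.hypot(gx, gy)
    if norm == 0.0:
        # Assumption: a zero-length action has no direction, so count no error.
        return 0.0
    cos_theta = (px * gx + py * gy) / norm
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    return math.acos(max(-1.0, min(1.0, cos_theta)))


def angle_from_y_axis(vec):
    """Signed angle in radians measured from the positive y-axis.

    Assumption: positive values rotate toward the positive x-axis.
    """
    x, y = vec
    return math.atan2(x, y)


def mean_orientation_error(preds, gts):
    """Average of the per-step orientation errors over a trajectory."""
    errors = [orientation_error(p, g) for p, g in zip(preds, gts)]
    return sum(errors) / len(errors)
```

Under these assumptions, a prediction that points in the right direction but overshoots, such as (0, 2) against a ground truth of (0, 5), has zero orientation error despite a nonzero L2 distance. This mirrors the green trajectory in Figure 3.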