From Seeing to Experiencing: Scaling Navigation Foundation Models with Reinforcement Learning

Honglin He, Yukai Ma, Wayne Wu, Bolei Zhou

TL;DR

Offline navigation foundation models gain broad perceptual priors from massive video data but struggle with interactive, safe real-world operation. S2E combines Anchor-Guided Distribution Matching pretraining with RL post-training through a Residual-Attention Module, injecting reactive behaviors while preserving pretrained knowledge. Evaluations on NavBench-GS and in the real world show that RL-based post-training yields significant gains in performance and robustness beyond those of offline data scaling, and achieves zero-shot generalization across embodiments. This work highlights the critical role of interactive online experiences for scaling robotics foundation models in urban navigation and beyond.

Abstract

Navigation foundation models trained on massive web-scale data enable agents to generalize across diverse environments and embodiments. However, these models, trained solely on offline data, often lack the capacity to reason about the consequences of their actions or to adapt through counterfactual understanding. They thus face significant limitations in real-world urban navigation, where interactive and safe behaviors, such as avoiding obstacles and moving pedestrians, are critical. To tackle these challenges, we introduce the Seeing-to-Experiencing (S2E) framework to scale the capability of navigation foundation models with reinforcement learning (RL). S2E combines the strengths of pre-training on videos and post-training through RL. It maintains the generalizability acquired from large-scale real-world videos while enhancing its interactivity through RL in simulation environments. Specifically, we introduce two innovations: an Anchor-Guided Distribution Matching strategy, which stabilizes learning and models diverse motion patterns through anchor-based supervision; and a Residual-Attention Module, which obtains reactive behaviors from simulation environments without erasing the model's pretrained knowledge. Moreover, we establish a comprehensive end-to-end evaluation benchmark, NavBench-GS, built on photorealistic 3DGS reconstructions of real-world scenes that incorporate physical interactions. It can systematically assess the generalizability and safety of navigation foundation models. Extensive experiments show that S2E mitigates the diminishing returns often seen when scaling with offline data alone. We perform a thorough analysis of the benefits of reinforcement learning compared to supervised fine-tuning in the context of post-training for robot learning. Our findings emphasize the crucial role of integrating interactive online experiences to effectively scale foundation models in robotics.

Paper Structure

This paper contains 37 sections, 6 equations, 19 figures, 11 tables.

Figures (19)

  • Figure 1: Real-world deployments of S2E. S2E achieves zero-shot generalization across environments and embodiments. We demonstrate its effectiveness on wheeled and quadruped robots in diverse urban scenarios.
  • Figure 2: Motivation. Like humans, AI agents must go through interactive practice and learn from feedback to obtain actionable skills.
  • Figure 2: Real-world results.
  • Figure 3: Illustration of the S2E framework. The model receives continuous RGB frames and the target position as context information and uses pre-defined embodiment-agnostic anchors as queries for prediction. First, context embeddings are integrated via a self-attention module. The outputs are then used as keys (K) and values (V), while the anchor features $\boldsymbol{f}_{\mathcal{P}}$ serve as queries (Q). RAM blocks then compute weighted features from K and V based on the anchor queries Q and produce refined anchor features. A classification head and a regression head decode the anchor features to predict scores and normalized trajectories with a velocity scale. In the pretraining stage, the model is trained end-to-end with NLL and regression losses (Equation \ref{eq:nll}). In the finetuning stage, only the parameters within the RAM blocks are optimized with policy gradient.
  • Figure 3: Effectiveness of RAM.
  • ...and 14 more figures
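
The anchor-query decoding described in the Figure 3 caption can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: a plain single-head cross-attention with a residual connection stands in for a RAM block (whose internal design is not given here), and all sizes, weight names, the random inputs, and the exponential parameterization of the velocity scale are assumptions made for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(q, k, v):
    """Single-head scaled dot-product attention: anchor queries attend to context."""
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return weights @ v


rng = np.random.default_rng(0)

# Illustrative sizes (hypothetical, not taken from the paper).
num_anchors, num_ctx, dim, horizon = 8, 5, 16, 4

# Random stand-ins for the fused context embeddings (RGB frames + target
# position after self-attention) and for the anchor features f_P.
context = rng.normal(size=(num_ctx, dim))
anchors = rng.normal(size=(num_anchors, dim))

# Hypothetical projection and head weights.
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))
w_cls = rng.normal(size=(dim, 1))
w_reg = rng.normal(size=(dim, horizon * 2 + 1))

# Anchors act as queries (Q); context supplies keys (K) and values (V).
# The residual connection keeps the incoming anchor features and adds the
# attended context on top of them.
refined = anchors + cross_attention(anchors @ w_q, context @ w_k, context @ w_v)

# Classification head: one score per anchor, normalized to probabilities.
probs = softmax((refined @ w_cls).squeeze(-1))

# Regression head: a normalized 2D trajectory per anchor plus one scalar
# that is mapped to a positive velocity scale (exp is an assumption).
reg = refined @ w_reg
trajectories = reg[:, :-1].reshape(num_anchors, horizon, 2)
velocity_scale = np.exp(reg[:, -1])

# Pick the highest-scoring anchor and rescale its normalized trajectory.
best = int(np.argmax(probs))
plan = trajectories[best] * velocity_scale[best]
```

In this sketch, finetuning only the attention block would correspond to updating `w_q`, `w_k`, and `w_v` while leaving the other weights fixed; how the actual RAM blocks separate pretrained from newly trained parameters is described in the paper itself.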