NOLO: Navigate Only Look Once

Bohan Zhou, Zhongbin Zhang, Jiangxing Wang, Zongqing Lu

TL;DR

Navigate Only Look Once (NOLO) is a method for learning an in-context navigation policy that adapts to a new scene by taking a context video of that scene as input, without finetuning or re-training.

Abstract

The in-context learning ability of Transformer models has brought new possibilities to visual navigation. In this paper, we focus on the video navigation setting, where an in-context navigation policy needs to be learned purely from videos in an offline manner, without access to the actual environment. For this setting, we propose Navigate Only Look Once (NOLO), a method for learning a navigation policy that possesses the in-context ability and adapts to new scenes by taking corresponding context videos as input without finetuning or re-training. To enable learning from videos, we first propose a pseudo action labeling procedure using optical flow to recover the action label from egocentric videos. Then, offline reinforcement learning is applied to learn the navigation policy. Through extensive experiments on different scenes both in simulation and the real world, we show that our algorithm outperforms baselines by a large margin, which demonstrates the in-context learning ability of the learned policy. For videos and more information, visit https://sites.google.com/view/nol0.
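
The pseudo action labeling step described above can be illustrated with a small sketch. This is not the authors' procedure: it assumes the optical flow between two adjacent frames is already available as a list of per-pixel displacements, and the function name `label_pseudo_action`, the three action names, the `top_fraction` and `turn_threshold` parameters, and the sign convention (scene content shifting right in the image means the camera turned left) are all hypothetical choices made for illustration.

```python
import math


def label_pseudo_action(flow, top_fraction=0.2, turn_threshold=1.0):
    """Guess a discrete action from a flow field (illustrative only).

    flow: list of (dx, dy) pixel displacements between two adjacent frames.
    top_fraction, turn_threshold: hypothetical tuning knobs, not from the paper.
    """
    if not flow:
        raise ValueError("flow must contain at least one vector")

    # Keep only the largest-magnitude vectors as the "dominant" ones.
    ranked = sorted(flow, key=lambda v: math.hypot(v[0], v[1]), reverse=True)
    keep = max(1, int(len(ranked) * top_fraction))
    dominant = ranked[:keep]

    # Assumed convention: a mostly rightward shift of the scene means the
    # camera rotated left, and a mostly leftward shift means it rotated right.
    mean_dx = sum(dx for dx, _ in dominant) / keep
    if mean_dx > turn_threshold:
        return "turn_left"
    if mean_dx < -turn_threshold:
        return "turn_right"

    # Forward motion yields roughly radial flow whose horizontal parts cancel.
    return "move_forward"
```

Applying such a function to every adjacent frame pair of a video with $T$ frames would yield $T-1$ labels, matching the length of the pseudo action sequence $\{\hat{a}_t\}_1^{T-1}$.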

Paper Structure

This paper contains 32 sections, 5 equations, 11 figures, and 26 tables.

Figures (11)

  • Figure 1: The framework of NOLO. An offline-collected egocentric video is taken by a pretrained GMFlow action decoder to label a pseudo action sequence $\{\hat{a}_t\}_1^{T-1}$. An in-context video navigation policy $\pi_{\theta}(\cdot|g,o_t,\{f_t\}_1^T,\{\hat{a}_t\}_1^{T-1})$, modeled by VN$\circlearrowleft$Bert, is learned to take all context frames $\{f_t\}_1^T$, labeled actions $\{\hat{a}_t\}_1^{T-1}$, a current observation $o_t$, and a goal image $g$, and to generate discrete actions that navigate to the desired target in a novel scene.
  • Figure 2: Two adjacent frames are taken by a pretrained optical flow model to get a flow map. Some representative dominant vectors are filtered for action selection.
  • Figure 3: Structure of VN$\circlearrowleft$Bert. At the initialization stage, a trajectory $\mathcal{T}$ and a zero-padded hidden state $h_\text{init}$ are processed by a bidirectional multi-head self-attention module to obtain the context embedding $e^c$ and the initial hidden state $h_0$. At each timestep after initialization, the current observation $o_t$ and goal frame $g$ are encoded into $e^s_{t}$, which a multi-head cross-attention module takes together with the fixed $e^c$ and the recurrently updated $h_t$ to produce the policy $\pi_\theta$, the Q-value $Q_\omega$, and the terminal signal $\delta_\upsilon$. The red circle indicates additional aggregation that fuses features via an element-wise product, as in VLN-Bert.
  • Figure 4: Average SR and SPL of cross domain evaluation. H2R means training in Habitat and testing in RoboTHOR, and vice versa.
  • Figure 5: Visualizations of six video navigation tasks during in-context policy deployment. The leftmost bird's-eye-view topological maps depict the robot positions corresponding to the key observations to their right, and the second column displays the goal images.
  • ...and 6 more figures
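
Figure 4 reports SR and SPL without defining them here. As a hedged illustration, the sketch below assumes the definitions commonly used in embodied navigation benchmarks: SR is the fraction of successful episodes, and SPL (Success weighted by Path Length) averages $S_i \cdot l_i / \max(p_i, l_i)$ over episodes, where $S_i$ is the success indicator, $l_i$ the shortest-path length, and $p_i$ the length of the path actually taken. Whether the paper computes the metrics in exactly this way is an assumption, and the function and argument names are illustrative.

```python
def success_rate(successes):
    """Fraction of episodes that reached the goal."""
    return sum(1 for s in successes if s) / len(successes)


def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length, under the commonly used definition.

    Assumes every shortest-path length is positive and that the three
    sequences have one entry per episode.
    """
    total = 0.0
    for ok, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        if ok:
            # A path equal to the shortest one scores 1; longer paths score less.
            total += shortest / max(taken, shortest)
    return total / len(successes)
```

For example, three episodes with outcomes (success, failure, success), shortest-path lengths (2, 3, 4), and travelled lengths (4, 3, 4) contribute 0.5, 0, and 1 respectively, so the SPL is 0.5 while the SR is 2/3.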