OVRL-V2: A simple state-of-art baseline for ImageNav and ObjectNav

Karmesh Yadav; Arjun Majumdar; Ram Ramrakhya; Naoki Yokoyama; Alexei Baevski; Zsolt Kira; Oleksandr Maksymets; Dhruv Batra

TL;DR

This work introduces OVRL-v2, a model-free navigation agent built from task-agnostic components (Vision Transformer, a compression layer, and LSTM) that achieves state-of-the-art results on ImageNav and competitive performance on ObjectNav without mapping or detectors. A key insight is that preserving spatial structure via a compression layer enables ViTs to excel in visual navigation, and self-supervised MAE pretraining unlocks positive scaling across larger ViT architectures. The authors also identify and fix reward-hacking vulnerabilities with a principled, potential-based angle reward, further boosting path efficiency and success rates. Collectively, OVRL-v2 offers a simple, scalable, generalist baseline for embodied AI that outperforms prior single-camera methods and transfers well across tasks.
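The reward-hacking point above can be illustrated with a small sketch. It is not the paper's reward function, whose exact form is not given in this summary. It only shows the general potential-based shaping idea, F(s, s') = gamma * Phi(s') - Phi(s), with an assumed potential Phi(s) = -(angle to goal). The function names, the `naive_reward` comparison, and the heading values are all hypothetical.

```python
import math

def angle_to_goal(heading, goal_heading):
    """Smallest absolute angle (radians) between agent heading and goal heading."""
    diff = (heading - goal_heading + math.pi) % (2 * math.pi) - math.pi
    return abs(diff)

def shaping_reward(prev_heading, heading, goal_heading, gamma=1.0):
    """Potential-based term: gamma * Phi(s') - Phi(s), with Phi = -angle_to_goal (assumed)."""
    phi_prev = -angle_to_goal(prev_heading, goal_heading)
    phi_next = -angle_to_goal(heading, goal_heading)
    return gamma * phi_next - phi_prev

def naive_reward(prev_heading, heading, goal_heading):
    """Hypothetical non-potential reward: pays for improvements, ignores regressions."""
    gain = angle_to_goal(prev_heading, goal_heading) - angle_to_goal(heading, goal_heading)
    return max(0.0, gain)

def episode_return(reward_fn, headings, goal_heading):
    """Sum a per-step reward over consecutive pairs of headings."""
    return sum(reward_fn(a, b, goal_heading) for a, b in zip(headings, headings[1:]))

goal = 0.0
oscillating = [0.5, 0.2, 0.5, 0.2, 0.5]  # turns toward the goal, then away, repeatedly
direct = [0.5, 0.2, 0.0]                 # turns toward the goal and stops

hacked_naive = episode_return(naive_reward, oscillating, goal)
hacked_potential = episode_return(shaping_reward, oscillating, goal)
direct_potential = episode_return(shaping_reward, direct, goal)
```

With gamma = 1 the potential-based terms telescope, so the return depends only on the first and last headings. Oscillating in place earns nothing, whereas the naive variant pays out on every turn toward the goal.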

Abstract

We present a single neural network architecture composed of task-agnostic components (ViTs, convolutions, and LSTMs) that achieves state-of-art results on both the ImageNav ("go to location in <this picture>") and ObjectNav ("find a chair") tasks without any task-specific modules like object detection, segmentation, mapping, or planning modules. Such general-purpose methods offer advantages of simplicity in design, positive scaling with available compute, and versatile applicability to multiple tasks. Our work builds upon the recent success of self-supervised learning (SSL) for pre-training vision transformers (ViT). However, while the training recipes for convolutional networks are mature and robust, the recipes for ViTs are contingent and brittle, and in the case of ViTs for visual navigation, yet to be fully discovered. Specifically, we find that vanilla ViTs do not outperform ResNets on visual navigation. We propose the use of a compression layer operating over ViT patch representations to preserve spatial information along with policy training improvements. These improvements allow us to demonstrate positive scaling laws for the first time in visual navigation tasks. Consequently, our model advances state-of-the-art performance on ImageNav from 54.2% to 82.0% success and performs competitively against concurrent state-of-art on ObjectNav with a success rate of 64.0% vs. 65.0%. Overall, this work does not present a fundamentally new approach, but rather recommendations for training a general-purpose architecture that achieves state-of-art performance today and could serve as a strong baseline for future methods.

Paper Structure (21 sections, 4 equations, 7 figures, 9 tables, 1 algorithm)

Introduction
Related Work
Background: Tasks and Visual Pretraining
Visual Navigation
Masked Autoencoders (MAEs)
Approach
Experimental Findings
Establishing a Strong ImageNav Baseline
Using ViTs in a Visual Navigation Agent
Scaling with and without Visual Pretraining
Comparing Reward Functions
Comparisons with the ImageNav SoTA
Transferring Improvements to ObjectNav
Analysis and Ablations
Conclusion
...and 6 more sections

Figures (7)

Figure 1: OVRL-v2 is a model-free navigator with a ViT+LSTM architecture that achieves SoTA results on ImageNav and ObjectNav without mapping, detectors, or segmentors of any kind.
Figure 2: Visual Navigation Tasks. In ImageNav [zhu2017target] the goal is 'described' by an image and in ObjectNav [batra2020objectnav] the goal is described in words (e.g., 'fridge'). We demonstrate the effectiveness of our 'model-free navigator' (i.e., agent) on both tasks.
Figure 3: Compression Layer. We propose using a compression layer to encode the output patches from a ViT encoder. The inputs to the compression layer are the H$\times$W output patches from the ViT, each of size M, where H and W are the number of patches along the height and width of the image. The patches are reshaped into a grid of size (H, W) and passed through a convolutional layer that compresses the size of each patch representation from M to N. The grid is then flattened and passed to the downstream model.
Figure 4: OVRL-v2 architecture. In our model-free navigator, observations $O_t$ are encoded using a from-scratch or pretrained ViT, then fed to a compression layer (CL) and a fully-connected layer (FC). The output representation is concatenated with a goal embedding and (optionally) a GPS+Compass encoding. Finally, an LSTM-based policy outputs actions $a_t$. In ImageNav, the visual encoder pipeline is replicated and used to encode goal images $O_g$. In ObjectNav, the goal embedding is used to encode categorical object goals (e.g., 'bed').
Figure 5: Reward hacking. With the ZER reward [al2022zero] (orange curve) agents learn to hack the reward, leading to large increases in training reward (left), yet substantial drops in validation path efficiency or SPL (middle) and success rate or SR (right). This undesirable behavior is resolved with the reward function introduced in Eq. \ref{eq:our-reward} (blue curve), and performance steadily increases during training.
...and 2 more figures
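The compression layer described in the Figure 3 caption can be sketched in a few lines of NumPy. This is a minimal illustration of the reshape, convolve, flatten sequence only. The kernel size, zero padding, the random weights, and the sizes H, W, M, N below are assumptions chosen for the example, not values from the paper, and any normalization or activation the authors may apply is omitted.

```python
import numpy as np

def compress_patches(patches, grid_hw, weight, bias):
    """Compress ViT patch tokens while keeping their spatial layout.

    patches: (H*W, M) patch tokens in row-major order (no class token).
    weight:  (N, M, k, k) convolution kernel with odd k (assumed shape).
    bias:    (N,) per-output-channel bias.
    Returns a flat vector of length H*W*N.
    """
    H, W = grid_hw
    n_out, m_in, k, _ = weight.shape
    assert patches.shape == (H * W, m_in)
    # Reshape the token sequence into a channels-first grid of shape (M, H, W).
    grid = patches.reshape(H, W, m_in).transpose(2, 0, 1)
    # Zero-pad the spatial dimensions so the output grid stays (H, W).
    pad = k // 2
    padded = np.pad(grid, ((0, 0), (pad, pad), (pad, pad)))
    out = np.empty((n_out, H, W), dtype=patches.dtype)
    for i in range(H):
        for j in range(W):
            window = padded[:, i:i + k, j:j + k]  # (M, k, k) neighborhood
            # Contract over (M, k, k) to get N output channels at this location.
            out[:, i, j] = np.tensordot(weight, window, axes=3) + bias
    # Flatten the compressed grid for the downstream model.
    return out.reshape(-1)

# Illustrative sizes only.
H, W, M, N = 4, 4, 8, 2
rng = np.random.default_rng(0)
patches = rng.standard_normal((H * W, M))
weight = rng.standard_normal((N, M, 3, 3))
bias = np.zeros(N)
features = compress_patches(patches, (H, W), weight, bias)
```

The per-patch size shrinks from M to N, while the H by W arrangement survives until the final flatten, so each position in the output vector still corresponds to a fixed image location.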
