PoliFormer: Scaling On-Policy RL with Transformers Results in Masterful Navigators

Kuo-Hao Zeng, Zichen Zhang, Kiana Ehsani, Rose Hendrix, Jordi Salvador, Alvaro Herrasti, Ross Girshick, Aniruddha Kembhavi, Luca Weihs

TL;DR

PoliFormer advances scalable on-policy reinforcement learning for long-horizon embodied navigation by introducing a fully transformer-based policy that processes RGB observations through a frozen vision backbone, a transformer state encoder, and a causal decoder with KV-cache for memory. Trained at scale with hundreds of parallel rollouts, it achieves state-of-the-art results across four simulation benchmarks in two robot embodiments and demonstrates strong zero-shot sim-to-real transfer on real-world deployments. A zero-shot extension, PoliFormer-BoxNav, shows promise as a navigation foundation model for downstream tasks such as open-vocabulary and multi-target navigation. These results highlight the potential of large-scale transformer policies in embodied AI and lay groundwork for general-purpose, promptable navigation systems.

Abstract

We present PoliFormer (Policy Transformer), an RGB-only indoor navigation agent trained end-to-end with reinforcement learning at scale that generalizes to the real world without adaptation despite being trained purely in simulation. PoliFormer uses a foundational vision transformer encoder with a causal transformer decoder enabling long-term memory and reasoning. It is trained for hundreds of millions of interactions across diverse environments, leveraging parallelized, multi-machine rollouts for efficient training with high throughput. PoliFormer is a masterful navigator, producing state-of-the-art results across two distinct embodiments, the LoCoBot and Stretch RE-1 robots, and four navigation benchmarks. It breaks through the plateaus of previous work, achieving an unprecedented 85.5% success rate in object goal navigation on the CHORES-S benchmark, a 28.5% absolute improvement. PoliFormer can also be trivially extended to a variety of downstream applications such as object tracking, multi-object navigation, and open-vocabulary navigation with no finetuning.

Paper Structure (21 sections, 6 figures, 4 tables)

Figures (6)

  • Figure 1: PoliFormer, a transformer-based policy trained using RL at scale in simulation, achieves significant performance improvements in simulation (bottom-left) and the real world (bottom-right), across two embodiments. SR denotes Success Rate. We scale on-policy RL training across multiple dimensions: (top-left) we observe continual performance improvement with scaling RL training; (top-middle) we leverage hundreds of parallel rollouts for higher throughput; (top-right) we develop a transformer-based policy scaling model parameters to hundreds of millions.
  • Figure 2: PoliFormer is a fully transformer-based policy model. At each timestep $t$, it takes an egocentric RGB observation $i^t$ and extracts visual representations $r^t$ with a vision transformer. It then encodes state features $s^t$ from the visual representations and goal features $g$ (and, optionally, detected bounding-box goal features $g_b^t$), models a state belief $b^t$ over time with a causal transformer decoder, and finally predicts action logits $a^t$ and a value estimate $e^t$ via linear actor and critic heads, respectively. For rollout collection and inference, we use a KV-cache (Pope et al., 2023) as our temporal cache strategy, which avoids recomputing the forward pass for all prior timesteps at each new timestep, saving memory and speeding up both training and inference.
  • Figure 3: We use PoliFormer-BoxNav zero-shot to find a book with a particular title, navigate to a kitchen, navigate to multiple objects sequentially, and follow a toy car around an office building.
  • Figure 4: Attention Masks for training with block lower triangular structure.
  • Figure 5: Different temporal cache strategies and their impact on training speed. We ablate four cache strategies: (i) No-Cache, (ii) Feature-Cache, (iii) State-Cache, and (iv) KV-Cache, illustrated at the top. The bottom chart shows the training steps per second (SPS) achieved by each strategy on both the LoCoBot and Stretch RE-1 agents.
  • ...and 1 more figure
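
The KV-cache idea named in the captions for Figures 2 and 5 can be illustrated with a minimal, single-head sketch in plain Python. This is not the authors' implementation: the names `attend`, `full_recompute`, and `KVCache`, the toy dimensions, and the random weights are all assumptions made for illustration. The sketch contrasts a no-cache baseline, which re-projects the whole history at every timestep, with a cache that stores each timestep's key and value once and reuses them.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(matrix, vec):
    return [dot(row, vec) for row in matrix]


def attend(query, keys, values):
    """Softmax attention of a single query over the given keys and values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [dot(query, k) * scale for k in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def full_recompute(states, wq, wk, wv):
    """No-cache baseline: re-project every past state at each timestep."""
    outputs = []
    for t in range(len(states)):
        history = states[: t + 1]  # causal: only timesteps <= t are visible
        keys = [matvec(wk, s) for s in history]
        values = [matvec(wv, s) for s in history]
        outputs.append(attend(matvec(wq, states[t]), keys, values))
    return outputs


class KVCache:
    """Hypothetical helper: keeps past keys/values so each step projects once."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, state, wq, wk, wv):
        # Only the newest state is projected; older entries are reused.
        self.keys.append(matvec(wk, state))
        self.values.append(matvec(wv, state))
        return attend(matvec(wq, state), self.keys, self.values)


def random_matrix(rng, rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
dim = 4  # toy feature size, chosen arbitrarily
wq, wk, wv = (random_matrix(rng, dim, dim) for _ in range(3))
states = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(6)]

baseline = full_recompute(states, wq, wk, wv)
cache = KVCache()
cached = [cache.step(s, wq, wk, wv) for s in states]

# Largest element-wise difference between the two strategies.
max_gap = max(abs(a - b)
              for row_a, row_b in zip(baseline, cached)
              for a, b in zip(row_a, row_b))
print(f"max difference between strategies: {max_gap:.2e}")
```

Both strategies produce the same outputs. The difference is cost: over an episode of T timesteps the baseline performs on the order of T² key/value projections, while the cached version performs T. That is the kind of saving the caption attributes to avoiding recomputation of prior timesteps.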