PoliFormer: Scaling On-Policy RL with Transformers Results in Masterful Navigators

Kuo-Hao Zeng; Zichen Zhang; Kiana Ehsani; Rose Hendrix; Jordi Salvador; Alvaro Herrasti; Ross Girshick; Aniruddha Kembhavi; Luca Weihs

TL;DR

PoliFormer advances scalable on-policy reinforcement learning for long-horizon embodied navigation by introducing a fully transformer-based policy that processes RGB observations through a frozen vision backbone, a transformer state encoder, and a causal decoder with KV-cache for memory. Trained at scale with hundreds of parallel rollouts, it achieves state-of-the-art results across four simulation benchmarks in two robot embodiments and demonstrates strong zero-shot sim-to-real transfer on real-world deployments. A zero-shot extension, PoliFormer-BoxNav, shows promise as a navigation foundation model for downstream tasks such as open-vocabulary and multi-target navigation. These results highlight the potential of large-scale transformer policies in embodied AI and lay groundwork for general-purpose, promptable navigation systems.

Abstract

We present PoliFormer (Policy Transformer), an RGB-only indoor navigation agent trained end-to-end with reinforcement learning at scale that generalizes to the real world without adaptation despite being trained purely in simulation. PoliFormer uses a foundational vision transformer encoder with a causal transformer decoder enabling long-term memory and reasoning. It is trained for hundreds of millions of interactions across diverse environments, leveraging parallelized, multi-machine rollouts for efficient training with high throughput. PoliFormer is a masterful navigator, producing state-of-the-art results across two distinct embodiments, the LoCoBot and Stretch RE-1 robots, and four navigation benchmarks. It breaks through the plateaus of previous work, achieving an unprecedented 85.5% success rate in object goal navigation on the CHORES-S benchmark, a 28.5% absolute improvement. PoliFormer can also be trivially extended to a variety of downstream applications such as object tracking, multi-object navigation, and open-vocabulary navigation with no finetuning.

Paper Structure (21 sections, 6 figures, 4 tables)

Introduction
Related Work
Method
The PoliFormer Architecture
Scalable RL Training Recipe
Scaling Environment Interactions
Results
PoliFormer Achieves SoTA on four Benchmarks
PoliFormer Generalizes to the Real World
Ablation Studies
Scaling PoliFormer to Everyday Tasks
Discussion
Details about Zero-shot Real-world Downstream Applications using an Open-Vocab Object Detector and VLM
Additional Training Details
Reward Shaping
...and 6 more sections

Figures (6)

Figure 1: PoliFormer, a transformer-based policy trained using RL at scale in simulation, achieves significant performance improvements in simulation (bottom-left) and the real world (bottom-right), across two embodiments. SR denotes Success Rate. We scale on-policy RL training across multiple dimensions: (top-left) we observe continual performance improvement with scaling RL training; (top-middle) we leverage hundreds of parallel rollouts for higher throughput; (top-right) we develop a transformer-based policy scaling model parameters to hundreds of millions.
Figure 2: PoliFormer is a fully transformer-based policy model. At each timestep $t$, it takes an ego-centric RGB observation $i^t$ and extracts visual representations $r^t$ using a vision transformer model. It then encodes state features $s^t$ from the visual representations and goal features $g$ (and, optionally, detected bounding box goal features $g_b^t$), models the state belief $b^t$ over time with a causal transformer decoder, and finally predicts action logits $a^t$ and a value estimate $e^t$ via linear actor and critic heads, respectively. For rollout collection and inference, we use the KV-cache (Pope et al., 2023) as our temporal cache strategy. This avoids recomputing the forward pass for all prior timesteps at each new timestep, saving memory and speeding up both training and inference.
Figure 3: We use PoliFormer-BoxNav zero-shot to find a book with a particular title, navigate to a kitchen, navigate to multiple objects sequentially, and follow a toy car around an office building.
Figure 4: Attention Masks for training with block lower triangular structure.
Figure 5: Different temporal cache strategies and their impact on training speed. We ablate four cache strategies, (i) No-Cache, (ii) Feature-Cache, (iii) State-Cache, and (iv) KV-Cache, shown at top. The bottom chart shows the training Steps per Second (SPS) achieved by each strategy on both the LoCoBot and Stretch RE-1 agents.
...and 1 more figure
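
The KV-cache idea mentioned in the Figure 2 and Figure 5 captions can be illustrated with a minimal, pure-Python sketch. It is not the authors' implementation: it assumes a single attention head and a single layer, uses random vectors in place of learned query/key/value projections, and the names `attend`, `full_causal_attention`, and `KVCache` are hypothetical. It only shows that attending over cached keys and values, one timestep at a time, reproduces the output of causal attention recomputed over the whole history.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-head scaled dot-product attention of one query over keys/values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def full_causal_attention(queries, keys, values):
    """No-cache baseline: timestep t attends only to positions 0..t.

    Slicing to t + 1 plays the role of a lower-triangular (causal) mask.
    """
    return [attend(queries[t], keys[: t + 1], values[: t + 1])
            for t in range(len(queries))]


class KVCache:
    """Hypothetical helper: keeps the keys/values of past timesteps."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, query, key, value):
        # Append only the newest key/value; earlier ones are reused as-is.
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)


def random_vectors(n, dim, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
T, D = 6, 4  # illustrative sequence length and feature size (assumptions)
queries = random_vectors(T, D, rng)
keys = random_vectors(T, D, rng)
values = random_vectors(T, D, rng)

reference = full_causal_attention(queries, keys, values)

cache = KVCache()
incremental = [cache.step(q, k, v) for q, k, v in zip(queries, keys, values)]

# Largest element-wise gap between the two computations.
max_diff = max(abs(a - b)
               for ref_row, inc_row in zip(reference, incremental)
               for a, b in zip(ref_row, inc_row))
```

In this sketch the baseline re-attends over the full history for every timestep, while the cached version performs one attention call per new observation. That is the saving the captions attribute to the KV-cache during rollout collection and inference.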
