PoliFormer: Scaling On-Policy RL with Transformers Results in Masterful Navigators
Kuo-Hao Zeng, Zichen Zhang, Kiana Ehsani, Rose Hendrix, Jordi Salvador, Alvaro Herrasti, Ross Girshick, Aniruddha Kembhavi, Luca Weihs
TL;DR
PoliFormer advances scalable on-policy reinforcement learning for long-horizon embodied navigation by introducing a fully transformer-based policy that processes RGB observations through a frozen vision backbone, a transformer state encoder, and a causal decoder that uses a KV-cache for memory. Trained at scale with hundreds of parallel rollouts, it achieves state-of-the-art results on four simulation benchmarks across two robot embodiments and demonstrates strong zero-shot sim-to-real transfer in real-world deployments. A zero-shot extension, PoliFormer-BoxNav, shows promise as a navigation foundation model for downstream tasks such as open-vocabulary and multi-target navigation. These results highlight the potential of large-scale transformer policies in embodied AI and lay groundwork for general-purpose, promptable navigation systems.
Abstract
We present PoliFormer (Policy Transformer), an RGB-only indoor navigation agent trained end-to-end with reinforcement learning at scale that generalizes to the real world without adaptation despite being trained purely in simulation. PoliFormer uses a foundational vision transformer encoder with a causal transformer decoder enabling long-term memory and reasoning. It is trained for hundreds of millions of interactions across diverse environments, leveraging parallelized, multi-machine rollouts for efficient training with high throughput. PoliFormer is a masterful navigator, producing state-of-the-art results across two distinct embodiments, the LoCoBot and Stretch RE-1 robots, and four navigation benchmarks. It breaks through the plateaus of previous work, achieving an unprecedented 85.5% success rate in object goal navigation on the CHORES-S benchmark, a 28.5% absolute improvement. PoliFormer can also be trivially extended to a variety of downstream applications such as object tracking, multi-object navigation, and open-vocabulary navigation with no finetuning.
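
To illustrate how a causal decoder can serve as memory during a rollout, the following is a minimal sketch of single-head causal attention with a key/value cache. It is not PoliFormer's implementation: the class `KVCacheAttention`, the helper names, and the use of identity projections (each state vector acts as its own query, key, and value) are assumptions made for brevity. The sketch only shows that attending over cached keys and values at each step reproduces full causal attention without recomputing earlier steps.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention of one query over the given keys/values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

class KVCacheAttention:
    """Hypothetical single-head causal attention layer with a KV cache.

    Assumption: identity projections, so each state is its own query, key,
    and value. A real decoder would use learned projections and many heads.
    """

    def __init__(self):
        self.keys = []    # one cached key per past time step
        self.values = []  # one cached value per past time step

    def step(self, state):
        # Append the new step to the cache, then attend over everything so far.
        self.keys.append(list(state))
        self.values.append(list(state))
        return attend(state, self.keys, self.values)

def full_causal_attention(states):
    """Reference: recompute attention over the whole prefix at every step."""
    return [attend(states[t], states[: t + 1], states[: t + 1])
            for t in range(len(states))]

# Three toy per-step state vectors standing in for encoded observations.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
layer = KVCacheAttention()
incremental = [layer.step(s) for s in states]
reference = full_causal_attention(states)
```

With the cache, each new step costs one attention pass over the stored history, whereas the reference recomputes every prefix from scratch; both produce the same outputs.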
