Heuristic Transformer: Belief Augmented In-Context Reinforcement Learning
Oliver Dippel, Alexei Lisitsa, Bei Peng
TL;DR
This work introduces Heuristic Transformer (HT), an in-context reinforcement learning method that augments transformer policies with a learned belief over reward functions. HT uses a variational autoencoder to produce a low-dimensional latent embedding $m$ and a posterior $b_h = q(m \mid \eta_{:h})$, which is incorporated into the transformer prompt along with an in-context dataset $D_{pre}$ and a query state $s_{query}$. Training proceeds in two phases: Phase 1 trains the belief model via an ELBO objective, and Phase 2 trains a GPT-style transformer to predict optimal actions, with no online parameter updates during deployment. Across Darkroom, Miniworld, and MuJoCo, HT consistently outperforms strong baselines such as DPT and GFT in online adaptation and generalization, including under transition uncertainty. The results suggest that integrating a learned reward belief with transformer-based ICL provides robust, scalable decision-making in both discrete and continuous action domains, with practical implications for offline-to-online transfer and for settings where online data collection is costly or unsafe.
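As a rough illustration of how a belief might be placed in the prompt next to $D_{pre}$ and $s_{query}$, the sketch below assembles a token sequence in plain Python. It is not the authors' implementation. The function name `build_prompt`, the token layout, the zero-padding, and the representation of $b_h$ as a Gaussian mean and log-variance are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple

# One in-context transition (s, a, s', r). This layout is an assumption.
Transition = Tuple[Sequence[float], Sequence[float], Sequence[float], float]


def build_prompt(
    belief_mean: Sequence[float],
    belief_logvar: Sequence[float],
    dataset: Sequence[Transition],
    s_query: Sequence[float],
) -> List[List[float]]:
    """Return a list of equal-width tokens: one query token, then one per transition.

    Hypothetical layout: the belief parameters are appended to every token,
    and shorter tokens are right-padded with zeros.
    """
    # Assumed belief representation: parameters of a diagonal Gaussian over m.
    belief = list(belief_mean) + list(belief_logvar)

    # Query token: the query state followed by the belief parameters.
    query_token = list(s_query) + belief

    # Context tokens: each transition flattened, followed by the belief parameters.
    context_tokens = [
        list(s) + list(a) + list(s_next) + [float(r)] + belief
        for (s, a, s_next, r) in dataset
    ]

    # Pad every token to the same width so they can be stacked into one sequence.
    width = max(len(tok) for tok in [query_token] + context_tokens)
    return [tok + [0.0] * (width - len(tok)) for tok in [query_token] + context_tokens]


# Toy usage: 2-D states, 1-D actions, 2-D latent belief, two transitions.
toy_dataset = [
    ([0.0, 0.0], [1.0], [0.0, 1.0], 0.0),
    ([0.0, 1.0], [1.0], [0.0, 2.0], 1.0),
]
prompt = build_prompt([0.1, -0.2], [0.0, 0.0], toy_dataset, [0.0, 2.0])
```

In this toy layout the prompt has one token more than the number of transitions in the dataset. A transformer policy would consume the stacked sequence and output an action for the query state.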
Abstract
Transformers have demonstrated exceptional in-context learning (ICL) capabilities, enabling applications across natural language processing, computer vision, and sequential decision-making. In reinforcement learning, ICL reframes learning as a supervised problem, facilitating task adaptation without parameter updates. Building on prior work leveraging transformers for sequential decision-making, we propose Heuristic Transformer (HT), an in-context reinforcement learning (ICRL) approach that augments the in-context dataset with a belief distribution over rewards to achieve better decision-making. Using a variational autoencoder (VAE), we learn a low-dimensional stochastic variable that represents the posterior distribution over rewards; this variable is passed to the transformer policy as part of the prompt, alongside an in-context dataset and query states. We assess the performance of HT across the Darkroom, Miniworld, and MuJoCo environments, showing that it consistently surpasses comparable baselines in terms of both effectiveness and generalization. Our method presents a promising direction for bridging the gap between belief-based augmentations and transformer-based decision-making.
