Heuristic Transformer: Belief Augmented In-Context Reinforcement Learning

Oliver Dippel, Alexei Lisitsa, Bei Peng

TL;DR

This work introduces Heuristic Transformer (HT), an in-context reinforcement learning method that augments transformer policies with a learned belief over reward functions. HT uses a variational autoencoder to produce a low-dimensional latent embedding $m$ and a posterior $b_h = q_{\phi}(m \mid \eta_{:h})$, which is incorporated into the transformer prompt along with an in-context dataset $D_{\text{pre}}$ and a query state $s_{\text{query}}$. Phase 1 trains the belief model via an ELBO objective, and Phase 2 trains a GPT-style transformer to predict optimal actions without online updates during deployment. Across Darkroom, Miniworld, and MuJoCo, HT consistently outperforms strong baselines such as DPT and GFT in online adaptation and generalization, including under transition uncertainty. The results suggest that integrating a learned reward belief with transformer-based ICL provides robust, scalable decision-making in both discrete and continuous action domains, with practical implications for offline-to-online transfer and for settings where online data collection is costly or unsafe.
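The Phase 1 objective mentioned above can be illustrated with a minimal sketch of a one-sample negative ELBO for a diagonal-Gaussian belief over the latent $m$. This is not the authors' implementation: the standard-normal prior, the unit-variance Gaussian reward likelihood, the `kl_weight` parameter, and the `predict_reward` decoder are all assumptions made for illustration.

```python
import math
import random


def gaussian_kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))


def sample_latent(mu, log_var, rng):
    """Reparameterised sample: m = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def negative_elbo(mu, log_var, transitions, predict_reward, rng, kl_weight=1.0):
    """One-sample Monte Carlo estimate of the negative ELBO.

    transitions: list of (state, action, reward) tuples.
    predict_reward: hypothetical decoder, called as predict_reward(m, state, action).
    The reconstruction term assumes a unit-variance Gaussian reward likelihood,
    so it reduces to half the squared error (constants dropped).
    """
    m = sample_latent(mu, log_var, rng)
    recon = 0.5 * sum((predict_reward(m, s, a) - r) ** 2 for s, a, r in transitions)
    return recon + kl_weight * gaussian_kl_to_standard_normal(mu, log_var)
```

Minimising this quantity with respect to the encoder parameters (which would produce `mu` and `log_var` from the observed transitions) and the decoder parameters is the usual way a VAE is trained; the exact likelihood and prior used by HT may differ.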

Abstract

Transformers have demonstrated exceptional in-context learning (ICL) capabilities, enabling applications across natural language processing, computer vision, and sequential decision-making. In reinforcement learning, ICL reframes learning as a supervised problem, facilitating task adaptation without parameter updates. Building on prior work leveraging transformers for sequential decision-making, we propose Heuristic Transformer (HT), an in-context reinforcement learning (ICRL) approach that augments the in-context dataset with a belief distribution over rewards to achieve better decision-making. Using a variational auto-encoder (VAE), a low-dimensional stochastic variable is learned to represent the posterior distribution over rewards, which is incorporated alongside an in-context dataset and query states as prompt to the transformer policy. We assess the performance of HT across the Darkroom, Miniworld, and MuJoCo environments, showing that it consistently surpasses comparable baselines in terms of both effectiveness and generalization. Our method presents a promising direction to bridge the gap between belief-based augmentations and transformer-based decision-making.
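As a rough illustration of how a belief, an in-context dataset, and a query state might be combined into a single prompt, the sketch below flattens each component into a token and zero-pads all tokens to a common width. The token order, the padding scheme, the representation of the belief as a mean and log-variance, and the function names `pad` and `build_prompt` are hypothetical; the summary above does not specify the actual layout used by HT.

```python
def pad(vec, width):
    """Right-pad a vector with zeros to the given width."""
    if len(vec) > width:
        raise ValueError("vector is longer than the target width")
    return list(vec) + [0.0] * (width - len(vec))


def build_prompt(belief_mu, belief_log_var, d_pre, s_query):
    """Assemble a prompt as a list of equal-width tokens (assumed layout).

    belief_mu, belief_log_var: parameters of the belief over the latent.
    d_pre: list of (state, action, reward, next_state) transitions.
    s_query: the state for which an action is requested.
    """
    belief_token = list(belief_mu) + list(belief_log_var)
    query_token = list(s_query)
    transition_tokens = [
        list(s) + list(a) + [float(r)] + list(s_next)
        for s, a, r, s_next in d_pre
    ]
    tokens = [belief_token, query_token] + transition_tokens
    width = max(len(t) for t in tokens)
    return [pad(t, width) for t in tokens]


# Usage: one transition in a 2-D grid world with a one-hot action over 3 choices.
prompt = build_prompt(
    belief_mu=[0.1, 0.2],
    belief_log_var=[0.0, 0.0],
    d_pre=[([0.0, 1.0], [1.0, 0.0, 0.0], 1.0, [0.0, 2.0])],
    s_query=[3.0, 4.0],
)
```

In a real model each token would typically pass through a learned embedding layer before reaching the transformer; the fixed-width padding here only stands in for that step.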

Paper Structure

This paper contains 47 sections, 17 equations, 9 figures, 1 table, 1 algorithm.

Figures (9)

  • Figure 1: Illustration of the two-phase framework in HT. In Phase 1, a variational autoencoder (VAE) infers the belief distribution over reward functions $b_{h}=q_{\phi}(m\mid\eta_{:h})$ from offline transitions $\eta$. In Phase 2, the learned belief $b_h$, the query state $s_{\text{query}}$, and the in-context dataset $D_{\text{pre}}$ together form the input to the transformer policy model $M_{\theta}$, which learns the optimal action distribution $a^{\star}(\cdot)$.
  • Figure 2: (a) Online performance on test goals in Darkroom. (b) Darkroom after certain pre-training epochs. (c) Darkroom Hard. (d) Darkroom Hard after certain pre-training epochs. Results are mean return $\pm$ std over 20 trials across 10 seeds.
  • Figure 3: Online performance on test goals in the Darkroom Stochastic environment under varying levels of transition noise. (a) 20% random action misdirection. (b) 40% random action misdirection. Results are mean return $\pm$ std, averaged over 20 trials and 5 seeds.
  • Figure 4: (a) Online performance on test goals in Miniworld. (b) Online performance on test goals in Miniworld, after certain pre-training epochs. Results are the mean return $\pm$ standard deviation on 20 trials across 5 seeds.
  • Figure 5: Online cumulative regret in the Bandit environment. Results are reported as the mean return $\pm$ standard deviation on 20 trials across 10 seeds.
  • ...and 4 more figures