In-Context Reinforcement Learning for Variable Action Spaces

Viacheslav Sinii; Alexander Nikulin; Vladislav Kurenkov; Ilya Zisman; Sergey Kolesnikov

TL;DR

This work tackles in-context reinforcement learning under variable discrete action spaces by removing the conventional output head and using random action embeddings; actions are inferred from context via a contrastive objective, enabling zero-shot generalization to unseen actions. The resulting model, Headless-AD, generalizes to action spaces up to $5\times$ larger than those seen during training and often outperforms specially trained baselines across Bernoulli and contextual bandits as well as a Darkroom-style MDP. Key contributions are the elimination of action-space dependence through an action-embedding prompt and the use of the InfoNCE loss to train a policy-improvement operator that works over variable action sets. The approach moves foundational RL models toward versatility across diverse and evolving action spaces, with implications for scalable, pretrainable agents in real-world settings.

Abstract

Recently, it has been shown that transformers pre-trained on diverse datasets with multi-episode contexts can generalize to new reinforcement learning tasks in-context. A key limitation of previously proposed models is their reliance on a predefined action space size and structure. The introduction of a new action space often requires data re-collection and model re-training, which can be costly for some applications. In our work, we show that it is possible to mitigate this issue by proposing the Headless-AD model that, despite being trained only once, is capable of generalizing to discrete action spaces of variable size, semantic content and order. By experimenting with Bernoulli and contextual bandits, as well as a gridworld environment, we show that Headless-AD exhibits significant capability to generalize to action spaces it has never encountered, even outperforming specialized models trained for a specific set of actions on several environment configurations. Implementation is available at: https://github.com/corl-team/headless-ad.
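
The headless prediction step described above can be illustrated with a minimal sketch. It is not the authors' implementation (see the linked repository for that). It assumes dot-product similarity, a softmax with a hypothetical `temperature` parameter, and orthonormal embeddings built by Gram-Schmidt; the function names are invented for illustration, and the transformer itself is replaced by a stand-in prediction vector.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def random_action_embeddings(num_actions, dim, rng):
    """Draw fresh random orthonormal vectors, one per action (Gram-Schmidt)."""
    if num_actions > dim:
        raise ValueError("orthonormal embeddings need num_actions <= dim")
    basis = []
    while len(basis) < num_actions:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for b in basis:  # remove the components along earlier vectors
            proj = dot(v, b)
            v = [x - proj * y for x, y in zip(v, b)]
        norm = math.sqrt(dot(v, v))
        if norm > 1e-8:  # skip the (unlikely) degenerate draw
            basis.append([x / norm for x in v])
    return basis


def action_distribution(prediction, embeddings, temperature=1.0):
    """Softmax over similarities between the prediction and each action embedding."""
    logits = [dot(prediction, e) / temperature for e in embeddings]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [x / total for x in exps]


def info_nce_loss(prediction, embeddings, target_action, temperature=1.0):
    """InfoNCE-style loss: the target action is the positive, all others are negatives."""
    probs = action_distribution(prediction, embeddings, temperature)
    return -math.log(probs[target_action])


if __name__ == "__main__":
    rng = random.Random(0)
    embeddings = random_action_embeddings(num_actions=5, dim=16, rng=rng)
    prediction = embeddings[3]  # stand-in for the model output (assumption)
    probs = action_distribution(prediction, embeddings)
    print(max(range(len(probs)), key=probs.__getitem__))
```

Because the embeddings are redrawn for every call and nothing in the sketch depends on a fixed number of actions, the same functions apply unchanged to an action set of any size up to `dim`.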

Paper Structure (27 sections, 6 equations, 13 figures, 3 tables)

Introduction
Algorithm Distillation Struggles with Novel Action Spaces
Headless-AD
Experiments
Bernoulli Bandit
Contextual Bandit
Darkroom
Ablations
Action Set Prompt
Contrastive Loss
Orthonormal Action Embeddings
Related Work
Conclusion
Background
Related Work
...and 12 more sections

Figures (13)

Figure 1: Variable Action Spaces: We consider four types of novel action spaces that differ from the one used during training. Permuted Train Actions keeps the contents of the action set but reorders its elements. Test Actions introduces a completely new, larger action set. Some models are architecturally limited to a fixed action set size; to evaluate such models on unseen actions, we make the new set compatible with the model output by slicing the first actions from the Test Actions set (Sliced Test Actions). Lastly, a new action space might include both the seen Train and unseen Test actions, depicted as the All Actions set.
Figure 2: Headless-AD Architecture: Compared to AD, Headless-AD introduces four new components. (1) We remove the output linear head, making the model directly predict the action embedding. That allows us to avoid a direct connection between the model and action space size, contents and ordering. (2.1) At each training step, we generate random action embeddings for each action in the action set. (2.2) We convert actions in the context into their embeddings and pass them as the model input. This prepares the model for unseen actions, forcing it to infer action semantics from the context. (3) As the model loses prior knowledge about action space structure, we pass the generated action embeddings as a prompt to aid the model in sensible action selection. (4) We convert a prediction vector into a distribution over actions based on the similarities between the prediction and previously generated action embeddings. To increase the probability of correct actions, we use contrastive loss instead of cross-entropy.
Figure 3: Algorithm Distillation Struggles with Novel Action Spaces: Despite its good results on the train action set, AD's performance diminishes when the action semantics change, whether through permutation or substitution. Augmenting the training data with permuted action sets does not improve performance, which indicates that action set invariance should be enforced through model design. Additionally, a trained AD model cannot be applied to a larger action set. The bars show success rates on the Darkroom environment (described in the Darkroom section) obtained by evaluating each of the action sets visualized in Figure 1, averaged over 5 runs. Altered Semantics aggregates the values from the Permuted Train Actions and Sliced Test Actions sets. Altered Size aggregates the values from Test Actions and All Actions. See the Darkroom section for more information about the construction of the action sets.
Figure 4: Algorithm Regret under Variable Reward Distributions in Bernoulli Bandit: The graph compares regret for Random, Thompson Sampling, and Headless-AD across distinct reward distributions in the Bernoulli Bandit environment, averaged over five seeds. During training, high reward was $95\%$ more likely to fall on the odd arms. During testing, it either shifted to the even arms or followed a uniform distribution. Headless-AD maintains high performance in all configurations, demonstrating that its in-context learning generalizes to novel tasks, represented here by changes in reward distribution. Data is aggregated from bandit problems with $4$ to $20$ arms, reflecting the training conditions.
Figure 5: Algorithm Regret under Increasing Number of Arms in Bernoulli Bandit: This series of plots shows the regret of the Thompson Sampling, AD, and Headless-AD algorithms over evaluation steps in environments with $20$ to $50$ arms, averaged over five seeds with $100$ bandits each. Although Headless-AD was trained on bandits with up to $20$ arms, it matches or outperforms the other algorithms in larger arm settings without additional training. Note that AD was retrained from scratch for each task with a different number of arms.
...and 8 more figures
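
The evaluation action sets described in the Figure 1 caption can be sketched as follows. This is an illustrative reading of the caption, not the authors' code: the function name `build_action_sets`, the dictionary keys, and the use of a seeded shuffle are assumptions, and the sketch assumes the test set is at least as large as the train set.

```python
import random


def build_action_sets(train_actions, test_actions, seed=0):
    """Build the evaluation action sets named in Figure 1 (hypothetical helper)."""
    train = list(train_actions)
    test = list(test_actions)
    rng = random.Random(seed)

    permuted = list(train)
    rng.shuffle(permuted)  # same contents, new order (may rarely equal the original)

    return {
        "train": train,
        "permuted_train": permuted,
        "test": test,                       # entirely unseen, larger set
        "sliced_test": test[:len(train)],   # unseen actions, train-sized
        "all": train + test,                # seen and unseen actions together
    }


if __name__ == "__main__":
    sets = build_action_sets(["a", "b", "c"], ["d", "e", "f", "g", "h"])
    for name, actions in sets.items():
        print(name, actions)
```

Only the sliced and permuted sets keep the training set's size, which is why they are the ones a fixed-output model such as AD can be evaluated on.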
