
LLMs Are In-Context Bandit Reinforcement Learners

Giovanni Monea, Antoine Bosselut, Kianté Brantley, Yoav Artzi

TL;DR

This work demonstrates that large language models can exhibit in-context bandit reinforcement learning by learning online from their own predictions and rewards without parameter updates. The authors compare Naive, Naive+, and Stochastic prompting strategies across multiple model families and five classification tasks, showing substantial, scalable improvements that in many cases approach supervised in-context learning upper bounds. Key contributions include identifying the importance of positive-reward filtering, introducing Stochastic ICRL to stabilize learning, and showing that larger models improve not only performance but also learning stability. The findings reveal both the potential and limitations of ICRL in current LLMs, highlighting exploration-driven reward signals and the need to address instability and learning from mistakes for practical deployment.

Abstract

Large Language Models (LLMs) excel at in-context learning (ICL), a supervised learning technique that relies on adding annotated examples to the model context. We investigate a contextual bandit version of in-context reinforcement learning (ICRL), where models learn in-context, online, from external reward, instead of supervised data. We show that LLMs effectively demonstrate such learning, and provide a detailed study of the phenomena, experimenting with challenging classification tasks and models of sizes from 500M to 70B parameters. This includes identifying and addressing the instability of the process, demonstrating learning with both semantic and abstract labels, and showing scaling trends. Our findings highlight ICRL capabilities in LLMs, while also underscoring fundamental limitations in their implicit reasoning about errors.
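
The online loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `predict` is a hypothetical stand-in for an LLM call, the reward is assumed to be binary (1 when the prediction matches the gold label), and the `positive_only` and `p_keep` parameters are assumptions about how the three strategies might differ. Under these assumptions, `positive_only=False, p_keep=1.0` would correspond to keeping every past episode (Naive-like), `positive_only=True, p_keep=1.0` to positive-reward filtering (Naive+-like), and `positive_only=True, p_keep<1.0` to a randomly subsampled context (Stochastic-like).

```python
import random
from typing import Callable, List, Sequence, Tuple

# One past interaction: (query, predicted label, reward).
Episode = Tuple[str, str, int]


def build_context(
    episodes: Sequence[Episode],
    positive_only: bool,
    p_keep: float,
    rng: random.Random,
) -> List[Episode]:
    """Select which past episodes are placed in the model context.

    positive_only drops episodes with non-positive reward; p_keep is the
    (assumed) probability of keeping each remaining episode.
    """
    kept = []
    for episode in episodes:
        if positive_only and episode[2] <= 0:
            continue
        if rng.random() < p_keep:
            kept.append(episode)
    return kept


def run_icrl(
    stream: Sequence[Tuple[str, str]],
    predict: Callable[[List[Episode], str], str],
    positive_only: bool = True,
    p_keep: float = 1.0,
    seed: int = 0,
) -> Tuple[List[Episode], List[int]]:
    """Run a contextual-bandit loop with no parameter updates.

    stream yields (query, gold label) pairs; the gold label is used only
    to compute the reward and is never shown to the policy.
    """
    rng = random.Random(seed)
    episodes: List[Episode] = []
    rewards: List[int] = []
    for query, gold in stream:
        context = build_context(episodes, positive_only, p_keep, rng)
        prediction = predict(context, query)  # hypothetical LLM call
        reward = 1 if prediction == gold else 0
        episodes.append((query, prediction, reward))
        rewards.append(reward)
    return episodes, rewards
```

Only the model's own prediction and its reward are stored, so a wrong guess never reveals the correct label. That is the key difference from supervised in-context learning, where the context holds annotated examples.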

Paper Structure (39 sections, 19 figures, 7 tables, 3 algorithms)

This paper contains 39 sections, 19 figures, 7 tables, 3 algorithms.

Figures (19)

  • Figure 1: Illustration of in-context bandit reinforcement learning. The context shows a sequence of user queries, model responses, and feedback in the Banking77 77-label classification domain. The model learns in-context from rewards given to its previous predictions. The final prediction (shown in red) represents the model's current guess.
  • Figure 2: Performance of ICRL. Naive, Naive+, and Stochastic held-out test results for Llama and Qwen on all tasks. Naive+ and Stochastic consistently outperform zero-shot (i.e., first step) and Naive, while also showing consistent trends of continual improvement as more data is observed. Start and end accuracies are detailed in the main results table (tab:main_results) of the results appendix (sec:appendix:results).
  • Figure 3: Reward ablations. Test accuracies of Naive and Stochastic with different reward signals. Positive reward only is the best choice for both methods. With Naive, no other strategy facilitates learning. Start and end accuracies are detailed in the ablation results table (tab:results_ablations) of the results appendix (sec:appendix:results).
  • Figure 4: ICRL with Abstract Labels. We evaluate whether LLMs can learn tasks whose labels carry no semantic meaning by mapping each label to label_{number}. Even without initial exemplar demonstrations, Qwen and Llama show increasing performance over time. Gemini similarly excels when given an initial mapping, but struggles in a purely exploratory setting. Start and end accuracies are detailed in the abstract tasks table (tab:abstract_tasks) of the results appendix (sec:appendix:results).
  • Figure 5: Comparison of Qwen models (500M to 72B). We analyze scaling accuracy gains (a) and stability differences (b).
  • ...and 14 more figures
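
The abstract-label setup mentioned in the Figure 4 caption can be illustrated with a small sketch. The helper name `make_abstract_label_map`, the zero-based numbering, and the use of input order to assign numbers are all assumptions made for illustration; the caption only states that each label is mapped to label_{number}.

```python
from typing import Dict, Sequence, Tuple


def make_abstract_label_map(
    labels: Sequence[str],
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Map semantic labels to meaningless names and back.

    Numbering is zero-based and follows input order (an assumption).
    """
    forward = {label: f"label_{i}" for i, label in enumerate(labels)}
    inverse = {abstract: label for label, abstract in forward.items()}
    return forward, inverse


# Hypothetical intent names, used only as example input.
forward, inverse = make_abstract_label_map(["card_arrival", "lost_card"])
```

With such a mapping, the model sees only names like `label_0` in its context, so any accuracy gain has to come from rewards rather than from the meaning of the label text.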