LLMs Are In-Context Bandit Reinforcement Learners
Giovanni Monea, Antoine Bosselut, Kianté Brantley, Yoav Artzi
TL;DR
This work demonstrates that large language models can exhibit in-context bandit reinforcement learning by learning online from their own predictions and rewards without parameter updates. The authors compare Naive, Naive+, and Stochastic prompting strategies across multiple model families and five classification tasks, showing substantial, scalable improvements that in many cases approach supervised in-context learning upper bounds. Key contributions include identifying the importance of positive-reward filtering, introducing Stochastic ICRL to stabilize learning, and showing that larger models improve not only performance but also learning stability. The findings reveal both the potential and the limitations of ICRL in current LLMs, highlighting exploration-driven reward signals and the need to address instability and learning from mistakes for practical deployment.
Abstract
Large Language Models (LLMs) excel at in-context learning (ICL), a supervised learning technique that relies on adding annotated examples to the model context. We investigate a contextual bandit version of in-context reinforcement learning (ICRL), where models learn in-context, online, from external reward, instead of supervised data. We show that LLMs effectively demonstrate such learning, and provide a detailed study of the phenomenon, experimenting with challenging classification tasks and models ranging in size from 500M to 70B parameters. This includes identifying and addressing the instability of the process, demonstrating learning with both semantic and abstract labels, and showing scaling trends. Our findings highlight ICRL capabilities in LLMs, while also underscoring fundamental limitations in their implicit reasoning about errors.
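To make the contextual bandit setup concrete, the following is a minimal sketch of an online ICRL loop, not the authors' implementation. It assumes that the context is rebuilt at every step from past episodes, that only positively rewarded episodes are eligible, and that each eligible episode is kept independently with probability `p_keep`; this is one plausible way to realize the positive-reward filtering and stochastic context mentioned above. The names `build_context`, `run_icrl`, `toy_predict`, and `p_keep` are hypothetical, and `toy_predict` is a trivial stand-in for an LLM call.

```python
import random
from typing import Callable, List, Sequence, Tuple

Episode = Tuple[str, str, int]  # (input, predicted label, observed reward)


def build_context(episodes: Sequence[Episode], p_keep: float,
                  rng: random.Random, positive_only: bool = True) -> List[Episode]:
    """Sample a context from past episodes.

    Assumption: unrewarded episodes are dropped when positive_only is set,
    and each remaining episode is kept with probability p_keep.
    """
    context = []
    for x, y, r in episodes:
        if positive_only and r <= 0:
            continue
        if rng.random() < p_keep:
            context.append((x, y, r))
    return context


def run_icrl(predict: Callable, stream: Sequence[Tuple[str, str]],
             p_keep: float = 0.5, seed: int = 0) -> List[Episode]:
    """Online loop: predict, observe a reward for that prediction only, store the episode."""
    rng = random.Random(seed)
    episodes: List[Episode] = []
    for x, gold in stream:
        context = build_context(episodes, p_keep, rng)
        y = predict(context, x, rng)
        # Bandit feedback: the learner sees a reward, never the gold label itself.
        r = 1 if y == gold else 0
        episodes.append((x, y, r))
    return episodes


def toy_predict(context: Sequence[Episode], x: str, rng: random.Random,
                labels: Sequence[str] = ("a", "b", "c")) -> str:
    """Hypothetical stand-in for an LLM: reuse a rewarded label if one is in
    the context for this exact input, otherwise explore with a random label."""
    for cx, cy, _ in context:
        if cx == x:
            return cy
    return rng.choice(labels)


if __name__ == "__main__":
    stream = [("q1", "a"), ("q2", "b")] * 50
    episodes = run_icrl(toy_predict, stream, p_keep=0.5, seed=0)
    print("mean reward:", sum(r for _, _, r in episodes) / len(episodes))
```

In this sketch the only supervision is the scalar reward attached to the model's own prediction, so any improvement has to come from exploration followed by reuse of rewarded episodes. Replacing `toy_predict` with a real model call would require formatting the sampled episodes into a prompt, which is left out here.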
