Reward Is Enough: LLMs Are In-Context Reinforcement Learners
Kefan Song, Amir Moeini, Peng Wang, Lei Gong, Rohan Chandra, Shangtong Zhang, Yanjun Qi
TL;DR
This work shows that reinforcement learning can emerge at inference time in large language models (LLMs). It introduces ICRL prompting, a minimal framework that uses only a scalar reward and multi-round prompts to drive self-improvement without any parameter updates, effectively performing in-context RL with context $C_t$ and actions $A_t \sim \pi_{\theta}(\cdot \mid S_t, C_t)$. Across Game of 24, Creative Writing, ScienceWorld, and Olympiad-level math (AIME, HMMT), ICRL prompting yields significant performance gains over baselines such as Self-Refine and Reflexion, even when the rewards are generated by the same LLM. The results suggest a practical path to test-time scaling and autonomous adaptation in open-ended language tasks, in line with the reward-is-enough hypothesis that intelligent behavior can arise from maximizing scalar feedback, here through context-driven learning.
Abstract
Reinforcement learning (RL) is a framework for solving sequential decision-making problems. In this work, we demonstrate that, surprisingly, RL emerges at inference time in large language models (LLMs), a phenomenon we term in-context RL (ICRL). To reveal this capability, we introduce a simple multi-round prompting framework, which we call ICRL prompting, for inference-time self-improvement. Its goal is to guide LLMs to perform reinforcement learning during inference so that they improve on a given task. After each response, the model receives scalar numerical feedback, denoted as a reward. In the next round, we prompt the LLM again with a context that concatenates all prior responses and their associated rewards. We consistently observe that response quality improves as the context grows. In other words, the LLM can optimize scalar reward signals during inference, exhibiting behavior analogous to reinforcement learning. We evaluate ICRL prompting on Game of 24, creative writing, ScienceWorld, and Olympiad-level math competitions (AIME and HMMT), demonstrating significant improvements over baselines such as Self-Refine and Reflexion. Notably, even when the reward signals are generated by the same LLM, ICRL prompting still improves performance, highlighting a promising new paradigm for test-time scaling.
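The multi-round loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `generate` (a call to an LLM), `reward_fn` (a scalar scorer, possibly the same LLM), the function names, and the prompt wording are all hypothetical placeholders. The only parts taken from the abstract are the structure of the loop (respond, receive a scalar reward, re-prompt with all prior responses and rewards concatenated into the context) and the absence of parameter updates.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, float]]


def build_prompt(task: str, history: History) -> str:
    """Concatenate the task with every prior response and its reward.

    The exact wording is an assumption made for illustration only.
    """
    lines = [f"Task: {task}"]
    for i, (response, reward) in enumerate(history, start=1):
        lines.append(f"Attempt {i}: {response}")
        lines.append(f"Reward: {reward}")
    lines.append("Give a new response that achieves a higher reward.")
    return "\n".join(lines)


def icrl_prompting(
    task: str,
    generate: Callable[[str], str],     # hypothetical: prompt -> LLM response
    reward_fn: Callable[[str], float],  # hypothetical: response -> scalar reward
    num_rounds: int = 5,
) -> History:
    """Run the multi-round loop and return the (response, reward) history.

    No model parameters are updated; the only thing that changes between
    rounds is the context passed to `generate`.
    """
    history: History = []
    for _ in range(num_rounds):
        prompt = build_prompt(task, history)  # context grows every round
        response = generate(prompt)
        reward = reward_fn(response)
        history.append((response, reward))
    return history
```

In use, `generate` would wrap whatever LLM client is available and `reward_fn` would wrap a task-specific checker or an LLM-based scorer. The highest-reward attempt can then be read off with `max(history, key=lambda pair: pair[1])`.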
