Reward Is Enough: LLMs Are In-Context Reinforcement Learners

Kefan Song, Amir Moeini, Peng Wang, Lei Gong, Rohan Chandra, Shangtong Zhang, Yanjun Qi

TL;DR

This work shows that reinforcement learning can emerge during the inference of large language models (LLMs). It introduces ICRL prompting, a minimal framework that uses only a scalar reward and multi-round prompts to drive self-improvement without any parameter updates, effectively performing in-context RL with context $C_t$ and actions $A_t \sim \pi_{\theta}(S_t, C_t)$. Across domains such as Game of 24, Creative Writing, ScienceWorld, and Olympiad-level math (AIME, HMMT), ICRL prompting yields significant performance gains over baselines like Self-Refine and Reflexion, even when rewards are provided by the same LLM. The results suggest a practical path to test-time scaling and autonomous adaptation in open-ended language tasks, aligning with the reward-is-enough hypothesis that intelligent behavior can be achieved by maximizing scalar feedback through context-driven learning.

Abstract

Reinforcement learning (RL) is a framework for solving sequential decision-making problems. In this work, we demonstrate that, surprisingly, RL emerges during the inference time of large language models (LLMs), a phenomenon we term in-context RL (ICRL). To reveal this capability, we introduce a simple multi-round prompting framework, which we call ICRL prompting, for inference-time self-improvement. The goal of ICRL prompting is to guide LLMs to perform reinforcement learning during inference for self-improvement on a given task. After each response, the model receives numerical scalar feedback, denoted as a reward. In the next round, we prompt the LLM again together with a context that concatenates all prior responses and their associated rewards. We consistently observe that response quality improves as the context grows. In other words, the LLM can optimize scalar reward signals during inference, exhibiting behavior analogous to reinforcement learning. We evaluate ICRL prompting on Game of 24, creative writing, ScienceWorld, and Olympiad-level math competitions (AIME and HMMT), demonstrating significant improvements over baselines such as Self-Refine and Reflexion. Notably, even when the reward signals are generated by the same LLM, ICRL prompting still improves performance, highlighting a promising new paradigm for test-time scaling.
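The multi-round loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `icrl_prompting`, the callables `llm` and `reward_fn`, the prompt wording, and the toy stand-ins at the bottom are all assumptions introduced here. The only parts taken from the text are that each round produces a response, the response receives a scalar reward, and the next prompt carries a context concatenating all prior responses and rewards.

```python
from typing import Callable, List, Tuple


def icrl_prompting(
    task: str,
    llm: Callable[[str], str],                # hypothetical: prompt -> response
    reward_fn: Callable[[str, str], float],   # hypothetical: (task, response) -> scalar reward
    num_rounds: int = 5,
) -> List[Tuple[str, float]]:
    """Respond, score, and feed every (response, reward) pair back as context."""
    history: List[Tuple[str, float]] = []
    for _ in range(num_rounds):
        # Concatenate all prior responses with their associated rewards.
        context = "".join(
            f"Attempt {i + 1}:\n{resp}\nReward: {rew}\n\n"
            for i, (resp, rew) in enumerate(history)
        )
        # The instruction wording below is illustrative only.
        prompt = f"Task: {task}\n\n{context}Give a new response that earns a higher reward."
        response = llm(prompt)
        reward = reward_fn(task, response)
        history.append((response, reward))
    return history


# Toy stand-ins so the sketch runs without a real model (assumptions only):
# the "LLM" answers with the number of rewards it sees in its context,
# and the reward is that number, so rewards grow as the context grows.
def toy_llm(prompt: str) -> str:
    return str(prompt.count("Reward:"))


def toy_reward(task: str, response: str) -> float:
    return float(response)


history = icrl_prompting("toy task", toy_llm, toy_reward, num_rounds=3)
```

No parameters are updated anywhere in this loop; the only thing that changes between rounds is the context string. The reward function can be an external environment or, as the abstract notes, the same LLM acting as evaluator.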

Paper Structure

This paper contains 21 sections, 16 figures, 5 tables, 1 algorithm.

Figures (16)

  • Figure 1: ICRL Prompting. At each episode $k+1$, the LLM generates action tokens based on previous experiences up to episode $k$, and receives numerical rewards either from itself as the evaluator or from the environment. At the end of the episode, the rewards are concatenated with the action tokens and placed back into the context.
  • Figure 2: Baseline Method Comparison. (Left) Mean Success Rate on Game of 24. (Middle) Mean Coherence Reward on Creative Writing. Both ICRL Preset and Self-Refine went through an additional run of 50 episodes. (Right) Mean Return on ScienceWorld. A running max version of the plots is available in the appendix. This plot shows the quality of the response at the current trial, while the running max version shows the quality of the best response so far. The shaded region represents $\pm 1$ standard error of the performance calculated across the evaluated tasks.
  • Figure 3: Ablation Studies (Running Max). (Left) The mean of running max success rate on Game of 24. (Middle) The mean of running max coherence reward on Creative Writing. (Right) The mean of running max return on ScienceWorld. The shaded region represents $\pm 1$ standard error of the mean (SEM) of the performance calculated across the evaluated tasks within each benchmark.
  • Figure 4: The Exploration Instruction ($s_\text{ICRL}$).
  • Figure 5: The Exploitation Instruction ($s_\text{ICRL}$).
  • ...and 11 more figures
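The two quantities used in the captions of Figures 2 and 3, the running max over trials and the standard error across tasks, can be illustrated with a short sketch. The helper names `running_max` and `standard_error` and the sample numbers are hypothetical, and the use of the sample variance (dividing by $n-1$) is an assumption, since the captions do not state which estimator is used.

```python
import math
from typing import List


def running_max(rewards: List[float]) -> List[float]:
    """Best reward seen up to and including each trial."""
    best = -math.inf
    out: List[float] = []
    for r in rewards:
        best = max(best, r)
        out.append(best)
    return out


def standard_error(values: List[float]) -> float:
    """Standard error of the mean, using the sample variance (n - 1); an assumption."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return math.sqrt(variance / n)


# Illustrative per-trial rewards for one task (made-up numbers).
per_trial = [0.2, 0.5, 0.3, 0.9]
best_so_far = running_max(per_trial)
```

The per-trial curve can go down between rounds, whereas the running max curve is non-decreasing by construction, which is why the two versions of the plots can look different.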