Improving Interactive In-Context Learning from Natural Language Feedback

Martin Klissarov, Jonathan Cook, Diego Antognini, Hao Sun, Jingling Li, Natasha Jaques, Claudiu Musat, Edward Grefenstette

TL;DR

This work tackles how language models can learn from corrective natural-language feedback by reframing feedback as multi-turn didactic interactions between a student and a teacher with privileged information. It introduces Reinforcement Learning with Language Feedback (RL$^2$F), a scalable method that converts single-turn verifiable problems into multi-turn dialogues and trains the student to integrate feedback via RL. The results show that a smaller model trained with RL$^2$F nearly matches a much larger model on verifiable reasoning tasks and generalizes to coding, puzzles, and maze navigation, driven by enhanced in-context plasticity. It further demonstrates a pathway to self-improvement by having the model internalize the feedback loop and self-correct at inference, i.e., becoming autodidactic, with broad implications for data-efficient continual learning.

Abstract

Adapting one's thought process based on corrective feedback is an essential ability in human learning, particularly in collaborative settings. In contrast, the current large language model training paradigm relies heavily on modeling vast, static corpora. While effective for knowledge acquisition, it overlooks the interactive feedback loops essential for models to adapt dynamically to their context. In this work, we propose a framework that treats this interactive in-context learning ability not as an emergent property, but as a distinct, trainable skill. We introduce a scalable method that transforms single-turn verifiable tasks into multi-turn didactic interactions driven by information asymmetry. We first show that current flagship models struggle to integrate corrective feedback on hard reasoning tasks. We then demonstrate that models trained with our approach dramatically improve the ability to interactively learn from language feedback. More specifically, the multi-turn performance of a smaller model nearly reaches that of a model an order of magnitude larger. We also observe robust out-of-distribution generalization: interactive training on math problems transfers to diverse domains like coding, puzzles and maze navigation. Our qualitative analysis suggests that this improvement is due to an enhanced in-context plasticity. Finally, we show that this paradigm offers a unified path to self-improvement. By training the model to predict the teacher's critiques, effectively modeling the feedback environment, we convert this external signal into an internal capability, allowing the model to self-correct even without a teacher.

Paper Structure (13 sections, 4 equations, 9 figures, 2 tables)

Figures (9)

  • Figure 1: Trained through RL on multi-turn didactic interactions, Gemini 2.5 Flash nearly reaches the performance of Gemini 2.5 Pro on the challenging HardMath2 dataset. Single-turn RL only slightly improves on the baseline's ability to learn in context from language feedback across turns.
  • Figure 2: (Top) Didactic Interactions via Information Asymmetry: We transform single-turn problems into multi-turn didactic interactions. A teacher model, conditioned on privileged information (e.g., the ground-truth solution), provides natural language feedback to a student model without revealing the final answer, guiding it to correct its errors. (Middle) Train-time RL Fine-tuning: We train the student model to effectively incorporate language feedback using RL. The student iterates through multiple turns; if the answer is correct (checked via automatic verification), a reward ($R=+1$) is granted and the interaction ends. If it is incorrect, the teacher provides feedback. If the interaction reaches the maximum number of turns, $T_{max}$, without a correct answer, the reward is zero. (Bottom) Inference-time Evaluations: We assess the trained model in three settings: (1) Learning from language feedback (interacting with an external source of language feedback), (2) General Multi-turn Tasks (out-of-domain tasks like logic puzzles or games), and (3) In-context Self-Improvement, where the model plays the role of both student and teacher to self-correct.
  • Figure 3: Comparison of the interactive ability between two leading closed-source models. The user prompts both models to "be concise" after an initial lengthy explanation. Model B (Right) correctly interprets this as a request to summarize the previous explanation. Model A (Left) interprets this as a setting change for future turns and acknowledges the instruction without actually rewriting the content, failing to satisfy the user's obvious intent.
  • Figure 4: Cumulative accuracy across four hard reasoning domains: HardMath2, ARC-AGI, BBEH, and Codeforces. The plots illustrate the limited ability of the Gemini 2.5 models and GPT-5 to incorporate language feedback over multiple turns. We observe this ability scales with model size, with GPT-5 generally exhibiting stronger performance gains compared to the Gemini 2.5 Pro model.
  • Figure 5: RL$^2$F on multi-turn didactic interactions is key to improving the interactive in-context learning abilities of LLMs. The resulting gains also transfer significantly better to out-of-distribution tasks than those from single-turn RL.
  • ...and 4 more figures
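
The interaction protocol described in the Figure 2 caption, together with the cumulative accuracy metric named in Figure 4, can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `rollout`, `Episode`, `cumulative_accuracy`, and the toy student, teacher, and verifier are all hypothetical, and the student and teacher are plain callables standing in for language models. Skipping teacher feedback after the final turn is also an assumption, since the caption does not specify it.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

History = List[Tuple[str, str]]  # (student answer, teacher feedback) pairs


@dataclass
class Episode:
    """Record of one multi-turn interaction (hypothetical container)."""
    turns: History = field(default_factory=list)
    reward: float = 0.0              # R = +1 on a verified answer, else 0
    solved_at: Optional[int] = None  # 1-indexed turn of the correct answer


def rollout(
    problem: str,
    solution: str,
    student: Callable[[str, History], str],
    teacher: Callable[[str, str, str, History], str],
    verify: Callable[[str, str], bool],
    t_max: int,
) -> Episode:
    """Run one student-teacher interaction for at most t_max turns."""
    history: History = []
    episode = Episode()
    for turn in range(1, t_max + 1):
        # The student sees only the problem and the dialogue so far.
        answer = student(problem, history)
        if verify(answer, solution):
            episode.turns.append((answer, ""))
            episode.reward = 1.0
            episode.solved_at = turn
            return episode
        # Assumption: no feedback is generated after the last turn.
        feedback = ""
        if turn < t_max:
            # The teacher is conditioned on the privileged solution.
            feedback = teacher(problem, solution, answer, history)
        history.append((answer, feedback))
        episode.turns.append((answer, feedback))
    return episode  # turn budget exhausted: reward stays 0


def cumulative_accuracy(episodes: List[Episode], t_max: int) -> List[float]:
    """Fraction of episodes solved at or before each turn 1..t_max."""
    n = len(episodes)
    return [
        sum(1 for e in episodes if e.solved_at is not None and e.solved_at <= t) / n
        for t in range(1, t_max + 1)
    ]


# Toy stand-ins for the models, used only to exercise the loop.
def toy_student(problem: str, history: History) -> str:
    return str(len(history) + 2)  # guesses 2, then 3, then 4, ...


def toy_teacher(problem: str, solution: str, answer: str, history: History) -> str:
    # Gives a directional hint without revealing the final answer.
    return "Too low." if int(answer) < int(solution) else "Too high."


def exact_match(answer: str, solution: str) -> bool:
    return answer.strip() == solution.strip()


if __name__ == "__main__":
    solved = rollout("2 + 2", "4", toy_student, toy_teacher, exact_match, t_max=4)
    failed = rollout("2 + 2", "4", toy_student, toy_teacher, exact_match, t_max=2)
    print(solved.reward, solved.solved_at)
    print(cumulative_accuracy([solved, failed], t_max=4))
```

In an actual training setup, the reward stored on each episode would feed an RL update on the student's turns; that step is omitted here because the captions above do not describe the optimizer.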