Learning to Learn from Language Feedback with Social Meta-Learning

Jonathan Cook; Diego Antognini; Martin Klissarov; Claudiu Musat; Edward Grefenstette

TL;DR

This work presents Social Meta-Learning (SML), a finetuning framework that trains LLMs to learn from language feedback by turning static problems into interactive pedagogical dialogues between a student and a teacher with private knowledge. It compares offline supervised finetuning and online reinforcement learning (GRPO), showing online RL yields stronger gains, cross-domain transfer from math to coding, and improved handling of ambiguity when combined with a Q-Priming stage that encourages information-seeking questions. The approach demonstrates that learning from language feedback can generalize across domains and dialogue lengths, and that explicit nurturing of questioning behaviour further enhances adaptability. Overall, SML provides a scalable path toward more collaborative, human-aligned AI systems that can learn how to learn from conversational feedback.

Abstract

Large language models (LLMs) often struggle to learn from corrective feedback within a conversational context. They are rarely proactive in soliciting this feedback, even when faced with ambiguity, which can make their dialogues feel static, one-sided, and lacking the adaptive qualities of human conversation. To address these limitations, we draw inspiration from social meta-learning (SML) in humans: the process of learning how to learn from others. We formulate SML as a finetuning methodology, training LLMs to solicit and learn from language feedback in simulated pedagogical dialogues, where static tasks are converted into interactive social learning problems. SML effectively teaches models to use conversation to solve problems they are unable to solve in a single turn. This capability generalises across domains; SML on math problems produces models that better use feedback to solve coding problems and vice versa. Furthermore, despite being trained only on fully-specified problems, these models are better able to solve underspecified tasks where critical information is revealed over multiple turns. When faced with this ambiguity, SML-trained models make fewer premature answer attempts and are more likely to ask for the information they need. This work presents a scalable approach to developing AI systems that effectively learn from language feedback.
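To make the idea of converting a static task into an interactive social learning problem more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a task is a question paired with a reference answer that only the teacher can see, and it stands in for both LLMs with toy rule-based functions. All names here (`Task`, `run_dialogue`, `toy_student`, `toy_teacher`) are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# One dialogue turn: (student attempt, teacher feedback).
Turn = Tuple[str, str]


@dataclass
class Task:
    question: str  # visible to both student and teacher
    answer: str    # private knowledge, visible to the teacher only


def run_dialogue(
    task: Task,
    student: Callable[[str, List[Turn]], str],
    teacher: Callable[[Task, str], str],
    max_turns: int = 3,
) -> Tuple[bool, int, List[Turn]]:
    """Roll out a student-teacher dialogue on a single static task.

    Returns (solved, turns_used, history). The student only ever sees the
    question and the feedback history, never the reference answer.
    """
    history: List[Turn] = []
    for turn in range(1, max_turns + 1):
        attempt = student(task.question, history)
        if attempt.strip() == task.answer:
            return True, turn, history
        # The teacher uses its private answer to give feedback
        # without stating the answer itself.
        feedback = teacher(task, attempt)
        history.append((attempt, feedback))
    return False, max_turns, history


def toy_student(question: str, history: List[Turn]) -> str:
    """Stand-in for an LLM student: adjusts its last guess using feedback."""
    if not history:
        return "10"  # arbitrary first guess
    last_attempt, feedback = history[-1]
    step = 1 if "higher" in feedback else -1
    return str(int(last_attempt) + step)


def toy_teacher(task: Task, attempt: str) -> str:
    """Stand-in for an LLM teacher: hints at the direction of the error."""
    return "Try higher." if int(attempt) < int(task.answer) else "Try lower."


if __name__ == "__main__":
    task = Task(question="What is 7 + 5?", answer="12")
    solved, turns, history = run_dialogue(task, toy_student, toy_teacher, max_turns=5)
    print(solved, turns, history)
```

In a training setup along the lines described above, the outcome of such a rollout (for example, whether the task was solved within the turn budget) could serve as the learning signal for the student; how the paper actually defines that signal is not specified in this excerpt.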

Paper Structure (30 sections, 4 equations, 8 figures, 1 table)

Introduction
Preliminaries
Offline Reinforcement Learning.
Online Reinforcement Learning.
Related Work
Multi-Turn Interaction with LLMs
Learning from Language Feedback
Social Reinforcement Learning
Social Meta-Learning for Language Models
Problem Formulation
Learning from Language Feedback (Inner Loop)
Meta-Training (Outer Loop)
Offline RL.
Online RL.
Learning to Enquire with Q-Priming
...and 15 more sections

Figures (8)

Figure 1: Comparing different multi-turn and single-turn finetuning strategies on Omni-MATH.
Figure 2: Evaluating the impact of using stronger models as teachers on Omni-MATH. In the right plot, the same teacher (Gemma-3-12B-IT) is used for training each student at test time.
Figure 3: Average loss on the correct answer across conversational turns for Omni-MATH.
Figure 4: Evaluating transfer of the ability to learn from language feedback between math and code domains. Left: Training on Omni-MATH and evaluating on LiveCodeBench; Right: Training on OpenCodeInstruct and evaluating on Omni-MATH.
Figure 5: Evaluating the impact of Q-priming on multi-turn performance (left) and the rate of question asking per conversation (right) for Omni-MATH.
...and 3 more figures
