In-Context Principle Learning from Mistakes
Tianjun Zhang, Aman Madaan, Luyu Gao, Steven Zheng, Swaroop Mishra, Yiming Yang, Niket Tandon, Uri Alon
TL;DR
LEAP (Learning Principles from Mistakes) is a prompting framework that first induces the model to make mistakes on a small set of few-shot examples, then extracts explicit low- and high-level principles from those mistakes, and finally uses the principles to improve inference on unseen questions without requiring any additional labeled inputs. Across diverse reasoning benchmarks (DROP, HotpotQA, GSM8K, MATH, and BBH) and multiple models (e.g., GPT-3.5-turbo, GPT-4, Claude-2.1, Gemini Pro), LEAP improves over standard few-shot chain-of-thought (CoT) prompting, with notable gains in textual and mathematical reasoning and on BBH. The approach is data-efficient: it requires exactly the same number of labeled examples as conventional few-shot prompting, and it shows that learning from mistakes can substantially improve how LLMs reason. However, open-source models may benefit less from LEAP, which suggests that the method depends on instruction-following and reflection capabilities. The work situates LEAP within a broader landscape of prompting and feedback-based methods and highlights its potential to bring human-like learning from mistakes to AI systems at test time.
Abstract
In-context learning (ICL, also known as few-shot prompting) has been the standard method of adapting LLMs to downstream tasks, by learning from a few input-output examples. Nonetheless, all ICL-based approaches only learn from correct input-output pairs. In this paper, we revisit this paradigm by learning more from the few given input-output examples. We introduce Learning Principles (LEAP): First, we intentionally induce the model to make mistakes on these few examples; then we reflect on these mistakes, and learn explicit task-specific "principles" from them, which help solve similar problems and avoid common mistakes; finally, we prompt the model to answer unseen test questions using the original few-shot examples and these learned general principles. We evaluate LEAP on a wide range of benchmarks, including multi-hop question answering (HotpotQA), textual QA (DROP), Big-Bench Hard reasoning, and math problems (GSM8K and MATH); in all these benchmarks, LEAP improves the strongest available LLMs such as GPT-3.5-turbo, GPT-4, GPT-4 turbo and Claude-2.1. For example, LEAP improves over the standard few-shot prompting using GPT-4 by 7.5% in DROP, and by 3.3% in HotpotQA. Importantly, LEAP does not require any more input or examples than the standard few-shot prompting settings.
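The three steps described in the abstract can be sketched in code. The following is a minimal illustration, not the authors' implementation: the `LLM` callable, the prompt wording, the `final_answer` convention (last non-empty line is the answer), and the `samples` parameter are all assumptions made for the example. It also keeps a single flat list of principles rather than separating low- and high-level ones.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface: any function mapping a prompt string to a completion.
LLM = Callable[[str], str]


@dataclass
class Example:
    question: str
    answer: str


def final_answer(completion: str) -> str:
    """Assumed convention: the last non-empty line holds the final answer."""
    lines = [ln.strip() for ln in completion.splitlines() if ln.strip()]
    return lines[-1] if lines else ""


def learn_principles(llm: LLM, examples: List[Example], samples: int = 3) -> List[str]:
    """Steps 1 and 2: collect mistakes on the few-shot examples, then reflect."""
    principles: List[str] = []
    for ex in examples:
        for _ in range(samples):
            # Step 1: attempt the question without seeing the gold answer.
            attempt = llm(f"Q: {ex.question}\nThink step by step.\nA:")
            if final_answer(attempt) == ex.answer:
                continue  # a correct attempt yields no mistake to learn from
            # Step 2: ask the model to turn the mistake into a principle.
            principle = llm(
                "Reflect on the mistake below and state one general principle "
                "that would avoid it.\n"
                f"Question: {ex.question}\n"
                f"Incorrect attempt: {attempt}\n"
                f"Correct answer: {ex.answer}\n"
                "Principle:"
            ).strip()
            if principle and principle not in principles:
                principles.append(principle)
    return principles


def build_prompt(examples: List[Example], principles: List[str], question: str) -> str:
    """Step 3: original few-shot examples plus learned principles, then the test question."""
    parts: List[str] = []
    if principles:
        parts.append("Principles:\n" + "\n".join(f"- {p}" for p in principles))
    parts.extend(f"Q: {ex.question}\nA: {ex.answer}" for ex in examples)
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)
```

In use, one would call `learn_principles` once per task with the same few-shot examples that standard prompting already uses, then pass the result to `build_prompt` for every test question. No labeled data beyond those examples is needed, which matches the data-efficiency claim above.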
