In-Context Principle Learning from Mistakes

Tianjun Zhang, Aman Madaan, Luyu Gao, Steven Zheng, Swaroop Mishra, Yiming Yang, Niket Tandon, Uri Alon

TL;DR

LEAP (Learning Principles) is a prompting framework that first induces model mistakes on a small set of few-shot examples, then extracts explicit low- and high-level principles from these mistakes, and finally uses these principles to improve inference on unseen questions without requiring any additional inputs. Across diverse reasoning tasks (DROP, HotpotQA, GSM8K, MATH, and BBH) and multiple models (e.g., GPT-3.5-turbo, GPT-4, Claude-2.1, Gemini Pro), LEAP consistently improves over standard few-shot CoT, with notable gains in textual and mathematical reasoning and robust BBH performance. The approach is data-efficient: it requires exactly the same number of labeled examples as conventional few-shot prompting, and it shows that learning from mistakes can substantially improve how LLMs reason. However, open-source models may benefit less from LEAP, which suggests that the method depends on instruction-following and reflection capabilities. The work situates LEAP within a broader landscape of prompting and feedback-based methods, highlighting its potential to bring human-like learning from mistakes to AI systems at test time.

Abstract

In-context learning (ICL, also known as few-shot prompting) has been the standard method of adapting LLMs to downstream tasks, by learning from a few input-output examples. Nonetheless, all ICL-based approaches only learn from correct input-output pairs. In this paper, we revisit this paradigm, by learning more from the few given input-output examples. We introduce Learning Principles (LEAP): First, we intentionally induce the model to make mistakes on these few examples; then we reflect on these mistakes, and learn explicit task-specific "principles" from them, which help solve similar problems and avoid common mistakes; finally, we prompt the model to answer unseen test questions using the original few-shot examples and these learned general principles. We evaluate LEAP on a wide range of benchmarks, including multi-hop question answering (Hotpot QA), textual QA (DROP), Big-Bench Hard reasoning, and math problems (GSM8K and MATH); in all these benchmarks, LEAP improves the strongest available LLMs such as GPT-3.5-turbo, GPT-4, GPT-4 turbo and Claude-2.1. For example, LEAP improves over the standard few-shot prompting using GPT-4 by 7.5% in DROP, and by 3.3% in HotpotQA. Importantly, LEAP does not require any more input or examples than the standard few-shot prompting settings.
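The three steps described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `llm` callable (taking a prompt and a temperature), the prompt wording, the `extract_answer` convention, and the default sampling settings are all assumptions made for the example.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]          # (question, correct answer)
LLM = Callable[[str, float], str]  # hypothetical interface: (prompt, temperature) -> text


def extract_answer(response: str) -> str:
    """Assumed convention: the final answer is the last non-empty line."""
    lines = [ln.strip() for ln in response.splitlines() if ln.strip()]
    return lines[-1] if lines else ""


def learn_principles(llm: LLM, examples: List[Example],
                     samples_per_example: int = 3,
                     temperature: float = 0.7) -> List[str]:
    """Steps (a) and (b): induce mistakes, then reflect on them."""
    principles: List[str] = []
    for question, answer in examples:
        for _ in range(samples_per_example):
            # (a) Zero-shot chain-of-thought, sampled at non-zero temperature.
            attempt = llm(f"Q: {question}\nLet's think step by step.", temperature)
            if extract_answer(attempt) == answer:
                continue  # only mistaken attempts are kept
            # (b) Show the mistaken reasoning next to the correct answer
            # and ask for an explicit principle.
            reflection_prompt = (
                f"Question: {question}\n"
                f"Mistaken reasoning: {attempt}\n"
                f"Correct answer: {answer}\n"
                "Explain the mistake and state a general principle for avoiding it."
            )
            principles.append(llm(reflection_prompt, 0.0).strip())
    return principles


def build_leap_prompt(examples: List[Example], principles: List[str],
                      test_question: str) -> str:
    """Step (c): original few-shot examples plus the learned principles."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    rules = "\n".join(f"- {p}" for p in principles)
    return f"Principles:\n{rules}\n\n{shots}\n\nQ: {test_question}\nA:"
```

Note that `learn_principles` would run once per task, while `build_leap_prompt` is reused for every test question, so the number of labeled examples stays the same as in standard few-shot prompting.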

Paper Structure (30 sections, 10 figures, 9 tables, 1 algorithm)

Figures (10)

  • Figure 1: Examples of principles learned using LEAP, with the key idea of each principle highlighted.
  • Figure 2: An illustration of LEAP: Given a few input-output examples, Chain-of-Thought (left) generates a response to the test question by directly learning from the (correct) examples. In contrast, Learning Principles (LEAP, right) first (a) generates a mistaken zero-shot Chain-of-Thought response for each given input-output example by sampling with a non-zero temperature; (b) generates explicit principles by providing the LLM with the mistaken CoT along with the correct output; and finally (c) generates a response to the test question, by providing the LLM with both the given input-output examples and the learned principles. Note that steps (a) and (b) are performed once per task.
  • Figure 3: LEAP prompt to help the LLM evaluate its own generated reasoning and answers, contrasting them with the correct reasoning and answers. The LLM is prompted to identify errors in its reasoning and extract key insights for improvement. This figure specifically represents the 'GenerateExplanation' step in the LEAP algorithm (Algorithm 1).
  • Figure 4: Accuracy on BBH tasks across gpt-3.5-turbo-0613, gpt-4-0613, and gemini-pro. The figure is a scatter plot in which the y-axis shows the scores achieved with LEAP and the x-axis shows the baseline CoT scores. Each task is a point on the plot, with a different marker shape for each model. Points above the $y=x$ line are tasks where LEAP improves performance. The full BBH results table in the paper gives detailed results for all 27 Big-Bench Hard tasks. We find that in 37 out of 42 combinations of task and LLM, one of LEAP$_\textsc{low-level}$ or LEAP$_\textsc{high-level}$ outperforms the baseline few-shot CoT.
  • Figure 5: Examples from the Boolean Expressions (left) and Object Counting (right) tasks from BBH. The learned principle is highlighted in yellow, the mistaken step of the baseline is highlighted in red, and the correct use of the principle by LEAP is highlighted in green. This shows clearly how the learned principles guide LEAP in generating a better answer.
  • ...and 5 more figures
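The comparison summarized in the Figure 4 caption (a task/model combination counts as an improvement when it lies above the $y=x$ line, i.e. when either LEAP variant beats few-shot CoT) can be expressed as a short check. The sketch below is illustrative only: the dictionary layout, the key names, and the scores are placeholders invented for the example and are not results from the paper.

```python
from typing import Dict, Tuple

# Hypothetical layout: (task, model) -> accuracy per method.
Scores = Dict[Tuple[str, str], Dict[str, float]]


def count_leap_wins(scores: Scores) -> Tuple[int, int]:
    """Count combinations where the better LEAP variant strictly beats CoT."""
    wins = sum(
        1 for s in scores.values()
        if max(s["leap_low"], s["leap_high"]) > s["cot"]
    )
    return wins, len(scores)


# Placeholder numbers, NOT taken from the paper.
placeholder: Scores = {
    ("task_a", "model_x"): {"cot": 70.0, "leap_low": 72.0, "leap_high": 69.0},
    ("task_b", "model_x"): {"cot": 80.0, "leap_low": 78.0, "leap_high": 80.0},
}

wins, total = count_leap_wins(placeholder)
print(f"{wins} of {total} combinations improve over few-shot CoT")
```

A tie with the baseline is not counted as a win here, which matches a strict reading of "outperforms".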