Table of Contents
Fetching ...

Not the Example, but the Process: How Self-Generated Examples Enhance LLM Reasoning

Daehoon Gwak, Minseo Jung, Junwoo Park, Minho Park, ChaeHun Park, Junha Hyung, Jaegul Choo

TL;DR

Not the Example, but the Process investigates why self-generated examples improve LLM reasoning, hypothesizing that the problem-creation process, not the final examples themselves, drives gains. It systematically compares Zero-shot, Integrated, and Decoupled prompting across five architectures on MATH and GSM8K, with attention analyses to reveal internal mechanisms. The main finding is that Integrated prompting, which couples problem generation and solving, yields the best performance, while Decoupled offers only marginal gains. The results provide design guidance for prompting strategies in complex reasoning tasks and suggest that fostering the problem-creation process can reduce reliance on manual exemplars.

Abstract

Recent studies have shown that Large Language Models (LLMs) can improve their reasoning performance through self-generated few-shot examples, achieving results comparable to manually curated in-context examples. However, the underlying mechanism behind these gains remains unclear, making it hard to decide when and how to apply the technique effectively. In this work, we argue that the key benefit arises not from the generated examples themselves but from the act of creating them. To validate this, on reasoning-intensive tasks across diverse LLM architectures, we systematically evaluate three prompting strategies for in-context learning: (1) Zero-shot prompting; (2) Integrated prompting, where LLMs create and solve problems within a single, unified prompt; and (3) Decoupled prompting, where self-generated examples are reused as in-context examples, but the context of their creation itself is excluded. We conduct experiments across five widely used model architectures, demonstrating that Integrated prompting consistently outperforms both Zero-shot and Decoupled prompting. In contrast, Decoupled prompting offers only marginal gains over Zero-shot. Further, for a more in-depth analysis, we conduct an attention analysis and observe significant differences in attention patterns between Integrated and Decoupled prompting. These findings suggest that the advantage of self-generation prompting comes from the process of problem creation, not the examples themselves, providing valuable insights for designing more effective prompting strategies.

Not the Example, but the Process: How Self-Generated Examples Enhance LLM Reasoning

TL;DR

Not the Example, but the Process investigates why self-generated examples improve LLM reasoning, hypothesizing that the problem-creation process, not the final examples themselves, drives gains. It systematically compares Zero-shot, Integrated, and Decoupled prompting across five architectures on MATH and GSM8K, with attention analyses to reveal internal mechanisms. The main finding is that Integrated prompting, which couples problem generation and solving, yields the best performance, while Decoupled offers only marginal gains. The results provide design guidance for prompting strategies in complex reasoning tasks and suggest that fostering the problem-creation process can reduce reliance on manual exemplars.

Abstract

Recent studies have shown that Large Language Models (LLMs) can improve their reasoning performance through self-generated few-shot examples, achieving results comparable to manually curated in-context examples. However, the underlying mechanism behind these gains remains unclear, making it hard to decide when and how to apply the technique effectively. In this work, we argue that the key benefit arises not from the generated examples themselves but from the act of creating them. To validate this, on reasoning-intensive tasks across diverse LLM architectures, we systematically evaluate three prompting strategies for in-context learning: (1) Zero-shot prompting; (2) Integrated prompting, where LLMs create and solve problems within a single, unified prompt; and (3) Decoupled prompting, where self-generated examples are reused as in-context examples, but the context of their creation itself is excluded. We conduct experiments across five widely used model architectures, demonstrating that Integrated prompting consistently outperforms both Zero-shot and Decoupled prompting. In contrast, Decoupled prompting offers only marginal gains over Zero-shot. Further, for a more in-depth analysis, we conduct an attention analysis and observe significant differences in attention patterns between Integrated and Decoupled prompting. These findings suggest that the advantage of self-generation prompting comes from the process of problem creation, not the examples themselves, providing valuable insights for designing more effective prompting strategies.
Paper Structure (43 sections, 7 figures, 5 tables)

This paper contains 43 sections, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Overview of the prompting strategies evaluated in this work. Each strategy is depicted under the same test question to illustrate how the interaction between the user and assistant differs. (a) Zero-shot involves no context; (b) Integrated combines problem creation and solving in a single prompt; (c) Decoupled leverages self-generated examples as in-context exemplars without including the creation process in the conversational history.
  • Figure 2: Distribution of attention during test-question solving. The upper histogram shows attention to test-question tokens, with Decoupled devoting significantly more attention. The lower histogram shows attention to self-generated example tokens, with Integrated exhibiting significantly higher attention ($p\!<\!10^{-10}$, paired t-test).
  • Figure 3: Layer-wise average attention to self-generated examples. The x-axis represents layer indices, and the y-axis shows the average attention that tokens generated during test-question solving assign to self-generated example tokens. On LLaMA 3.1 8B Instruct, Integrated exhibits higher attention scores than Decoupled in the lower layers (0–12) and upper layers (23–31) among the 32 layers (0–31).
  • Figure 4: Performance across problem categories and difficulty levels on MATH dataset. (Top) Accuracy comparison between Zero-shot prompting and Integrated prompting across different problem categories. (Bottom) The same comparison across difficulty levels.
  • Figure 5: Prompt examples for Zero-shot, and Integrated prompting. Each example illustrates the input-output structure of prompting.
  • ...and 2 more figures