From Few to Many: Self-Improving Many-Shot Reasoners Through Iterative Optimization and Generation

Xingchen Wan, Han Zhou, Ruoxi Sun, Hootan Nakhost, Ke Jiang, Sercan Ö. Arık

TL;DR

The paper investigates why scaling in-context demonstrations helps, proposing that a small set of influential examples largely drives gains and that these can be amplified by regenerating reasoning paths. It introduces BRIDGE, a two-stage, iterative algorithm that uses Bayesian optimization to select an optimal demonstration subset (optimize) and then regenerates new examples from that subset to expand the reasoning paths (generate). Across multiple long-context LLMs and diverse tasks, BRIDGE yields consistent improvements over reinforced ICL and outperforms baselines, with findings that the optimal number of demonstrations varies by task and that regeneration can push performance beyond naive scaling. The work offers a practical, model-agnostic approach to bridge few- and many-shot ICL, with potential for transferability and cost-efficiency in real-world applications.

Abstract

Recent advances in long-context large language models (LLMs) have led to the emerging paradigm of many-shot in-context learning (ICL), where it is observed that scaling to many more demonstration examples in the context, beyond the conventional few-shot setup, can lead to performance benefits. However, despite its promise, it is unclear what aspects dominate the benefits and whether simply scaling to more examples is the most effective way of improving many-shot ICL. In this work, we first provide an analysis of the factors driving many-shot ICL, and we find that 1) many-shot performance can often still be attributed to a few disproportionately influential examples and 2) identifying such influential examples ("optimize") and using them as demonstrations to regenerate new examples ("generate") can lead to further improvements. Inspired by the findings, we propose BRIDGE, an algorithm that alternates between the optimize step, which uses Bayesian optimization to discover the influential sets of examples, and the generate step, which reuses this set to automatically expand the reasoning paths of the examples back to the many-shot regime. On Gemini, Claude, and Mistral LLMs of different sizes, we show that BRIDGE leads to significant improvements across a diverse set of tasks, including symbolic reasoning, numerical reasoning, and code generation.
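The alternation between "optimize" and "generate" described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `generate_pool`, `optimize_subset`, `bridge_sketch`, `toy_answer`, and `toy_score` are hypothetical, the toy functions replace real LLM calls and validation scoring, and plain random search is swapped in for the Bayesian optimizer. It is not the authors' implementation.

```python
import random
from typing import Callable, List, Sequence, Tuple

Example = Tuple[int, str]  # (input, model response); toy types for this sketch


def generate_pool(inputs: Sequence[int], labels: Sequence[str],
                  demos: List[Example],
                  answer_fn: Callable[[int, List[Example]], str]) -> List[Example]:
    """Generate step: answer every input with `demos` in context and
    keep only the responses that match the label."""
    pool = []
    for x, y in zip(inputs, labels):
        response = answer_fn(x, demos)
        if response == y:
            pool.append((x, response))
    return pool


def optimize_subset(pool: List[Example],
                    score_fn: Callable[[List[Example]], float],
                    budget: int, rng: random.Random) -> Tuple[List[Example], float]:
    """Optimize step. ASSUMPTION: random search stands in for Bayesian
    optimization; it only shows where a subset optimizer plugs in."""
    best_subset, best_score = list(pool), score_fn(pool)
    for _ in range(budget):
        mask = [rng.random() < 0.5 for _ in pool]
        subset = [e for e, keep in zip(pool, mask) if keep]
        score = score_fn(subset)
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score


def bridge_sketch(inputs, labels, answer_fn, score_fn,
                  rounds: int = 2, budget: int = 16, seed: int = 0):
    """Alternate generate and optimize for a fixed number of rounds."""
    rng = random.Random(seed)
    demos: List[Example] = []  # round 0 starts zero-shot
    history = []
    for _ in range(rounds):
        pool = generate_pool(inputs, labels, demos, answer_fn)
        demos, score = optimize_subset(pool, score_fn, budget, rng)
        history.append((len(pool), len(demos), score))
    return demos, history


# Toy stand-ins (hypothetical): the "model" doubles even inputs zero-shot
# and doubles every input once it has at least one demonstration.
def toy_answer(x: int, demos: List[Example]) -> str:
    return str(2 * x) if (x % 2 == 0 or demos) else "?"


# Toy validation metric (hypothetical): prefers subsets of exactly two examples.
def toy_score(subset: List[Example]) -> float:
    return -abs(len(subset) - 2)


inputs = list(range(6))
labels = [str(2 * x) for x in inputs]
demos, history = bridge_sketch(inputs, labels, toy_answer, toy_score)
print(history)  # one (pool size, selected size, score) tuple per round
```

In this toy run the example pool grows between rounds because the selected demonstrations let the stand-in model answer inputs it failed on zero-shot, which mirrors the "expand back to the many-shot regime" idea in spirit only.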

Paper Structure

This paper contains 19 sections, 5 equations, 6 figures, 16 tables, 3 algorithms.

Figures (6)

  • Figure 1: It does not always take "many shots" to achieve many-shot performance: with judicious selection, it is possible to match or exceed the many-shot performance achieved by using all available examples with far fewer examples. Accuracy on held-out splits against the number of examples on 3 BBH tasks for 1) the overall trendline (fitted with locally weighted smoothing (lowess)), 2) using the top-K most positive examples, or 3) using the bottom-K least positive examples, based on the ranking of the importance score described in Sec. \ref{sec:analysis}. Dotted lines refer to two many-shot baselines: Reinforced ICL, which uses the input, model-generated reasoning, and output of all correctly predicted inputs; and All examples, which uses all available input-output pairs from the train set. Lines and error bars show mean $\pm$ standard deviation across 3 runs, with the ordering of the examples shuffled in each trial.
  • Figure 2: Good demonstrations lead to better re-generated examples: trendlines between accuracy and number of examples. Note that the examples re-generated by using the top-5 example sets as demonstrations outperform the original examples (gray line) at all parts of the curve.
  • Figure 3: Overview of BRIDGE: With a labeled dataset $\mathcal{D}$, exemplified with 6 samples, at the Generate phase (left half), we generate initial examples by performing LLM inference on the inputs of $\mathcal{D}$ ("Q1-6") with zero-shot prompting to obtain the initial responses "A1-6", which include any intermediate outputs critical for ICL (Step 1). At Step 2, consistent with reinforced ICL (Agarwal et al., 2024), we filter the responses to retain the subset of $\mathcal{D}$ where the LLM predicted correctly, so that the examples include correct reasoning steps, to build $\mathcal{E}_k$, the pool of examples at round $k$, which forms the search space for the subsequent Optimize step. At the Optimize step (right half), we initialize the proposed Bayesian optimizer by randomly sampling subsets $\mathbf{e}^{(0)} \subseteq \mathcal{E}_k$ as demonstrations to be evaluated (Step 3) on a held-out validation dataset ($\mathcal{D}$ can be reused for this purpose) to obtain a performance metric (Step 4). The Bayesian optimizer (BO) is then updated with the binary vector representations of $\mathbf{e}$ that led to this validation performance as input and the metric itself as output, and suggests a new subset of examples to be used as demonstrations for the next step (Step 5). Steps 4-5 are repeated (inner loop) until the BO budget is exhausted, after which the best evaluated set $\mathbf{e}^*_k$ is returned (Step 6). This set is then used as demonstrations to generate the example pool for the next round, $\mathcal{E}_{k+1}$ (Step 7).
  • Figure 4: Benefits from scaling examples naïvely (red lines) are very task-specific, but each iteration of BRIDGE addresses this to a considerable degree by continually improving upon the previous round: We randomly sample subsets of the example pool $\mathcal{E}_k \, \forall \, k \in \{0 \text{\,(i.e., original examples generated with handcrafted few-shot or zero-shot prompting)}, 1, 2\}$ and evaluate them on a held-out set in four representative tasks exhibiting different model behavior under example scaling. The trendlines are moving regressions fitted with lowess. Refer to additional figures in App. \ref{app:additional_visualization}.
  • Figure 5: Additional visualization of the task performance at different rounds. Note that in most datasets, additional rounds of BRIDGE lead to performance improvements, and some of the exceptions (e.g., multi_arithmetric_two) are possibly caused by visualization artifacts of the extremely small performance variation, as shown by the small y-axis ranges.
  • ...and 1 more figures
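
The Figure 3 caption mentions that the optimizer is updated with binary vector representations of example subsets paired with a validation metric. A minimal sketch of that bookkeeping is shown here. The helper names (`encode`, `decode`, `marginal_effects`) and the observation values are made up for illustration, and the per-example "marginal effect" is a simple averaging heuristic assumed for this sketch, not the importance score or surrogate model used in the paper.

```python
from typing import List, Sequence, Tuple

Observation = Tuple[List[int], float]  # (binary mask over the pool, validation metric)


def encode(subset_ids: Sequence[int], pool_size: int) -> List[int]:
    """Represent a subset of pool indices as a 0/1 vector of length pool_size."""
    chosen = set(subset_ids)
    return [1 if i in chosen else 0 for i in range(pool_size)]


def decode(mask: Sequence[int]) -> List[int]:
    """Recover the selected pool indices from a 0/1 vector."""
    return [i for i, bit in enumerate(mask) if bit]


def marginal_effects(observations: Sequence[Observation],
                     pool_size: int) -> List[float]:
    """ASSUMED heuristic: mean metric when example i is included minus
    mean metric when it is excluded (0.0 if either side is unobserved)."""
    effects = []
    for i in range(pool_size):
        with_i = [s for m, s in observations if m[i] == 1]
        without_i = [s for m, s in observations if m[i] == 0]
        if with_i and without_i:
            effects.append(sum(with_i) / len(with_i)
                           - sum(without_i) / len(without_i))
        else:
            effects.append(0.0)
    return effects


# Made-up observations over a pool of 4 examples (illustrative values only).
observations = [
    (encode([0, 1], 4), 0.80),
    (encode([1, 2], 4), 0.60),
    (encode([0, 3], 4), 0.70),
]
effects = marginal_effects(observations, 4)
print(effects)
```

A fixed-length binary vector gives every candidate subset the same input dimensionality, which is what lets a single model over (mask, metric) pairs propose the next subset to evaluate.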