Mitigating Tail Narrowing in LLM Self-Improvement via Socratic-Guided Sampling

Yiwen Ding, Zhiheng Xi, Wei He, Zhuoyuan Li, Yitao Zhai, Xiaowei Shi, Xunliang Cai, Tao Gui, Qi Zhang, Xuanjing Huang

TL;DR

This work identifies tail narrowing as a core bottleneck in LLM self-improvement, where iterative training over self-generated data increasingly under-samples hard queries. It introduces Guided Self-Improvement (GSI), a set of Socratic-style guidance strategies—answer-driven, rationale-driven, interactive sampling, and state reset—to improve sampling efficiency and broaden coverage of difficult problems without prohibitive cost. Across four backbone models and six mathematical-reasoning tasks, GSI achieves better coverage and performance than standard self-improvement and brute-force rebalancing, with notable gains for larger models and PoT-oriented prompting. The results demonstrate enhanced generalization to held-out tasks and reveal practical trade-offs between strategy choice, model size, and sampling budget, offering a scalable path to more robust self-improving systems.

Abstract

Self-improvement methods enable large language models (LLMs) to generate solutions themselves and iteratively train on filtered, high-quality rationales. This process proves effective and reduces the reliance on human supervision in LLMs' reasoning, but performance soon plateaus. We examine the process and find that models tend to over-sample easy queries and under-sample queries they have yet to master. As iterations proceed, this sampling imbalance is exacerbated, leading to a long-tail distribution in which solutions to difficult queries nearly vanish. This phenomenon limits the performance gain of self-improving models. A straightforward solution is brute-force sampling to balance the distribution, which significantly raises computational costs. In this paper, we introduce Guided Self-Improvement (GSI), a strategy aimed at improving the efficiency of sampling challenging heavy-tailed data. It leverages Socratic-style guidance signals to help LLMs reason about complex queries, reducing the exploration effort and minimizing computational overhead. Experiments on four models across diverse mathematical tasks show that GSI strikes a balance between performance and efficiency, while also being effective on held-out tasks.
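The sampling imbalance described above can be illustrated with a toy calculation. The sketch below is not the paper's method. It assumes, purely for illustration, that each query has a fixed success probability `p` and that the `k` sampled solutions are independent; the function names and the example probabilities are hypothetical. Under those assumptions it shows how little of the filtered data comes from a hard query, and how many samples brute-force sampling would need to cover it.

```python
import math


def expected_correct(p: float, k: int) -> float:
    """Expected number of correct solutions kept after sampling k times.

    Assumes independent samples with per-query success probability p
    (a simplification for illustration, not taken from the paper).
    """
    return k * p


def coverage(p: float, k: int) -> float:
    """Probability that at least one of k independent samples is correct."""
    return 1.0 - (1.0 - p) ** k


def samples_needed(p: float, target: float) -> int:
    """Smallest k for which coverage(p, k) >= target, for 0 < p < 1."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))


def hard_share(success_probs: list[float], k: int, hard_below: float) -> float:
    """Fraction of the expected filtered data that comes from hard queries,
    where a query counts as hard if its success probability is below hard_below."""
    total = sum(expected_correct(p, k) for p in success_probs)
    hard = sum(expected_correct(p, k) for p in success_probs if p < hard_below)
    return hard / total


if __name__ == "__main__":
    # Hypothetical pool: two easy queries, one medium, one hard.
    probs = [0.9, 0.9, 0.5, 0.05]
    k = 4
    print("share of kept data from hard queries:", hard_share(probs, k, 0.2))
    print("chance the hard query is covered at k=4:", coverage(0.05, k))
    print("samples needed for 90% coverage of it:", samples_needed(0.05, 0.9))
```

In this toy pool the hard query makes up a quarter of the queries but only about 2% of the expected filtered data, and reaching 90% coverage of it by uniform brute-force sampling takes 45 samples. That is the cost the abstract refers to when it says brute-force balancing raises computation significantly.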

Paper Structure

This paper contains 40 sections, 6 equations, 16 figures, 4 tables.

Figures (16)

  • Figure 1: Illustration of the data distribution during the self-improvement sampling process. Top: The long-tail effect intensifies with iterative training on self-generated data. The low-probability data begins to diminish, leading to tail narrowing. Bottom: Guided sampling balances the distribution by improving tail data sampling efficiency.
  • Figure 2: Iterative performance during self-improvement. Experiments are conducted on GSM8K with varying sampling numbers $k$. Solid markers show the performance of vanilla self-improvement, with the solid line fitted to these points. The performance plateaus after a few iterations. Hollow markers represent the performance after supplementing tail data, with a dashed trend line. Supplementing tail data balances the distribution and alleviates performance bottlenecks.
  • Figure 3: Comparison of data distributions between the self-generated and original (SFT) datasets. (a) Difficulty distribution across five levels in MATH tasks, with level 1 representing the easiest and level 5 the most difficult. The self-generated data has a lower proportion of difficult problems. (b) The length distribution indicates that the self-generated data tends to be shorter than the original dataset. (c) The perplexity of each training sequence, measured with Llama3-8B, shows that the tails in the self-generated data are diminished.
  • Figure 4: Comparison of average performance on six math tasks using PoT. The percentage of improvement is significant in the held-in datasets.
  • Figure 5: Comparison of the number of rationale steps generated by the model relative to the number of steps used in the ground truth. The red box highlights the occurrence of skip steps in the rationale-driven strategy.
  • ...and 11 more figures