The Self-Improvement Paradox: Can Language Models Bootstrap Reasoning Capabilities without External Scaffolding?
Yutao Sun, Mingshuai Chen, Tiancheng Zhao, Ruochen Xu, Zilun Zhang, Jianwei Yin
TL;DR
Crescent is a fully autonomous framework for LLM self-improvement that generates high-quality, domain-specific QA data without external supervision. It combines bait prompting, rejection-sampling-based diversification, and majority-vote consensus to produce robust QA pairs, which are then used to fine-tune the model, yielding notable improvements in math reasoning while preserving general capabilities. Extensive experiments on GSM8K, ASDiv, and GSM-Plus-mini show strong 0-shot gains and competitive 5-shot performance, with ablations confirming the necessity of diversification and consensus. The study also demonstrates Crescent’s effectiveness for distillation to weaker models and highlights its advantages over prompting-based approaches, suggesting practical pathways toward scalable, self-contained model improvement. Limitations concern domain specificity and applicability to aligned models, pointing to future work on broader domains and baseline model types.
Abstract
Self-improvement of large language models (LLMs), i.e., improving the performance of an LLM by fine-tuning it on synthetic data that it generated itself, is a promising way to advance the capabilities of LLMs while avoiding extensive supervision. Existing approaches to self-improvement often rely on external supervision signals in the form of seed data and/or assistance from third-party models. This paper presents Crescent, a simple yet effective framework for generating high-quality synthetic question-answer data in a fully autonomous manner. Crescent first prompts the LLM to generate raw questions via a bait prompt, then diversifies these questions through rejection-sampling-based self-deduplication, and finally feeds the questions back to the LLM and collects the corresponding answers by majority voting. Our results shed light on the potential of true self-improvement with zero external supervision signals for math reasoning; in particular, Crescent-generated question-answer pairs suffice to (i) improve the reasoning capabilities of an LLM while preserving its general performance (especially in the 0-shot setting); and (ii) distil LLM knowledge to weaker models more effectively than existing methods based on seed-dataset augmentation.
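
The three stages described in the abstract (bait prompting, deduplication by rejection, and majority voting) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The `generate` callable stands in for a single sampled LLM completion, and the names `similarity`, `diversify`, `majority_vote`, and `build_qa_pairs`, the string-similarity measure, and both thresholds are hypothetical choices made for illustration only. A real pipeline would also need to extract a final answer from each sampled response before voting.

```python
from collections import Counter
from difflib import SequenceMatcher
from typing import Callable, List, Optional, Tuple


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] (an assumed stand-in metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def diversify(candidates: List[str], threshold: float = 0.8) -> List[str]:
    """Keep a question only if it is not too similar to any kept question.

    Candidates that are near-duplicates of an accepted question are rejected.
    The threshold value is an assumption, not taken from the paper.
    """
    kept: List[str] = []
    for question in candidates:
        if all(similarity(question, other) < threshold for other in kept):
            kept.append(question)
    return kept


def majority_vote(answers: List[str], min_share: float = 0.5) -> Optional[str]:
    """Return the most frequent answer if it wins a strict majority, else None."""
    if not answers:
        return None
    answer, count = Counter(answers).most_common(1)[0]
    return answer if count / len(answers) > min_share else None


def build_qa_pairs(
    generate: Callable[[str], str],
    bait_prompt: str,
    n_questions: int,
    n_votes: int,
) -> List[Tuple[str, str]]:
    """Run the three stages with one model acting as both asker and solver."""
    # Stage 1: sample raw questions from the bait prompt.
    raw_questions = [generate(bait_prompt) for _ in range(n_questions)]
    pairs: List[Tuple[str, str]] = []
    # Stage 2: drop near-duplicate questions.
    for question in diversify(raw_questions):
        # Stage 3: sample several answers and keep the consensus, if any.
        voted = majority_vote([generate(question) for _ in range(n_votes)])
        if voted is not None:
            pairs.append((question, voted))
    return pairs
```

In this sketch, questions without a strict-majority answer are discarded rather than kept with a low-confidence label; that is one possible design choice, and the resulting pairs would then serve as fine-tuning data for the same model or a weaker one.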
