ReGenesis: LLMs can Grow into Reasoning Generalists via Self-Improvement

Xiangyu Peng, Congying Xia, Xinyi Yang, Caiming Xiong, Chien-Sheng Wu, Chen Xing

TL;DR

ReGenesis tackles the challenge of enabling LLMs to generalize reasoning beyond their training tasks by self-generating diverse, abstract-to-concrete reasoning paths as post-training data. The framework comprises Guidance Adaption, Reasoning Structure Generation, and Reasoning Path Generation, with filtering via ground-truth or self-consistency and optional ground-truth hints to recover valid reasoning. Empirical results show substantial in-domain gains (average around 16.56%) and meaningful out-of-domain improvements (about 6.1%), outperforming STaR, LMSI, and GT-based baselines, and demonstrating robustness across models like Mistral-7B-Instruct-v0.3 and Meta-Llama-3-8B-Instruct. The work provides broad analysis of design choices, model effects, and guideline preferences, and argues that preserving task-agnostic reasoning signals within diverse final paths is key to generalization. Overall, ReGenesis advances the paradigm of self-improvement for reasoning generality, with practical implications for building more capable, broadly competent LLMs.

Abstract

Post-training Large Language Models (LLMs) with explicit reasoning trajectories can enhance their reasoning abilities. However, acquiring such high-quality trajectory data typically demands meticulous supervision from humans or superior models, which can be either expensive or license-constrained. In this paper, we explore how far an LLM can improve its reasoning by self-synthesizing reasoning paths as training data without any additional supervision. Existing self-synthesizing methods, such as STaR, suffer from poor generalization to out-of-domain (OOD) reasoning tasks. We hypothesize that this is because their self-synthesized reasoning paths are too task-specific and lack general, task-agnostic reasoning guidance. To address this, we propose Reasoning Generalist via Self-Improvement (ReGenesis), a method to self-synthesize reasoning paths as post-training data by progressing from abstract to concrete. More specifically, ReGenesis self-synthesizes reasoning paths by converting general reasoning guidelines into task-specific ones, generating reasoning structures, and subsequently transforming these structures into reasoning paths, without the need for the human-designed task-specific examples used in existing methods. We show that ReGenesis achieves superior performance on all in-domain and OOD settings tested compared to existing methods. For six OOD tasks specifically, while previous methods exhibited an average performance decrease of approximately 4.6% after post-training, ReGenesis delivers around 6.1% performance improvement. We also conduct in-depth analysis of our framework and show that ReGenesis is effective across various LLMs and design choices.

Paper Structure (47 sections, 4 figures, 14 tables)

Figures (4)

  • Figure 1: Overview of ReGenesis. We first create a set of reasoning guidelines and use language model $\mathbf{M}$ to tailor them more specifically to the task. The resulting adapted task-specific reasoning guidelines are fed into $\mathbf{M}$ again to develop a set of detailed reasoning structures. Then each reasoning structure is used to generate a potential reasoning path and predicted answer $(r^{path}_{i,j}, \hat{a}_{i,j})$. We filter out incorrect reasoning paths using either ground truth or majority vote answers. For instructions that lack correct reasoning paths after filtering, we provide the ground truth as a hint and repeat the reasoning path generation. Finally, the correct reasoning paths, along with the instructions and answers, are used to fine-tune $\mathbf{M}$, enhancing its reasoning capability.
  • Figure 2: Normalized percentage of successful guideline utilization for selective general reasoning guidelines on the NumGLUE dataset, comparing Mistral-7B-Instruct-v0.3 and Meta-Llama-3-8B-Instruct models.
  • Figure 3: Accuracy is assessed across various dataset combinations, using $3$ random seeds for each combination. The total training sample size is $7,000$.
  • Figure 4: Zero-shot accuracy results on test set of GSM8K and StrategyQA.
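
The filtering step in the Figure 1 caption (keep paths whose predicted answer matches the ground truth, or the majority-vote answer when no ground truth is available) can be sketched as follows. This is a minimal illustration under stated assumptions: the names `filter_paths` and `normalize`, the exact-match comparison after lowercasing and trimming, and the tie-breaking behavior are choices made here, not details taken from the paper.

```python
from collections import Counter
from typing import List, Optional, Tuple

Candidate = Tuple[str, str]  # (reasoning_path, predicted_answer)


def normalize(answer: str) -> str:
    """Assumed answer normalization: trim whitespace and lowercase."""
    return answer.strip().lower()


def filter_paths(
    candidates: List[Candidate], ground_truth: Optional[str] = None
) -> List[Candidate]:
    """Keep candidates whose answer matches the target answer.

    The target is the ground truth when given; otherwise it is the most
    common predicted answer (majority vote). Ties go to the answer that
    appears first among the candidates.
    """
    if not candidates:
        return []
    if ground_truth is not None:
        target = normalize(ground_truth)
    else:
        votes = Counter(normalize(answer) for _, answer in candidates)
        target = votes.most_common(1)[0][0]
    return [(path, answer) for path, answer in candidates if normalize(answer) == target]


candidates = [("path A", "4"), ("path B", "5"), ("path C", " 4 ")]
kept_with_gt = filter_paths(candidates, ground_truth="4")
kept_by_vote = filter_paths(candidates)
```

An empty result when a ground truth is supplied corresponds to the case in the caption where no correct path survives; that is the point at which the ground truth would be provided as a hint and path generation repeated.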