Evolving LLMs' Self-Refinement Capability via Synergistic Training-Inference Optimization

Yongcheng Zeng, Xinyu Cui, Xuanfa Jin, Qirui Mi, Guoqing Liu, Zexu Sun, Mengyue Yang, Dong Li, Weiyu Ma, Ning Yang, Jian Zhao, Jianye Hao, Haifeng Zhang, Jun Wang

TL;DR

The paper investigates whether large language models (LLMs) inherently possess a Self-Refinement capability and finds no clear evidence that they do: unless the capability is explicitly activated, refining a response can even lower its quality. It introduces EVOLVE, a framework that couples training (SFT and preference training) with inference-time Self-Refinement, so that each round both strengthens the refinement capability and produces higher-quality data for the next round of training. Through this iterative training and data collection, EVOLVE achieves strong performance on AlpacaEval 2 and Arena-Hard, outperforming GPT-4o in certain settings, and generalizes to GSM8K and MATH without domain-specific math data. The work also explores using Self-Refinement to drive Self-Improvement of intrinsic abilities, while noting noise and stability challenges in unsupervised scenarios. Overall, EVOLVE demonstrates a scalable path toward sustained refinement-driven improvement in LLMs, with practical implications for reasoning and decision-support tasks.

Abstract

Self-Refinement refers to a model's ability to revise its own responses to produce improved outputs. This capability can also serve as a fundamental mechanism for Self-Improvement, for example, by reconstructing datasets with refined results to enhance intrinsic model performance. However, our comprehensive experiments reveal that large language models (LLMs) show no clear evidence of inherent Self-Refinement and may even experience response quality degradation after Self-Refinement. To address this issue, we propose EVOLVE, a simple and effective framework for eliciting and tracking the evolution of Self-Refinement through iterative training. We first explore optimization methods during training to activate the model's Self-Refinement capability. Then, at inference, we investigate various generation strategies to further enhance and utilize Self-Refinement while supplying the necessary data for training. Through synergistic optimization of training and inference stages, we continually evolve the model's Self-Refinement ability, enabling it to better refine its own responses. Moreover, we demonstrate the potential of leveraging Self-Refinement to achieve broader Self-Improvement of intrinsic model abilities. Experiments show that the evolved Self-Refinement ability enables the Llama-3.1-8B base model to surpass GPT-4o, achieving 62.3% length-controlled and 63.3% raw win rates on AlpacaEval 2, and 50.3% on Arena-Hard. It also generalizes effectively to out-of-domain reasoning tasks, improving performance on mathematical reasoning benchmarks such as GSM8K and MATH.

Paper Structure

This paper contains 36 sections, 20 equations, 15 figures, 9 tables.

Figures (15)

  • Figure 1: Evaluation of Self-Refinement Capability Across Various Models. We use three refinement templates to minimize prompt bias. The x-axis denotes the inference iteration number. For each turn, responses are generated for 256 samples from the UltraFeedback Binarized test set (https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized), using the original prompt and the prior turn's output. These are then scored by the Skywork-Reward-Llama-3.1-8B-v0.2 reward model (https://huggingface.co/Skywork/Skywork-Reward-Llama-3.1-8B-v0.2) [liu2024skywork]. To reduce the effect of randomness, the reported values are the mean scores of three independent runs with different random seeds; higher scores indicate better quality. Templates are detailed in the appendix.
  • Figure 2: Our framework, EVOLVE, iteratively alternates between inference and training processes. In iteration $t$, model $M_t$ uses the Self-Refinement strategy to collect preference data, which is then used to enhance the model's intrinsic capabilities via preference training (see the paper's loss equation), yielding the next-iteration model $M_{t+1}$. The dataset is filtered through either a rule-based method or a reward model.
  • Figure 3: Ablation of training combinations. SFT activates Self-Refinement, PT enhances it, and their synergy (blue, ours) yields the best performance.
  • Figure 4: Illustration of four dynamic generation strategies.
  • Figure 5: Performance of four generation strategies. Chain of Self-Refinement achieves the best results across iterations.
  • ...and 10 more figures
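
The alternation between inference and training described in the Figure 2 caption can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: `generate`, `refine`, `score`, and `preference_train` are hypothetical callables supplied by the caller, `PreferencePair`, `collect_pairs`, and `evolve` are names invented for this sketch, and filtering by a reward-score margin is only one possible reading of "filtered through either a rule-based method or a reward model".

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class PreferencePair:
    """One training example: a preferred and a dispreferred response."""
    prompt: str
    chosen: str
    rejected: str


def collect_pairs(
    model: Any,
    prompts: List[str],
    generate: Callable[[Any, str], str],      # hypothetical: model, prompt -> draft
    refine: Callable[[Any, str, str], str],   # hypothetical: model, prompt, draft -> revision
    score: Callable[[str, str], float],       # hypothetical reward model: prompt, response -> score
    margin: float = 0.0,
) -> List[PreferencePair]:
    """Inference step: draft, self-refine, then keep pairs the scorer can separate."""
    pairs = []
    for prompt in prompts:
        draft = generate(model, prompt)
        revision = refine(model, prompt, draft)
        s_draft, s_rev = score(prompt, draft), score(prompt, revision)
        # Assumed filter: drop pairs whose scores differ by no more than `margin`.
        if abs(s_rev - s_draft) <= margin:
            continue
        if s_rev > s_draft:
            pairs.append(PreferencePair(prompt, revision, draft))
        else:
            pairs.append(PreferencePair(prompt, draft, revision))
    return pairs


def evolve(
    model: Any,
    prompts: List[str],
    generate: Callable[[Any, str], str],
    refine: Callable[[Any, str, str], str],
    score: Callable[[str, str], float],
    preference_train: Callable[[Any, List[PreferencePair]], Any],  # hypothetical trainer
    iterations: int,
) -> Any:
    """Alternate data collection with M_t and preference training to obtain M_{t+1}."""
    for _ in range(iterations):
        pairs = collect_pairs(model, prompts, generate, refine, score)
        model = preference_train(model, pairs)
    return model
```

Passing the four steps in as callables keeps the loop independent of any particular model, reward model, or preference objective, which the caption leaves to the paper's main text.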