Self-Evolution Fine-Tuning for Policy Optimization

Ruijun Chen, Jiehao Liang, Shiping Gao, Fanqi Wan, Xiaojun Quan

TL;DR

This paper addresses how to align large language models without large amounts of annotated data or unstable optimization. It introduces Self-Evolution Fine-Tuning (SEFT), which trains an adaptive reviser to improve low-quality responses and then uses the revised responses as pseudo-labels to fine-tune the policy. SEFT proceeds through internal and external evolution: the policy first improves within its own response space and then within a stronger external space, leveraging unlabeled data throughout. Experiments on Nectar, UltraFeedback, AlpacaEval 2.0, and MT-Bench show that SEFT outperforms SFT, DPO, and ORPO and benefits from additional unlabeled data.

Abstract

The alignment of large language models (LLMs) is crucial not only for unlocking their potential in specific tasks but also for ensuring that responses meet human expectations and adhere to safety and ethical principles. Current alignment methodologies face considerable challenges. For instance, supervised fine-tuning (SFT) requires extensive, high-quality annotated samples, while reinforcement learning from human feedback (RLHF) is complex and often unstable. In this paper, we introduce self-evolution fine-tuning (SEFT) for policy optimization, with the aim of eliminating the need for annotated samples while retaining the stability and efficiency of SFT. SEFT first trains an adaptive reviser to elevate low-quality responses while maintaining high-quality ones. The reviser then gradually guides the policy's optimization by fine-tuning it with enhanced responses. One of the prominent features of this method is its ability to leverage unlimited amounts of unannotated data for policy optimization through supervised fine-tuning. Our experiments on AlpacaEval 2.0 and MT-Bench demonstrate the effectiveness of SEFT. We also provide a comprehensive analysis of its advantages over existing alignment techniques.
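
The two-stage procedure described above (revise responses, then fine-tune on the revisions) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `build_pseudo_labels`, `seft`, and the `Generate`, `Revise`, and `FineTune` callables are all hypothetical names, and the sketch assumes that external evolution means revising responses produced by a stronger model for the same unlabeled prompts.

```python
from typing import Callable, List, Tuple

# All names below are hypothetical placeholders, not the authors' code.
Generate = Callable[[str], str]        # prompt -> response
Revise = Callable[[str, str], str]     # (prompt, response) -> revised response
FineTune = Callable[[Generate, List[Tuple[str, str]]], Generate]


def build_pseudo_labels(
    prompts: List[str], generate: Generate, revise: Revise
) -> List[Tuple[str, str]]:
    """Pair each unlabeled prompt with the reviser's version of a generated response."""
    return [(prompt, revise(prompt, generate(prompt))) for prompt in prompts]


def seft(
    prompts: List[str],
    policy: Generate,
    stronger_model: Generate,
    revise: Revise,
    fine_tune: FineTune,
) -> Generate:
    """Run internal evolution, then external evolution, and return the tuned policy."""
    # Internal evolution: revise what the policy itself produces.
    policy = fine_tune(policy, build_pseudo_labels(prompts, policy, revise))
    # External evolution (assumed): revise a stronger model's responses instead.
    policy = fine_tune(policy, build_pseudo_labels(prompts, stronger_model, revise))
    return policy
```

In practice `fine_tune` would wrap an ordinary supervised fine-tuning step on the (prompt, revised response) pairs; passing it in as a callable keeps the sketch independent of any particular training library.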

Paper Structure (29 sections, 4 equations, 8 figures, 8 tables, 1 algorithm)

Figures (8)

  • Figure 1: Reward scores of initial and revised responses on our Nectar test set [starling2023]. OpenChat-3.5-7B [wang2023openchat] is employed as the base model for training the reviser on Nectar's training set, and Starling-RM-7B-alpha [starling2023] is used for scoring each response. Each point represents the scores of an initial response (x-axis) and the revised response (y-axis). The red dashed line shows where each pair of scores is equal, and the green line shows the trend of score changes after revisions.
  • Figure 2: Overview of SEFT. The reviser takes prompts and initial responses of varying quality as input, assesses the difficulty of revising these responses, and assigns appropriate revision labels to generate overall high-quality responses. During policy optimization, the policy first undergoes internal evolution: the reviser revises the responses generated by the policy and uses them to fine-tune the policy. Then, the policy undergoes external evolution with a stronger model, progressively enhancing the quality of alignment data and guiding the policy toward generating better responses.
  • Figure 3: Illustrative training examples for the adaptive reviser. The objective of the adaptive reviser is to make revisions where feasible and avoid attempting those beyond its capabilities.
  • Figure 4: Performance comparison of the adaptive reviser and baseline methods on the Nectar test set. Due to space limitations, we present results only for responses ranked 0, 3, and 6.
  • Figure 5: Improvement rate of different revisers on the Nectar test set. The metric is defined as the proportion of revised instances showing enhanced quality at each rank. Note that instances left unchanged by the revisers are excluded from this calculation.
  • ...and 3 more figures
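
The improvement-rate metric in the Figure 5 caption can be computed along the following lines. This is a sketch under stated assumptions: the function name and record layout are hypothetical, and "enhanced quality" is taken to mean a strictly higher reward score after revision, which the caption itself does not specify.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical record layout: (rank, initial_score, revised_score, was_changed)
Record = Tuple[int, float, float, bool]


def improvement_rate(records: Iterable[Record]) -> Dict[int, float]:
    """Per-rank share of revised instances whose score went up.

    Instances the reviser left unchanged are skipped, as in the caption.
    Ranks with no revised instances are omitted from the result.
    """
    improved: Dict[int, int] = defaultdict(int)
    revised_total: Dict[int, int] = defaultdict(int)
    for rank, initial_score, revised_score, was_changed in records:
        if not was_changed:
            continue  # excluded from both numerator and denominator
        revised_total[rank] += 1
        if revised_score > initial_score:  # assumed meaning of "enhanced quality"
            improved[rank] += 1
    return {rank: improved[rank] / revised_total[rank] for rank in revised_total}
```

Excluding unchanged instances matters for an adaptive reviser: one that declines to edit already strong responses is then not penalized for revisions it never attempted.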