Teaching Language Models to Self-Improve by Learning from Language Feedback

Chi Hu, Yimin Hu, Hang Cao, Tong Xiao, Jingbo Zhu

TL;DR

Self-Refinement Tuning (SRT) is a two-stage framework that aligns language models using language feedback: first from a stronger critic model, then from the model's own self-generated feedback. The first stage fine-tunes a base model on the critic's critiques and refinements so that it learns to self-improve; the second stage scales this up using self-generated feedback and Direct Preference Optimization (DPO). Across open-ended tasks, reasoning, and multilingual QA, SRT yields consistent gains across model sizes. The 70B stage-2 model reaches a 25.8% win rate against GPT-4 Turbo on AlpacaEval 2.0 and outperforms several established systems. The work highlights language feedback as a key alignment signal and shows that reliance on human-annotated preferences can be reduced without sacrificing performance.

Abstract

Aligning Large Language Models (LLMs) with human intentions and values is crucial yet challenging. Current methods primarily rely on human preferences, which are costly and insufficient in capturing nuanced feedback expressed in natural language. In this paper, we present Self-Refinement Tuning (SRT), a method that leverages model feedback for alignment, thereby reducing reliance on human annotations. SRT uses a base language model (e.g., Tulu2) to generate initial responses, which are critiqued and refined by a more advanced model (e.g., GPT-4-Turbo). This process enables the base model to self-evaluate and improve its outputs, facilitating continuous learning. SRT further optimizes the model by learning from its self-generated feedback and refinements, creating a feedback loop that promotes model improvement. Our empirical evaluations demonstrate that SRT significantly outperforms strong baselines across diverse tasks and model sizes. When applied to a 70B parameter model, SRT increases the win rate from 9.6% to 25.8% on the AlpacaEval 2.0 benchmark, surpassing well-established systems such as GPT-4-0314, Claude 2, and Gemini. Our analysis highlights the crucial role of language feedback in the success of SRT, suggesting potential for further exploration in this direction.
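
The second stage described above combines self-generated refinements with DPO. The sketch below illustrates one plausible way the pieces fit together; it is not the authors' implementation. It assumes that each refined response is treated as the preferred ("chosen") answer and the initial response as the "rejected" one, and it computes the standard DPO loss for a single pair from sequence log-probabilities. The names `PreferencePair`, `build_pair`, and `dpo_loss`, as well as the default `beta=0.1`, are hypothetical choices made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class PreferencePair:
    """One training example for preference optimization."""
    prompt: str
    chosen: str    # ASSUMPTION: the refined response is preferred
    rejected: str  # ASSUMPTION: the initial response is dispreferred


def build_pair(prompt: str, initial: str, refined: str) -> PreferencePair:
    """Turn an (initial, refined) response pair into a preference pair."""
    return PreferencePair(prompt=prompt, chosen=refined, rejected=initial)


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # ASSUMPTION: illustrative value, not from the paper
) -> float:
    """Standard DPO loss for one pair: -log(sigmoid(margin)).

    margin = beta * [(log pi(chosen) - log pi_ref(chosen))
                     - (log pi(rejected) - log pi_ref(rejected))]
    """
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and the reference model assign identical log-probabilities, the margin is zero and the loss equals log 2; the loss falls below that value as the policy shifts probability mass toward the chosen response relative to the reference.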

Paper Structure (25 sections, 1 equation, 6 figures, 6 tables)

Figures (6)

  • Figure 1: Results on AlpacaEval 2.0. SRT significantly boosts the performance of the base Tulu2 models. We report the win rates against GPT-4 Turbo.
  • Figure 2: An overview of Self-Refinement Tuning (SRT). In the first stage (top), SRT teaches the base model to self-improve by fine-tuning it on the feedback and refinements from a powerful critic model. In the second stage (bottom), SRT enables the model to learn from its self-generated feedback and refinements.
  • Figure 3: The score distribution of 25K initial responses (left) and refined responses (right). The initial responses are generated by Tulu2-7B and are then refined and scored by GPT-4 Turbo using the template presented in the paper's feedback template table.
  • Figure 4: Self-Refinement vs. Re-Ranking over 16 candidates. The results are obtained on the AlpacaEval test set using models trained at the first stage of SRT.
  • Figure 5: Win rates against GPT-4 Turbo by varying numbers of training samples for SRT models.
  • ...and 1 more figure