A Critical Evaluation of AI Feedback for Aligning Large Language Models

Archit Sharma; Sedrick Keh; Eric Mitchell; Chelsea Finn; Kushal Arora; Thomas Kollar

TL;DR

This work critically reexamines the purported benefits of reinforcement learning from AI feedback (RLAIF) for aligning large language models. Across multiple base models and evaluation setups, the authors find that gains attributed to RLAIF largely arise from a mismatch between the quality of the teacher that produces the SFT data and the quality of the critic that produces the AI feedback, and that SFT on completions from a strong teacher (e.g., GPT-4) can match or exceed RLAIF. The results challenge the assumption that the two-stage RLHF/RLAIF pipeline universally outperforms supervised fine-tuning, and they offer mechanistic explanations and practical guidance on dataset construction, evaluation, and when AI feedback is truly beneficial. The paper thus calls for careful consideration of data distributions and model-critic dynamics in future RLAIF and RLHF research, and for updating datasets as stronger LLMs become available.

Abstract

Reinforcement learning with AI feedback (RLAIF) is a popular paradigm for improving the instruction-following abilities of powerful pre-trained language models. RLAIF first performs supervised fine-tuning (SFT) using demonstrations from a teacher model and then further fine-tunes the model with reinforcement learning (RL), using feedback from a critic model. While recent popular open-source models have demonstrated substantial improvements in performance from the RL step, in this paper we question whether the complexity of this RL step is truly warranted for AI feedback. We show that the improvements of the RL step are virtually entirely due to the widespread practice of using a weaker teacher model (e.g., GPT-3.5) for SFT data collection than the critic (e.g., GPT-4) used for AI feedback generation. Specifically, we show that simple supervised fine-tuning with GPT-4 as the teacher outperforms existing RLAIF pipelines. More generally, we find that the gains from RLAIF vary substantially across base model families, test-time evaluation protocols, and critic models. Finally, we provide a mechanistic explanation for when SFT may outperform the full two-step RLAIF pipeline as well as suggestions for making RLAIF maximally useful in practice.

Paper Structure (15 sections, 4 equations, 7 figures, 1 table)

This paper contains 15 sections, 4 equations, 7 figures, 1 table.

Introduction
Related Work
Algorithmic Overview of LLM Fine-tuning
Experiments
RLAIF Setup and Data Scaling for SFT
Fine-tuning under Different Target Distributions
Possible Mechanistic Explanations for the Ineffectiveness of RLAIF
Going Forward
Conclusion
Acknowledgements
Detailed Experimental Setup and Hyperparameters
Data Processing
Training Settings and Hyperparameters
Preference Labeling and Evaluation Prompt Templates
Additional Results

Figures (7)

Figure 1: Supervised fine-tuning (SFT) on strong teachers can account for improvements from reinforcement learning from AI feedback (RLAIF). RLAIF from strong models such as GPT-4 can result in substantially better instruction-following LLMs than SFT alone on popular datasets such as ShareGPT (Chiang et al., 2023) constructed using GPT-3.5 completions (GPT-3.5 SFT + GPT-4 AIF). However, simply performing SFT on completions from GPT-4 can result in a better model (GPT-4 SFT), suggesting that the improvement in performance from RLAIF is partly because the default ShareGPT completions are from a weak teacher (GPT-3.5). Furthermore, RLAIF (GPT-4 SFT + GPT-4 AIF) does not result in a significantly better model compared to GPT-4 SFT alone.
Figure 2: The performance improvements from increasing the number of training points. SFT on 100% of the training prompts yields minimal improvements over SFT on 10% of the training prompts. Hence, for our RLAIF setting, we first perform SFT on only 10% of the training examples, and we use the remaining 90% for RLAIF.
Figure 3: SFT can perform comparably to or better than RLAIF across various model sizes and classes. The same set of prompts is used for all three settings and for each model, and the oracle LLM either generates a completion (for SFT) or a preference label on two completions (for RLAIF). For RLAIF, SFT is an important precursor, so we perform SFT on 10% of the total prompts, and RLAIF is done on the remaining 90%. For the other settings, the full set of prompts is used for SFT. While RLAIF improves the performance compared to SFT on the default ShareGPT completions, SFT on GPT-4 completions consistently matches or outperforms RLAIF.
Figure 4: We make a similar observation that SFT performs comparably to RLAIF when Claude is used as an oracle. RLAIF with AI feedback from Claude does not significantly outperform SFT on Claude completions, and the performance improvement from RLAIF is explained by the use of a weaker SFT target distribution (GPT-3.5). The results on the effectiveness of SFT may apply more generally to strong LLMs beyond GPT-4.
Figure 5: SFT on a strong distribution minimizes any improvements from RLAIF. We consider RLAIF that uses GPT-4 completions as the target distribution for SFT in the first step and also uses GPT-4 to generate preference labels for RL. We find surprisingly minimal improvements in performance compared to SFT on completions sampled from GPT-4 for all prompts. This experiment suggests that SFT on the target completions from the strong critic can potentially minimize any further gains from RLAIF.
...and 2 more figures
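
The data setup described in the Figure 2 and Figure 3 captions (SFT on 10% of the prompts, AI feedback on the remaining 90%, versus SFT on all prompts) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the record fields (`prompt`, `completion`, `chosen`, `rejected`), and the stub teacher, policy, and critic below are all hypothetical stand-ins, and the actual training steps are omitted.

```python
import random
from typing import Callable, Dict, List, Tuple


def split_prompts(
    prompts: List[str], sft_fraction: float = 0.1, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Shuffle prompts and split them into (SFT prompts, RLAIF prompts)."""
    if not 0.0 < sft_fraction <= 1.0:
        raise ValueError("sft_fraction must be in (0, 1]")
    shuffled = list(prompts)
    random.Random(seed).shuffle(shuffled)
    n_sft = max(1, round(len(shuffled) * sft_fraction))
    return shuffled[:n_sft], shuffled[n_sft:]


def build_sft_data(
    prompts: List[str], teacher: Callable[[str], str]
) -> List[Dict[str, str]]:
    """SFT data: the teacher (oracle) writes one completion per prompt."""
    return [{"prompt": p, "completion": teacher(p)} for p in prompts]


def build_preference_data(
    prompts: List[str],
    policy_sample: Callable[[str], str],
    critic_prefers_first: Callable[[str, str, str], bool],
) -> List[Dict[str, str]]:
    """AI-feedback data: the critic labels a preference between two samples."""
    data = []
    for p in prompts:
        a, b = policy_sample(p), policy_sample(p)
        chosen, rejected = (a, b) if critic_prefers_first(p, a, b) else (b, a)
        data.append({"prompt": p, "chosen": chosen, "rejected": rejected})
    return data


# Hypothetical stubs standing in for real model calls (assumptions only).
_rng = random.Random(0)
teacher = lambda p: f"teacher answer to: {p}"
policy_sample = lambda p: f"draft {_rng.randint(0, 999)} for: {p}"
critic_prefers_first = lambda p, a, b: len(a) >= len(b)  # toy preference rule

prompts = [f"prompt {i}" for i in range(100)]

# Two-step setting: SFT on 10% of prompts, preference labels on the other 90%.
sft_prompts, rlaif_prompts = split_prompts(prompts, sft_fraction=0.1)
sft_small = build_sft_data(sft_prompts, teacher)
preferences = build_preference_data(rlaif_prompts, policy_sample, critic_prefers_first)

# SFT-only setting: the teacher completes every prompt.
sft_full = build_sft_data(prompts, teacher)
```

The split is disjoint, so the two-step setting and the SFT-only setting draw on the same prompt set and differ only in what the oracle supplies for each prompt: a completion or a preference label.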
