Your Weak LLM is Secretly a Strong Teacher for Alignment

Leitian Tao, Yixuan Li

TL;DR

This work investigates aligning large language models using feedback from weak, resource-efficient LLMs rather than relying solely on costly human annotations or ultra-large models. It formalizes a three-stage, semi-supervised framework: a weak supervisor is trained on a small labeled dataset, the supervisor then generates preference labels for a large unlabeled corpus, and the resulting weakly labeled dataset is used to train a target policy via Direct Preference Optimization (DPO). Across diverse model families and tasks, the authors show that weak-LLM-based feedback can match or exceed human feedback in alignment quality, with the supervisor's size exerting little influence on outcomes. They support this with extensive ablations and offer qualitative insights into when weak feedback diverges from or aligns with human preferences. The findings suggest a scalable approach to AI alignment that reduces human labor and computation while maintaining alignment quality, with practical implications for deploying safer, value-aligned LLM systems.
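The second and third stages described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `weak_label` and `dpo_loss` are hypothetical, the supervisor is assumed to be a scoring function over (prompt, response) pairs, and `beta=0.1` is an arbitrary illustrative default. The loss is the standard per-pair DPO objective, computed from sequence log-probabilities under the policy and a frozen reference model.

```python
import math
from typing import Callable, List, Tuple

# Hypothetical types for illustration only.
RewardFn = Callable[[str, str], float]   # scores a (prompt, response) pair
Pair = Tuple[str, str, str]              # (prompt, response_a, response_b)
Preference = Tuple[str, str, str]        # (prompt, chosen, rejected)


def weak_label(reward: RewardFn, unlabeled: List[Pair]) -> List[Preference]:
    """Label each response pair with the weak supervisor's preference.

    The response with the higher supervisor score becomes "chosen";
    ties go to the first response.
    """
    labeled = []
    for prompt, a, b in unlabeled:
        if reward(prompt, a) >= reward(prompt, b):
            labeled.append((prompt, a, b))
        else:
            labeled.append((prompt, b, a))
    return labeled


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value, not taken from the paper
) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    # Toy stand-in for a trained weak supervisor: prefers longer responses.
    toy_reward = lambda prompt, response: float(len(response))
    data = weak_label(toy_reward, [("q", "short", "a longer answer")])
    print(data)
    print(dpo_loss(-1.0, -5.0, -2.0, -2.0))
```

In a real pipeline the toy reward would be replaced by the weak LLM after it has been trained on the small labeled set, and the log-probabilities would come from the target policy and its reference copy.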

Abstract

The burgeoning capabilities of large language models (LLMs) have underscored the need for alignment to ensure these models act in accordance with human values and intentions. Existing alignment frameworks present constraints either in the form of expensive human effort or high computational costs. This paper explores a promising middle ground, where we employ a weak LLM that is significantly less resource-intensive than top-tier models, yet offers more automation than purely human feedback. We present a systematic study to evaluate and understand weak LLM's ability to generate feedback for alignment. Our empirical findings demonstrate that weak LLMs can provide feedback that rivals or even exceeds that of fully human-annotated data. Our study indicates a minimized impact of model size on feedback efficacy, shedding light on a scalable and sustainable alignment strategy. To deepen our understanding of alignment under weak LLM feedback, we conduct a series of qualitative and quantitative analyses, offering novel insights into the quality discrepancies between human feedback vs. weak LLM feedback.

Paper Structure (44 sections, 7 equations, 7 figures, 12 tables)

Figures (7)

  • Figure 1: A spectrum of feedback for aligning LLMs, ranging from labor-intensive human annotations (e.g., RLHF; ouyang2022training) to highly automated, resource-intensive LLM feedback (e.g., RLAIF; bai2022training, lee2023rlaif). Our work explores the largely untapped middle ground, evaluating weak LLM feedback for alignment.
  • Figure 2: (a) Alignment with feedback from a weak LLM (OPT-125M) can outperform human feedback. (b) Alignment performance of the OPT-1.3B model under varying supervisor capability. See Section \ref{sec:exp_results} for details.
  • Figure 3: (a) Results on different model families. (b) GPT-4 evaluation for different models aligned with weak LLM feedback vs. human feedback.
  • Figure 4: (a) Results under different data sizes. (b) Results on different tasks (TL;DR).
  • Figure 5: (a) Evaluation with the gold reward model reward-model-deberta-v3-large-v2. (b) Evaluation with the gold reward model RM-Mistral-7B.
  • ...and 2 more figures

Theorems & Definitions (1)

  • Definition 2.1: Preference data