Improving Weak-to-Strong Generalization with Reliability-Aware Alignment

Yue Guo, Yi Yang

TL;DR

This work improves weak-to-strong generalization by incorporating the reliability of weak supervision signals into the alignment process, yielding error-robust alignment techniques that reduce error propagation from noisy supervision and enhance the accuracy and reliability of LLMs.

Abstract

Large language models (LLMs) are now rapidly advancing and surpassing human abilities on many natural language tasks. However, aligning these super-human LLMs with human knowledge remains challenging because the supervision signals from human annotators may be wrong. This issue, known as the "super-alignment" problem, requires enhancing weak-to-strong generalization, where a strong LLM must generalize from imperfect supervision provided by a weaker source. To address this issue, we propose an approach that improves weak-to-strong generalization by incorporating the reliability of weak supervision signals into the alignment process. In our method, we query the weak supervisor for multiple answers, estimate the answer reliability, and enhance the alignment process by filtering out uncertain data or re-weighting reliable data. Experiments on four datasets demonstrate that our methods effectively identify the quality of weak labels and significantly enhance weak-to-strong generalization. Our work presents effective techniques for error-robust model alignment, reducing error propagation from noisy supervision and enhancing the accuracy and reliability of LLMs. Code is publicly available at http://github.com/Irenehere/ReliableAlignment.
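The pipeline described in the abstract (sample multiple answers from the weak supervisor, score how much they agree, then drop uncertain examples and weight the rest) can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the entropy-based score, the threshold value, and the function names are assumptions.

```python
import math
from collections import Counter

def uncertainty(answers):
    """Entropy of the weak supervisor's answer distribution.

    Low entropy means the weak model agrees with itself across samples,
    which we treat as a proxy for a more reliable weak label.
    """
    counts = Counter(answers)
    total = len(answers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def filter_and_weight(examples, threshold=0.8):
    """Keep examples whose sampled weak labels look reliable.

    `examples` maps each question to the list of answers sampled from the
    weak supervisor. Returns {question: (majority_label, weight)}, where
    the weight is the fraction of samples agreeing with the majority
    (usable as a per-example loss weight during SFT of the strong model).
    The 0.8-bit threshold is illustrative, not taken from the paper.
    """
    kept = {}
    for question, answers in examples.items():
        if uncertainty(answers) > threshold:
            continue  # too uncertain: filter out of the SFT data
        label, votes = Counter(answers).most_common(1)[0]
        kept[question] = (label, votes / len(answers))
    return kept
```

For example, a question answered "A" in all four samples has entropy 0 and is kept with weight 1.0, while a question with four different answers has entropy 2.0 bits and is filtered out.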

Paper Structure

This paper contains 25 sections, 6 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Comparison of naive and reliability-enhanced weak-to-strong alignment approaches. The naive weak-to-strong alignment method trains the strong model using the weak labels. Our improved method incorporates reliability estimation on the multiple answers and enhances the alignment process by considering the label reliability, leading to a more accurate response.
  • Figure 2: The method for enhancing the reliability of the weak-to-strong model alignment. First, we query the weak model for multiple weak labels. Then, we perform supervised fine-tuning (SFT) on the strong model with the uncertainty-filtering and reliability re-weighting methods.
  • Figure 3: Relationship between entropy-based uncertainty scores and weak label accuracy in the Hellaswag (top) and MMLU (bottom) datasets using Llama-7B as the weak supervisor. The x-axis represents entropy values, while the y-axis shows the count of correct and incorrect weak labels. The accuracy of the weak labels in each entropy group is plotted above the corresponding bar. The accuracy monotonically decreases as the entropy increases.
  • Figure 4: Heatmap displaying the average reliability scores of the Llama2-7B model's predictions against different ground truth labels for the Hellaswag (top) and MMLU (bottom) validation sets. The x-axis represents the ground truth labels, and the y-axis represents the weak labels predicted by the model. Each cell shows the average reliability score for the corresponding prediction. The highest reliability scores are observed where the predicted labels match the ground truth labels.
  • Figure 5: Relationship between entropy-based uncertainty scores and weak label accuracy in the ETHICS-commonsense (left) and GSM8K (right) datasets using Llama-7B as the weak supervisor.
  • ...and 1 more figure
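Figures 3 and 5 group weak labels into entropy bins and report the accuracy inside each bin. A minimal sketch of that diagnostic is shown below; the binning scheme and all names are illustrative assumptions, not the paper's code.

```python
import math
from collections import Counter

def entropy(answers):
    """Entropy (in bits) of a list of sampled weak answers."""
    counts = Counter(answers).values()
    n = len(answers)
    return -sum((c / n) * math.log2(c / n) for c in counts)

def accuracy_by_entropy_bin(samples, gold, n_bins=4, max_entropy=2.0):
    """Bin examples by the entropy of their sampled weak answers and
    report the majority-vote accuracy inside each bin.

    `samples[i]` is the list of answers drawn for example i, and
    `gold[i]` is its ground-truth label. Returns {bin_index: accuracy},
    with None for empty bins. If entropy decreases accuracy (as in the
    paper's figures), the per-bin accuracy should fall as the bin index rises.
    """
    hits = [0] * n_bins
    totals = [0] * n_bins
    for answers, truth in zip(samples, gold):
        h = entropy(answers)
        b = min(int(h / max_entropy * n_bins), n_bins - 1)
        majority = Counter(answers).most_common(1)[0][0]
        totals[b] += 1
        hits[b] += majority == truth
    return {b: (hits[b] / totals[b] if totals[b] else None) for b in range(n_bins)}
```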