Bayesian WeakS-to-Strong from Text Classification to Generation

Ziyun Cui, Ziyang Zhang, Guangzhi Sun, Wen Wu, Chao Zhang

TL;DR

This work extends Weak-to-Strong generalization by introducing Bayesian WeakS-to-Strong, an ensemble-based framework that uses multiple weak models to mimic diverse human opinions and estimates a distribution over weak labels with evidential deep learning. It generalizes the approach from text classification to text generation by deriving token-level soft labels through a word-bridge mechanism and employs direct preference optimization, including a conservative variant, to refine the strong model’s behavior. Empirical results on SciQ, SLURP, and CosmosQA demonstrate that Bayesian WeakS-to-Strong consistently outperforms naive ensembles and other baselines, with gains in both classification accuracy and generation alignment (SLU-F1 and PGR). The findings highlight the value of modeling opinion diversity and uncertainty in supervision for robust, trustworthy strong models, with implications for scalable superalignment in future AI systems.
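
For reference, PGR is commonly expanded as "performance gap recovered" in the Weak-to-Strong literature. The summary above does not define it, so the form below is an assumption based on that common usage. The symbols $P_{\text{weak}}$ (weak supervisor performance), $P_{\text{w2s}}$ (strong student trained on weak labels), and $P_{\text{ceiling}}$ (strong model trained on ground truth) are illustrative names, not taken from this page.

$$
\text{PGR} = \frac{P_{\text{w2s}} - P_{\text{weak}}}{P_{\text{ceiling}} - P_{\text{weak}}}
$$

Under this reading, a PGR of 1 would mean the student fully closes the gap to the strong ceiling, and a PGR of 0 would mean it does no better than its weak supervisor.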

Abstract

Advances in large language models raise the question of how alignment techniques will adapt as models become increasingly complex and humans will only be able to supervise them weakly. Weak-to-Strong mimics such a scenario where weak model supervision attempts to harness the full capabilities of a much stronger model. This work extends Weak-to-Strong to WeakS-to-Strong by exploring an ensemble of weak models which simulate the variability in human opinions. Confidence scores are estimated using a Bayesian approach to guide the WeakS-to-Strong generalization. Furthermore, we extend the application of WeakS-to-Strong from text classification tasks to text generation tasks where more advanced strategies are investigated for supervision. Moreover, direct preference optimization is applied to advance the student model's preference learning, beyond the basic learning framework of teacher forcing. Results demonstrate the effectiveness of the proposed approach for the reliability of a strong student model, showing potential for superalignment.

Paper Structure

This paper contains 30 sections, 13 equations, 4 figures, and 11 tables.

Figures (4)

  • Figure 1: An overview diagram of the three ensemble approaches: (a) Naive Multi-Weak: directly learn all weak labels produced by weak models, (b) Joint Decoding: weak models collaboratively determine one single target, (c) Bayesian Multi-Weak: learn a prior distribution over weak labels.
  • Figure 2: The process of transforming per-token confidence scores from the sequence tokenized by the weak model to the sequence tokenized by the strong model. The word "hello" is used as an example. Stage 1: The words and word scores are obtained from the weak model wordpieces and their scores. Stage 2: The words are tokenized by the strong model tokenizer, and the tokenized sequences are fed into the strong model to obtain the strong model's predicted probability (denoted as confidence) for each token $s_i$. This strong model confidence is then used to split word scores into target wordpiece probabilities $P(s_i)$ while keeping the probability of the word unchanged. Stage 3: The obtained target probability is transformed into the label. Probabilities of other categories are calculated by scaling the strong output distribution using $P(s_i)$.
  • Figure 3: Agreement of weak models. The similarity between classification models was assessed by calculating the accuracy of each model's predictions against the others on the test set. For generation models, the agreement is obtained through the Levenshtein distance.
  • Figure :
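
The three stages described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the caption does not give the exact formulas, so the sketch assumes that a word's score is the product of its weak-model wordpiece probabilities, that the word score is split across strong-model tokens in proportion to the strong model's log-confidences, and that the non-target classes share the remaining mass in proportion to the strong output distribution. All function names are hypothetical.

```python
import math

def word_score(weak_piece_probs):
    # Stage 1 (assumption): the word score is the product of the
    # weak model's wordpiece probabilities for that word.
    return math.prod(weak_piece_probs)

def split_word_score(p_word, strong_confidences):
    # Stage 2 (assumed rule): share log(p_word) across the strong
    # model's tokens in proportion to their log-confidences, so the
    # product of the per-token targets P(s_i) still equals p_word.
    # Confidences are expected to lie in (0, 1].
    logs = [math.log(c) for c in strong_confidences]
    total = sum(logs)
    if total == 0.0:
        # Every confidence is 1, so split the word score evenly.
        weights = [1.0 / len(logs)] * len(logs)
    else:
        weights = [value / total for value in logs]
    return [p_word ** w for w in weights]

def soft_label(target_prob, strong_dist, target_index):
    # Stage 3 (assumed rule): the target token receives target_prob and
    # the other classes keep the strong model's relative proportions,
    # rescaled to fill the remaining 1 - target_prob.
    rest = 1.0 - strong_dist[target_index]
    scale = (1.0 - target_prob) / rest if rest > 0 else 0.0
    return [
        target_prob if k == target_index else p * scale
        for k, p in enumerate(strong_dist)
    ]

# Toy example: one word split into two weak wordpieces and two strong tokens.
p_word = word_score([0.9, 0.8])
targets = split_word_score(p_word, [0.5, 0.25])
label = soft_label(targets[0], [0.5, 0.3, 0.2], target_index=0)
```

The log-proportional split is only one rule that preserves the word probability; an even split in log space would satisfy the same constraint while ignoring the strong model's confidence.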