Generalizing Trust: Weak-to-Strong Trustworthiness in Language Models

Martin Pawelczyk, Lillian Sun, Zhenting Qi, Aounon Kumar, Himabindu Lakkaraju

TL;DR

The paper examines whether trustworthiness properties learned by a smaller (weak) language model can transfer to a larger (strong) model through weak-to-strong supervision. It introduces two training strategies: Weak TFT, which applies trustworthiness regularization while fine-tuning the weak model, and Weak+WTS TFT, which applies it both there and during the weak-to-strong fine-tuning of the strong model. Empirically, fairness, adversarial robustness, and OOD robustness transfer and improve when both models are regularized, while privacy shows no sign of weak-to-strong transfer. These results show both the potential and the limitations of scaling trustworthiness through weak-to-strong training.

Abstract

The rapid proliferation of generative AI, especially large language models, has led to their integration into a variety of applications. A key phenomenon known as weak-to-strong generalization - where a strong model trained on a weak model's outputs surpasses the weak model in task performance - has gained significant attention. Yet, whether critical trustworthiness properties such as robustness, fairness, and privacy can generalize similarly remains an open question. In this work, we study this question by examining if a stronger model can inherit trustworthiness properties when fine-tuned on a weaker model's outputs, a process we term weak-to-strong trustworthiness generalization. To address this, we introduce two foundational training strategies: 1) Weak Trustworthiness Finetuning (Weak TFT), which leverages trustworthiness regularization during the fine-tuning of the weak model, and 2) Weak and Weak-to-Strong Trustworthiness Finetuning (Weak+WTS TFT), which extends regularization to both weak and strong models. Our experimental evaluation on real-world datasets reveals that while some trustworthiness properties, such as fairness, adversarial, and OOD robustness, show significant improvement in transfer when both models were regularized, others like privacy do not exhibit signs of weak-to-strong trustworthiness. As the first study to explore trustworthiness generalization via weak-to-strong generalization, our work provides valuable insights into the potential and limitations of weak-to-strong generalization.
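
The abstract does not spell out the form of the trustworthiness regularization. As a rough illustration only, the sketch below assumes the common pattern of a task loss plus a penalty weighted by a coefficient `lam` (the figure captions mention varying Lambda, but the exact objective is not given here). The function names, the binary-label setup, and all numbers are hypothetical and are not taken from the paper.

```python
import math

def cross_entropy(p_pred, y_target):
    """Binary cross-entropy for one example; p_pred is the predicted P(label = 1)."""
    eps = 1e-12
    p = min(max(p_pred, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y_target * math.log(p) + (1 - y_target) * math.log(1.0 - p))

def regularized_loss(probs, targets, penalty, lam):
    """Mean task loss plus lam times a trustworthiness penalty (assumed additive form)."""
    task = sum(cross_entropy(p, y) for p, y in zip(probs, targets)) / len(probs)
    return task + lam * penalty

# Hypothetical values for illustration only.
weak_labels = [1, 0, 1, 1]            # labels produced by the fine-tuned weak model
strong_probs = [0.9, 0.2, 0.7, 0.6]   # strong model's predicted P(label = 1)
penalty = 0.25                        # stand-in for an unfairness or robustness penalty

loss_plain = regularized_loss(strong_probs, weak_labels, penalty, lam=0.0)
loss_reg = regularized_loss(strong_probs, weak_labels, penalty, lam=0.5)
```

Under this assumed form, Weak TFT would correspond to using `lam > 0` only when fine-tuning the weak model, and Weak+WTS TFT to using it again when the strong model is trained on the weak model's labels.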

Paper Structure (17 sections, 9 equations, 14 figures)

Figures (14)

  • Figure 1: Weak-to-strong trustworthiness for Pythia 14M/410M models. Trustworthiness and task performance for our four properties: Fairness, OOD Robustness, Adversarial Robustness, and Privacy. Note that lower values are better for the top plot of the fairness panel, as its y-axis is Unfairness (DPD). Similarly, lower values are better for the top plot of the privacy panel, as its y-axis is Extraction Rate. Results for WTS-Aux-Loss for privacy are omitted since it was the only task involving free data generation, making the auxiliary loss function inapplicable.
  • Figure 2: Varying Lambda for Weak+WTS TFT. Results for WTS-Aux-Loss for privacy are omitted since it was the only task involving free data generation, making the auxiliary loss function inapplicable.
  • Figure A1: Trade-off between original and adversarial accuracy for different training parameters.
  • Figure A2: Varying Lambda for Weak TFT. Results for WTS-Aux-Loss for privacy are omitted since it was the only task involving free data generation, making the auxiliary loss function inapplicable.
  • Figure A3: Varying model size for fairness.
  • ...and 9 more figures
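
The Figure 1 caption reports unfairness as DPD, where lower is better. Assuming DPD stands for demographic parity difference in its standard form (the gap in positive-prediction rates between groups), a minimal sketch of the metric could look like the following. The function name and the toy data are hypothetical, and the paper's exact evaluation protocol may differ.

```python
def demographic_parity_difference(predictions, groups):
    """Largest gap in positive-prediction rate across groups (0 means parity)."""
    rates = {}
    for g in set(groups):
        members = [p for p, gg in zip(predictions, groups) if gg == g]
        rates[g] = sum(members) / len(members)  # fraction predicted positive
    return max(rates.values()) - min(rates.values())

# Toy data for illustration: binary predictions and a binary sensitive attribute.
preds = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

dpd = demographic_parity_difference(preds, groups)
# Group "a" has a positive rate of 3/4 and group "b" of 1/4, so dpd is 0.5.
```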