Generalizing Trust: Weak-to-Strong Trustworthiness in Language Models

Martin Pawelczyk; Lillian Sun; Zhenting Qi; Aounon Kumar; Himabindu Lakkaraju

TL;DR

The paper examines whether trustworthiness properties of a smaller (weak) language model transfer to a larger (strong) model that is fine-tuned on the weak model's outputs. It introduces two training strategies: Weak TFT, which applies trustworthiness regularization when fine-tuning the weak model, and Weak+WTS TFT, which applies it when fine-tuning both the weak model and the strong model. Empirically, fairness, adversarial robustness, and OOD robustness transfer and improve significantly when both models are regularized, whereas privacy shows no sign of weak-to-strong transfer. These results indicate both the potential and the limitations of weak-to-strong training as a route to trustworthy models at scale.

Abstract

The rapid proliferation of generative AI, especially large language models, has led to their integration into a variety of applications. A key phenomenon known as weak-to-strong generalization - where a strong model trained on a weak model's outputs surpasses the weak model in task performance - has gained significant attention. Yet, whether critical trustworthiness properties such as robustness, fairness, and privacy can generalize similarly remains an open question. In this work, we study this question by examining if a stronger model can inherit trustworthiness properties when fine-tuned on a weaker model's outputs, a process we term weak-to-strong trustworthiness generalization. To address this, we introduce two foundational training strategies: 1) Weak Trustworthiness Finetuning (Weak TFT), which leverages trustworthiness regularization during the fine-tuning of the weak model, and 2) Weak and Weak-to-Strong Trustworthiness Finetuning (Weak+WTS TFT), which extends regularization to both weak and strong models. Our experimental evaluation on real-world datasets reveals that while some trustworthiness properties, such as fairness, adversarial, and OOD robustness, show significant improvement in transfer when both models were regularized, others like privacy do not exhibit signs of weak-to-strong trustworthiness. As the first study to explore trustworthiness generalization via weak-to-strong generalization, our work provides valuable insights into the potential and limitations of weak-to-strong generalization.
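The difference between the two strategies comes down to which fine-tuning stage carries the trustworthiness regularizer. The sketch below makes that explicit. It is only an illustration under stated assumptions: the abstract does not give the objective, so the additive form `task_loss + lam * trust_penalty` is assumed here (the figure captions mention a varying lambda), and the names `Strategy`, `objective`, and `stage_objectives` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    """Which fine-tuning stages apply the trustworthiness regularizer."""
    name: str
    regularize_weak: bool
    regularize_strong: bool


# Weak TFT: regularize only the weak model's fine-tuning.
WEAK_TFT = Strategy("Weak TFT", regularize_weak=True, regularize_strong=False)
# Weak+WTS TFT: regularize both the weak and the weak-to-strong fine-tuning.
WEAK_WTS_TFT = Strategy("Weak+WTS TFT", regularize_weak=True, regularize_strong=True)


def objective(task_loss, trust_penalty, lam, regularize):
    """Assumed additive objective: task loss plus a lambda-weighted penalty."""
    return task_loss + (lam * trust_penalty if regularize else 0.0)


def stage_objectives(strategy, weak_terms, strong_terms, lam):
    """Return (weak objective, strong objective) for a strategy.

    weak_terms and strong_terms are (task_loss, trust_penalty) pairs; the
    strong model's task loss is measured against the weak model's labels.
    """
    weak = objective(*weak_terms, lam, strategy.regularize_weak)
    strong = objective(*strong_terms, lam, strategy.regularize_strong)
    return weak, strong


if __name__ == "__main__":
    for s in (WEAK_TFT, WEAK_WTS_TFT):
        print(s.name, stage_objectives(s, (1.0, 0.5), (2.0, 0.25), lam=2.0))
```

With identical loss terms, the two strategies produce the same weak-model objective and differ only in whether the strong model's objective includes the penalty.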

Paper Structure (17 sections, 9 equations, 14 figures)

Introduction
Related Work
Methodology
Preliminaries
Eliciting Weak-to-Strong Trustworthiness in Large Language Models
Experimental Evaluation
Evaluating Trustworthiness of the Weak to Strong Model
Sensitivity Analysis
Conclusion
Weak to Strong Training Process
Training Objective for Weak+WTS TFT
Choosing the Hyperparameters Based on Trade-off Curves
Detailed Sensitivity Analysis
Dataset and Evaluation Details
Data Usage During Training and Evaluation
...and 2 more sections

Figures (14)

Figure 1: Weak-to-strong trustworthiness for Pythia 14M/410M models. Trustworthiness properties and task performance for our four properties: Fairness, OOD Robustness, Adversarial Robustness, and Privacy. Note that lower values are better for the top plot of the fairness panel, as the y-axis is Unfairness (DPD). Similarly, lower values are better for the top plot of the privacy panel, as the y-axis is Extraction Rate. Results for WTS-Aux-Loss for privacy are omitted since it was the only task involving free data generation, making the auxiliary loss function inapplicable.
Figure 2: Varying Lambda for Weak+WTS TFT. Results for WTS-Aux-Loss for privacy are omitted since it was the only task involving free data generation, making the auxiliary loss function inapplicable.
Figure A1: Trade-off between original and adversarial accuracy for different training parameters.
Figure A2: Varying Lambda for Weak TFT. Results for WTS-Aux-Loss for privacy are omitted since it was the only task involving free data generation, making the auxiliary loss function inapplicable.
Figure A3: Varying model size for fairness.
...and 9 more figures
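
The Figure 1 caption reports unfairness as DPD, where lower is better. Assuming DPD abbreviates demographic parity difference, a common definition is the gap between the highest and lowest positive-prediction rates across groups. The sketch below computes it for binary predictions. The function name and the toy data are illustrative, and the paper's exact evaluation protocol may differ.

```python
def demographic_parity_difference(predictions, groups):
    """Gap between the largest and smallest positive-prediction rates.

    predictions: iterable of 0/1 model outputs.
    groups: iterable of group identifiers, aligned with predictions.
    Returns 0.0 when every group receives positive predictions at the
    same rate; larger values indicate more unfairness under this metric.
    """
    predictions = list(predictions)
    groups = list(groups)
    if len(predictions) != len(groups):
        raise ValueError("predictions and groups must have equal length")

    rates = {}
    for g in set(groups):
        members = [p for p, a in zip(predictions, groups) if a == g]
        rates[g] = sum(members) / len(members)
    return max(rates.values()) - min(rates.values())


if __name__ == "__main__":
    preds = [1, 1, 0, 1, 0, 0, 0, 1]
    grps = [0, 0, 0, 0, 1, 1, 1, 1]
    # Group 0 rate is 3/4, group 1 rate is 1/4.
    print(demographic_parity_difference(preds, grps))
```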
