Efficient Model-agnostic Alignment via Bayesian Persuasion

Fengshuo Bai; Mingzhi Wang; Zhaowei Zhang; Boyuan Chen; Yinda Xu; Ying Wen; Yaodong Yang

TL;DR

This work reframes LLM alignment as Bayesian persuasion between a small Advisor and large Receivers, enabling a parameter-efficient, model-agnostic signaling approach that does not require retraining large models. By optimizing signals under a prior $\mu_x^0$ to influence Receiver beliefs $\mu_x^g$, the framework achieves improved outputs with a sublinear regret bound $R_T = O(m^{3/2} \sqrt{T \log T})$ and notable empirical gains on math and code tasks across diverse Receivers. Theoretical and empirical results demonstrate that a lightweight Advisor (e.g., GPT-2 or Phi-2) can meaningfully enhance the performance of fixed large models, suggesting a scalable direction for alignment that reduces ground-truth labeling needs. Overall, the paper advocates information-design principles as a practical and generalizable route to efficient alignment with black-box models, with robust potential for broader tasks.

Abstract

With recent advancements in large language models (LLMs), alignment has emerged as an effective technique for keeping LLMs consistent with human intent. Current methods primarily involve direct training through Supervised Fine-tuning (SFT) or Reinforcement Learning from Human Feedback (RLHF), both of which require substantial computational resources and extensive ground-truth data. This paper explores an efficient method for aligning black-box large models using smaller models, introducing a model-agnostic and lightweight Bayesian Persuasion Alignment framework. We formalize this problem as an optimization of the signaling strategy from the small model's perspective. In the persuasion process, the small model (Advisor) observes the information item (i.e., state) and persuades the large model (Receiver) to elicit improved responses. The Receiver then generates a response based on the input, the signal from the Advisor, and its updated belief about the information item. We demonstrate that an Advisor trained with our framework can significantly enhance the performance of various Receivers across a range of tasks. We theoretically analyze our persuasion framework and provide an upper bound on the Advisor's regret, confirming its effectiveness in learning the optimal signaling strategy. Our empirical results demonstrate that GPT-2 can significantly improve the performance of various models, achieving an average enhancement of 16.1% in mathematical reasoning ability and 13.7% in code generation. We hope our work can provide an initial step toward rethinking the alignment framework from the Bayesian Persuasion perspective.
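The belief update mentioned in the abstract can be illustrated with the standard Bayes rule used in Bayesian persuasion, $\mu^g(\theta) \propto \mu^0(\theta)\,\pi(g \mid \theta)$. The sketch below is a minimal toy illustration, not the paper's implementation: the function name `posterior`, the two states, the two signals, and all probabilities are hypothetical values chosen for the example.

```python
def posterior(prior, signaling, signal):
    """Bayes update: mu^g(state) is proportional to mu^0(state) * pi(signal | state).

    prior:     dict mapping state -> prior probability mu^0(state)
    signaling: dict mapping state -> {signal: pi(signal | state)}
    signal:    the realized signal g sent by the Advisor
    """
    joint = {s: prior[s] * signaling[s].get(signal, 0.0) for s in prior}
    total = sum(joint.values())  # marginal probability of the signal
    if total == 0.0:
        raise ValueError("signal has zero probability under the prior")
    return {s: p / total for s, p in joint.items()}


# Hypothetical toy instance (assumed values, not taken from the paper).
prior = {"helpful": 0.3, "unhelpful": 0.7}
signaling = {
    "helpful": {"recommend": 1.0, "skip": 0.0},
    "unhelpful": {"recommend": 0.25, "skip": 0.75},
}

post = posterior(prior, signaling, "recommend")
print(post)
```

In this toy instance, observing the signal `recommend` raises the Receiver's belief in the `helpful` state above its prior of 0.3, which is the sense in which a signal shifts $\mu^0$ toward $\mu^g$.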

Paper Structure (28 sections, 4 theorems, 12 equations, 3 figures, 2 tables, 1 algorithm)


Introduction
Related Work
Bayesian Persuasion Alignment
Protocol and Notations
Signaling Strategy and Belief Update
Signaling Strategy Optimization
Regret Analysis
Experiments
Settings
Results
Performance on Persuasion
Impact on Information Structure
Easy-to-Hard Generalization
Efficiency on Persuasion
Conclusion
...and 13 more sections

Key Result

Theorem 1

Algorithm 1 (online Bayesian persuasion) guarantees regret $R_T = O(m^{3/2} \sqrt{T \log T})$, where $m = |{\mathcal{Y}}|$ is the number of the Receiver's responses.
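To see why this bound is sublinear, note that dividing by $T$ gives an average regret of order $m^{3/2}\sqrt{\log T / T}$, which vanishes as $T$ grows. The short sketch below evaluates the expression numerically. The constant `c` hidden by the $O(\cdot)$ notation is unknown and set to 1 purely for illustration, and the value of `m` is hypothetical.

```python
import math


def regret_bound(m, T, c=1.0):
    """Evaluate c * m^(3/2) * sqrt(T * log T).

    c stands in for the unknown constant hidden by O(.); c=1 is an assumption.
    """
    return c * m ** 1.5 * math.sqrt(T * math.log(T))


m = 4  # hypothetical number of Receiver responses
horizons = [10**2, 10**4, 10**6]
avg = [regret_bound(m, T) / T for T in horizons]

for T, a in zip(horizons, avg):
    print(f"T={T:>8}  bound/T={a:.4f}")
```

The per-round value shrinks as the horizon grows, while the dependence on $m$ remains polynomial.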

Figures (3)

Figure 1: An illustration of our persuasion framework. The Receiver observes a signal $g$ sent by the Advisor and updates its belief from the prior distribution $\mu^0$ to a posterior distribution $\mu^g$. The axes depict the Advisor's expected utility $\hat{v}(\mu)$ across various information distributions $\mu$. In this context, $co(\hat{v})$ denotes the convex hull of $\hat{v}$, while $V$ represents the concave closure of $\hat{v}$. Here, $V(\cdot)$ is the largest expected utility the Advisor can achieve with any signal. From the Advisor's perspective, the Receiver's performance is enhanced following persuasion.
Figure 2: Performance of the Receiver on easy and hard problems. The Advisor (GPT-2 and Phi-2) is trained on the easy problems of the training set (MATH level 1-3), and we observe that the capability of the Receiver improves greatly on both easy and hard tasks with the persuasion signal of the Advisor. In both subfigures, our method outperforms scenarios where no information or only prior information is given, and it even surpasses scenarios where all information is provided for most Receiver models.
Figure 3: Average Relative Accuracy Improvement on GSM8K. We compare two posterior structures from the Advisor with 'All Information' and 'Prior Information'. The left y-axis represents the increase in prompt token length relative to the 'No Information' setting, while the right y-axis displays the average accuracy across various models on GSM8K.
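The concave closure $V$ in the Figure 1 caption can be made concrete for a binary state, where a belief is a single number $p \in [0, 1]$. The sketch below computes the upper concave envelope of a utility curve sampled on a grid. It is an illustration under stated assumptions, not the paper's code: the step-shaped utility `v_hat` is the textbook Bayesian persuasion example rather than anything from the paper, and `concave_closure` is a hypothetical helper name.

```python
def concave_closure(xs, ys):
    """Upper concave envelope of points (xs[i], ys[i]); xs must be strictly increasing.

    Returns the envelope evaluated at every xs[i].
    """
    hull = []  # indices of the vertices on the upper hull, left to right
    for i in range(len(xs)):
        while len(hull) >= 2:
            a, b = hull[-2], hull[-1]
            cross = (xs[b] - xs[a]) * (ys[i] - ys[a]) - (ys[b] - ys[a]) * (xs[i] - xs[a])
            if cross >= 0:  # b lies on or below the segment a -> i, so drop it
                hull.pop()
            else:
                break
        hull.append(i)

    env = list(ys)
    # Linearly interpolate between consecutive hull vertices.
    for a, b in zip(hull, hull[1:]):
        for i in range(a, b + 1):
            t = (xs[i] - xs[a]) / (xs[b] - xs[a])
            env[i] = (1 - t) * ys[a] + t * ys[b]
    return env


# Hypothetical utility: the Advisor gains 1 only if the belief reaches 0.5.
grid = [i / 10 for i in range(11)]
v_hat = [1.0 if p >= 0.5 else 0.0 for p in grid]
V = concave_closure(grid, v_hat)

prior_index = 3  # prior belief p = 0.3
print("v_hat at prior:", v_hat[prior_index], "V at prior:", V[prior_index])
```

The gap between `V` and `v_hat` at the prior is the most the Advisor can gain from signaling in this toy setting, and it is zero wherever the two curves coincide.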

Theorems & Definitions (8)

Theorem 1
Definition 1: Linear Map
Theorem 2: Carathéodory's Theorem [cook1972caratheodory]
Theorem 3: [bernasconi2023optimal, Theorem 3.2]
proof
Corollary 1: [bernasconi2023optimal, Corollary 3.4]
proof
proof
