Information Theoretic Guarantees For Policy Alignment In Large Language Models

Youssef Mroueh

TL;DR

This work provides a rigorous information-theoretic analysis of policy alignment for large language models under RLHF and best-of-$n$ sampling. It establishes that reward improvements are tightly governed by tail properties of the reference reward via transportation inequalities, yielding $\sqrt{\mathsf{KL}}$-type bounds under sub-Gaussian tails and extending to $f$-divergences and Rényi divergences. A key technical contribution is the reduction to exponential order statistics and the data-processing inequality, which enables exact KL bounds for best-of-$n$ and their generalization beyond finite alphabets. The authors also develop tail-adaptive transport bounds using Rényi divergence to potentially tighten the limits, and they analyze the transfer of these inequalities from proxy rewards to golden rewards, explaining Goodhart-like deterioration due to overestimation. Overall, the results delineate fundamental limits and guide practical design of alignment objectives and evaluation.

Abstract

Policy alignment of large language models refers to constrained policy optimization, where the policy is optimized to maximize a reward while staying close to a reference policy with respect to an $f$-divergence such as the $\mathsf{KL}$ divergence. The best-of-$n$ alignment policy selects the sample with the maximum reward among $n$ independent samples from the reference policy. For both cases (policy alignment and best of $n$), recent works showed empirically that the reward improvement of the aligned policy over the reference policy scales like $\sqrt{\mathsf{KL}}$, with an explicit bound in $n$ on the $\mathsf{KL}$ for the best-of-$n$ policy. We show in this paper that the $\sqrt{\mathsf{KL}}$ information-theoretic upper bound holds if the reward under the reference policy has sub-Gaussian tails. Moreover, we prove for the best-of-$n$ policy that the $\mathsf{KL}$ upper bound can be obtained for any $f$-divergence via a reduction to exponential order statistics, owing to the Rényi representation of order statistics, and a data processing inequality. If additional information is known about the tails of the aligned policy, we show that tighter control on the reward improvement can be obtained via the Rényi divergence. Finally, we demonstrate how these upper bounds transfer from proxy rewards to golden rewards, which results in a decrease in the golden reward improvement due to overestimation and approximation errors of the proxy reward.
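
The $\sqrt{\mathsf{KL}}$ claim can be illustrated with a small simulation. The sketch below is not taken from the paper. It assumes a toy setting in which the reward under the reference policy is standard Gaussian (hence sub-Gaussian with $\sigma = 1$), uses the standard best-of-$n$ bound $\mathsf{KL} \le \log n - \frac{n-1}{n}$, and compares the simulated reward gain with the usual sub-Gaussian transportation bound $\sqrt{2\sigma^2\,\mathsf{KL}}$. All function names are hypothetical.

```python
import math
import random


def best_of_n_reward_gain(n, num_trials=100_000, seed=0):
    """Monte Carlo estimate of E[max of n rewards] - E[reward].

    Assumption (toy setting): rewards under the reference policy are
    i.i.d. N(0, 1), so the reference mean reward is 0.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_trials):
        total += max(rng.gauss(0.0, 1.0) for _ in range(n))
    return total / num_trials


def best_of_n_kl_bound(n):
    """Standard upper bound on KL(best-of-n || reference): log n - (n-1)/n."""
    return math.log(n) - (n - 1) / n


def sqrt_kl_reward_bound(kl, sigma=1.0):
    """Sub-Gaussian transportation bound on the reward gain: sqrt(2 sigma^2 KL)."""
    return math.sqrt(2.0 * sigma ** 2 * kl)


if __name__ == "__main__":
    for n in (2, 4, 16, 64):
        gain = best_of_n_reward_gain(n, num_trials=50_000)
        bound = sqrt_kl_reward_bound(best_of_n_kl_bound(n))
        print(f"n={n:3d}  simulated gain={gain:.3f}  sqrt-KL bound={bound:.3f}")
```

In this toy setting the simulated gain should stay below the bound for every $n$, and both grow slowly with $n$, which is consistent with the diminishing returns described above.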

Paper Structure (26 sections, 19 theorems, 109 equations, 1 figure, 1 table)

Introduction
The Alignment Problem
RLHF: A Constrained Policy Optimization Problem
Best of $n$ Policy Alignment
Reward Improvement Guarantees Through Transportation Inequalities
Notations
Scaling Laws in Alignment
Transportation Inequalities with $\mathsf{KL}$ Divergence
Tail Adaptive Transportation Inequalities with the Rényi Divergence
Preliminaries for the Rényi Divergence
Transportation Inequalities with Rényi Divergence
Transportation Inequality Transfer From Proxy to Golden Reward
Conclusion
Broader Impact and Limitations
Proofs For Best of $n$ Policy
...and 11 more sections

Key Result

Lemma 1

Let $E\sim \exp(1)$, let $E_1,\dots,E_n$ be i.i.d. $\exp(1)$ random variables, and let $E^{(n)}$ be their maximum. Then $\mathsf{KL}(E^{(n)}\,\|\,E) = \log n - \frac{n-1}{n}$.
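
The standard closed form for the $\mathsf{KL}$ divergence of the maximum of $n$ i.i.d. $\exp(1)$ variables from a single $\exp(1)$ variable, $\log n - \frac{n-1}{n}$, can be checked numerically. The sketch below is an illustration rather than the paper's proof. It relies on the fact that the maximum has density $n(1-e^{-x})^{n-1}e^{-x}$, so the log density ratio against $e^{-x}$ is $\log n + (n-1)\log(1-e^{-x})$, and it averages that ratio over samples of the maximum. Function names are hypothetical.

```python
import math
import random


def kl_max_exponential_closed_form(n):
    """Closed form log n - (n-1)/n for KL(max of n exp(1) || exp(1))."""
    return math.log(n) - (n - 1) / n


def kl_max_exponential_monte_carlo(n, num_samples=200_000, seed=0):
    """Monte Carlo estimate of the same KL divergence.

    Samples the maximum of n i.i.d. exp(1) draws and averages the log
    density ratio log n + (n-1) * log(1 - exp(-x)).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        x = max(rng.expovariate(1.0) for _ in range(n))
        # For n == 1 the second term vanishes; skip it to avoid log(0).
        tail = (n - 1) * math.log1p(-math.exp(-x)) if n > 1 else 0.0
        total += math.log(n) + tail
    return total / num_samples


if __name__ == "__main__":
    for n in (2, 4, 8, 32):
        exact = kl_max_exponential_closed_form(n)
        estimate = kl_max_exponential_monte_carlo(n, num_samples=50_000)
        print(f"n={n:2d}  closed form={exact:.4f}  Monte Carlo={estimate:.4f}")
```

The two columns should agree up to Monte Carlo error, which shrinks as `num_samples` grows.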

Figures (1)

Figure 1: Qualitative plot of centered rewards vs. $\mathsf{KL}$ for proxy and gold rewards, for both best-of-$n$ and RL policies (see Fig. 1(a) and 1(b) in gao2023scaling for scaling laws in policy alignment).

Theorems & Definitions (36)

Lemma 1: $\mathsf{KL}$ Between Exponential and Maximum of Exponentials
Theorem 1
proof
Proposition 1
Proposition 2
Proposition 3: Transportation Inequalities
Corollary 1: Expected Reward Improvement
Remark 1
Theorem 2: High Probability Empirical Reward Improvement For RL
Theorem 3
...and 26 more
