BRAIn: Bayesian Reward-conditioned Amortized Inference for natural language generation from feedback

Gaurav Pandey; Yatin Nandwani; Tahira Naseem; Mayank Mishra; Guangxuan Xu; Dinesh Raghu; Sachindra Joshi; Asim Munawar; Ramón Fernandez Astudillo

TL;DR

BRAIn tackles the high gradient variance in distribution-matching RLHF by formulating a reward-conditioned posterior $p(y|x,G=1)$ via Bayes' rule from the prior $p(y|x)$ and reward model $p(G=1|y,x)$. It introduces a self-normalized baseline to derive the BRAIn gradient, an unbiased estimator of the gradient of a self-normalized KL divergence, and shows that DPO-sft is a special case of BRAIn under certain constraints. The Bradley-Terry modeling of preferences links the posterior to the PPO-optimal policy, bridging distribution matching and DPO, and the method achieves state-of-the-art results on the TL;DR summarization and Anthropic HH tasks. Ablations show the importance of the self-normalized baseline and demonstrate that increasing the number of outputs per prompt and relaxing the restrictive assumptions yield further improvements over DPO-sft.
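
For reference, the posterior named above can be written out with Bayes' rule. This is a sketch in the summary's own symbols; the explicit sum over outputs $y'$ for the normalizer $p(G=1|x)$ is an assumption about how the marginal is taken, not a formula quoted from the paper.

$$p(y \mid x, G=1) = \frac{p(y \mid x)\, p(G=1 \mid y, x)}{p(G=1 \mid x)}, \qquad p(G=1 \mid x) = \sum_{y'} p(y' \mid x)\, p(G=1 \mid y', x)$$

The normalizer does not depend on $y$, so it cancels whenever weights over a fixed set of sampled outputs are normalized to sum to one.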

Abstract

Distribution matching methods for language model alignment such as Generation with Distributional Control (GDC) and Distributional Policy Gradient (DPG) have not received the same level of attention in reinforcement learning from human feedback (RLHF) as contrastive methods such as Sequence Likelihood Calibration (SLiC), Direct Preference Optimization (DPO) and its variants. We identify high variance of the gradient estimate as the primary reason for the lack of success of these methods and propose a self-normalized baseline to reduce the variance. We further generalize the target distribution in DPG, GDC and DPO by using Bayes' rule to define the reward-conditioned posterior. The resulting approach, referred to as BRAIn (Bayesian Reward-conditioned Amortized Inference), acts as a bridge between distribution matching methods and DPO and significantly outperforms prior art in summarization and Anthropic HH tasks.
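
The self-normalized baseline can be illustrated with a minimal Python sketch. It assumes that each of several sampled outputs per prompt gets a coefficient equal to the difference between two self-normalized importance weights: one for the posterior and one for the current policy, both relative to a proposal distribution. The function names, the argument layout, and this exact form of the coefficients are assumptions made for illustration; the paper's estimator is the one in its eq:grad_brain, which is not reproduced in this summary.

```python
import math


def self_normalized_weights(log_target, log_proposal):
    """Return importance weights target/proposal, normalized to sum to 1.

    Inputs are per-sample log-probabilities (the target may be unnormalized).
    The maximum log-ratio is subtracted before exponentiating for stability.
    """
    log_ratios = [t - p for t, p in zip(log_target, log_proposal)]
    shift = max(log_ratios)
    exps = [math.exp(r - shift) for r in log_ratios]
    total = sum(exps)
    return [e / total for e in exps]


def baseline_subtracted_coefficients(log_prior, log_reward, log_policy, log_proposal):
    """Hypothetical per-sample coefficients: posterior weight minus policy weight.

    log_prior    : log p(y_i | x) under the prior (e.g. the SFT model)
    log_reward   : log p(G=1 | y_i, x) from the reward model
    log_policy   : log q_theta(y_i | x) under the policy being trained
    log_proposal : log-probability of y_i under the sampling distribution
    """
    # Unnormalized log posterior; the constant log p(G=1 | x) cancels
    # once the weights are self-normalized.
    log_posterior = [a + b for a, b in zip(log_prior, log_reward)]
    alpha_hat = self_normalized_weights(log_posterior, log_proposal)
    beta_hat = self_normalized_weights(log_policy, log_proposal)
    return [a - b for a, b in zip(alpha_hat, beta_hat)]


if __name__ == "__main__":
    # Made-up log-probabilities for three sampled outputs of one prompt.
    coeffs = baseline_subtracted_coefficients(
        log_prior=[-2.0, -3.0, -2.5],
        log_reward=[-0.1, -1.5, -0.7],
        log_policy=[-2.2, -2.8, -2.4],
        log_proposal=[-2.0, -3.0, -2.5],
    )
    print(coeffs)
```

Under this assumed form the coefficients always sum to zero across the samples of a prompt, and they vanish when the policy already matches the posterior up to a constant. Both properties are consistent with the variance-reduction role the abstract attributes to the baseline.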

Paper Structure (24 sections, 9 theorems, 40 equations, 4 figures, 4 tables, 1 algorithm)

Introduction
Related Works
Notation
Approach
A formal justification of the BRAIn gradient estimate:
Connection with existing RLHF methods
Posterior and PPO-optimal policy
Bradley-Terry Preference Model for LLMs
DPO-sft as a special case of BRAIn
Experimental Setup
Experimental Results
Comparison with baselines
KL-reward frontier
Role played by self-normalized baseline
Bridging the gap between DPO-sft and BRAIn
...and 9 more sections

Key Result

Theorem 4.1

The BRAIn gradient estimate defined in eq:grad_brain is an unbiased estimator of the gradient (w.r.t. $\theta$) of the negative self-normalized KL divergence between the posterior $p(y | x, G=1)$ and the training policy $q_{\theta}(y | x)$ defined in eq:snkl. Here, the dependence of KL divergence on $\t

Figures (4)

Figure 1: BRAIn acts as a bridge between distribution matching methods (GDC, Khalifa et al., 2020, and GDC++, Korbak et al., 2022) and DPO (Rafailov et al., 2023), specifically DPO-sft, where the samples come from the base/SFT policy. The values $\alpha_i$, $\hat{\alpha}_i$ and $\hat{\beta}_i$ are as defined in equations eq:alpha and eq:beta, whereas $Z$ is the normalization constant of the target. Note that the proposal distribution in the distribution matching methods and BRAIn is chosen differently.
Figure 2: The KL-reward frontier of BRAIn and DPO-sft
Figure 3: Variance of GDC, GDC++ and BRAIn gradient estimates
Figure 4: Plot of Win-rate Against Gold as a function of the Number of Samples per Prompt.

Theorems & Definitions (16)

Definition 4.0
Theorem 4.1
Theorem 4.2
Proposition 5.0
Theorem 5.1
Theorem 5.2
proof
Definition 1.0
Theorem 1.1
proof
...and 6 more
