Reward Generalization in RLHF: A Topological Perspective

Tianyi Qiu, Fanzhi Zeng, Jiaming Ji, Dong Yan, Kaile Wang, Jiayi Zhou, Yang Han, Josef Dai, Xuehai Pan, Yaodong Yang

TL;DR

This work analyzes reward generalization in RLHF through the topology of information flow, introducing a macro-level autoencoding view and micro-level induced Bayesian networks (IBNs) to explain how preference data shapes reward model (RM) and language model (LM) behavior. It proposes reward modeling from tree-structured preference information, showing a theoretical reduction in reward uncertainty by a factor of up to $\Theta\left(\frac{\log n}{\log\log n}\right)$, where $n$ is the dataset size, and an average win rate of 65% against baselines across three NLP tasks. The tree-based topology encodes richer dependencies among responses than chain-based data, enabling more data-efficient learning and better generalization of rewards across diverse contexts. Together, the macro- and micro-level perspectives provide a principled pathway to design reward topologies that improve RLHF without increasing annotation effort, with practical implications for safer, more reliable LLMs.

Abstract

Existing alignment methods share a common topology of information flow, where reward information is collected from humans, modeled with preference learning, and used to tune language models. However, this shared topology has not been systematically characterized, nor have its alternatives been thoroughly explored, leaving the problems of low data efficiency and unreliable generalization unaddressed. As a solution, we introduce a theory of reward generalization in reinforcement learning from human feedback (RLHF), focusing on the topology of information flow at both macro and micro levels. At the macro level, we portray the RLHF information flow as an autoencoding process over behavior distributions, formalizing the RLHF objective of distributional consistency between human preference and model behavior. At the micro level, we present induced Bayesian networks to model the impact of dataset topologies on reward generalization. Combining analysis on both levels, we propose reward modeling from tree-structured preference information. It is shown to reduce reward uncertainty by up to $\Theta(\log n/\log\log n)$ times compared to baselines, where $n$ is the dataset size. Validation on three NLP tasks shows that it achieves an average win rate of 65% against baselines, thus improving reward generalization for free via topology design, while reducing the amount of data requiring annotation.
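
To give a feel for how slowly the factor $\log n/\log\log n$ grows, the following minimal sketch evaluates it for a few dataset sizes. The function name and the chosen values of $n$ are illustrative assumptions, not taken from the paper, and the $\Theta$ notation hides constant factors, so the numbers show only the shape of the bound rather than a predicted improvement.

```python
import math


def uncertainty_reduction_factor(n: int) -> float:
    """Return log(n) / log(log(n)), the growth rate inside the Theta bound.

    Constants hidden by the Theta notation are ignored, so this is only the
    shape of the bound, not an estimate for any particular dataset.
    """
    if n <= math.e:
        # log(log(n)) is zero or negative for n <= e, so the ratio is undefined.
        raise ValueError("n must exceed e so that log(log(n)) is positive")
    return math.log(n) / math.log(math.log(n))


# Hypothetical dataset sizes chosen only for illustration.
for n in (10**3, 10**4, 10**5, 10**6):
    factor = uncertainty_reduction_factor(n)
    print(f"n = {n:>9,d}   log n / log log n = {factor:.2f}")
```

The factor increases with $n$ but only slowly, which is why the abstract states the gain as an upper bound ("up to") rather than a fixed multiple.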

Paper Structure

This paper contains 55 sections, 15 theorems, 95 equations, 7 figures, 12 tables, 2 algorithms.

Key Result

theorem 1

If the reward modeling process (i.e., the encoding process) satisfies that [...] and policy optimization (i.e., the decoding process) performs $\beta$-entropy-regularized RL, i.e., [...] then, [...] uniformly for all $(y_1,y_2)\in\mathcal{Y}^2$ and [...] for all $y\in\mathcal{Y}$.
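
The theorem refers to $\beta$-entropy-regularized RL, but its exact objective is not reproduced here. As a hedged illustration, the sketch below assumes the standard form of that objective over a finite response set, $\max_p \mathbb{E}_{y\sim p}[r(y)] + \beta H(p)$, whose maximizer is known to be the softmax $p(y)\propto\exp(r(y)/\beta)$. The reward values, function names, and the finite response set are all assumptions made for illustration, not the paper's setup.

```python
import math
import random


def softmax_policy(rewards, beta):
    """Closed-form maximizer of E_p[r] + beta * H(p) over distributions p."""
    m = max(rewards)  # subtract the max for numerical stability
    weights = [math.exp((r - m) / beta) for r in rewards]
    total = sum(weights)
    return [w / total for w in weights]


def regularized_objective(policy, rewards, beta):
    """Expected reward plus beta times the Shannon entropy (natural log)."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    entropy = -sum(p * math.log(p) for p in policy if p > 0)
    return expected_reward + beta * entropy


def random_policy(k, rng):
    """Draw an arbitrary distribution over k responses for comparison."""
    raw = [rng.random() + 1e-9 for _ in range(k)]
    total = sum(raw)
    return [x / total for x in raw]


rng = random.Random(0)
rewards = [1.0, 0.2, -0.5, 0.7]  # hypothetical rewards for four responses
beta = 0.5

best = softmax_policy(rewards, beta)
best_value = regularized_objective(best, rewards, beta)

# Largest objective value reached by 1000 randomly drawn policies.
best_random = max(
    regularized_objective(random_policy(len(rewards), rng), rewards, beta)
    for _ in range(1000)
)
print(f"softmax objective: {best_value:.4f}, best random policy: {best_random:.4f}")
```

Under this assumed objective, the log-probability ratio of any two responses equals their reward difference divided by $\beta$, which is the kind of link between reward differences and behavior distributions that the autoencoding view relies on.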

Figures (7)

  • Figure 1: The RLHF process is conceptualized as an autoencoding process. Encoding: Human preferences are compressed into the RM through data collection and preference labeling followed by RM training. Decoding: The reinforcement learning process restores a language model policy based on reward signals from the reward model. The entire process aims to achieve consistency between human preference and model behavior.
  • Figure 2: Tree-based and chain-based information topologies of the preference dataset $D$. The root node represents the shared prompt, while a Text node represents a segment of text serving as a constituent of full responses. The chain-based topology, highlighted in red, generates responses independently. The tree-based topology, highlighted in blue, generates a prefix tree (where root-to-leaf paths correspond to full responses) instead of independent responses, creating a dependence structure among the resulting responses. See the appendix on tree and chain examples.
  • Figure 3: The induced Bayesian network (IBN) that models reward generalization. Nodes represent possible responses, and edges represent reward correlations due to inductive biases (black) or pairwise comparison data (purple), each associated with a conditional reward distribution. Thick segments mark an inference path, providing evidence on the preferability of $y_2$ compared to $y_1$. Dashed curves carve out clustering structures.
  • Figure 4: RFT results for different preference dataset settings. In our tree-structured QA datasets, responses are labeled as complete or incomplete depending on whether they extend from the root to a leaf or to an internal node (see the annotation appendix for details).
  • Figure 5: Comparison of models fine-tuned by PPO with tree-based and chain-based RMs across 7 epochs.
  • ...and 2 more figures
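
The contrast between the two topologies in Figure 2 can be sketched with a toy data structure. This is a minimal illustration under stated assumptions, not the authors' data-generation procedure: the `Node` class, the placeholder segment labels, and the fixed depth and branching factor are all hypothetical. It builds a complete prefix tree whose root-to-leaf paths are full responses, then compares how many segments pairs of responses share under the tree-based and chain-based layouts.

```python
from dataclasses import dataclass, field
from itertools import combinations


@dataclass
class Node:
    text: str                                      # one text segment
    children: list = field(default_factory=list)   # child Node objects


def build_tree(depth, branching, label=""):
    """Build a complete prefix tree with placeholder segment labels."""
    node = Node(text=f"seg{label}" if label else "<prompt>")
    if depth > 0:
        node.children = [
            build_tree(depth - 1, branching, f"{label}.{i}" if label else str(i))
            for i in range(branching)
        ]
    return node


def responses(node, prefix=()):
    """Yield every root-to-leaf path as a tuple of segments (prompt excluded)."""
    path = prefix if node.text == "<prompt>" else prefix + (node.text,)
    if not node.children:
        yield path
    for child in node.children:
        yield from responses(child, path)


def shared_prefix_len(a, b):
    """Number of leading segments two responses have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


# Tree-based: four responses that share prefixes through the tree.
tree_responses = list(responses(build_tree(depth=2, branching=2)))

# Chain-based: four responses generated independently, sharing no segments.
chain_responses = [(f"resp{i}.seg0", f"resp{i}.seg1") for i in range(4)]

for name, group in (("tree", tree_responses), ("chain", chain_responses)):
    overlap = sum(shared_prefix_len(a, b) for a, b in combinations(group, 2))
    print(f"{name}: {len(group)} responses, total shared prefix segments = {overlap}")
```

In this toy setup the tree-based responses overlap on shared prefixes while the chain-based ones do not, which mirrors the dependence structure among responses that the caption describes.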

Theorems & Definitions (36)

  • theorem 1
  • definition 1: Induced Bayesian Network
  • remark 1: RM Inference and IBN Inference are Analogous
  • definition 2: Structural Function
  • remark 2: Intuition on the Structural Function
  • theorem 2: RM Uncertainty in Chain-Based and Tree-Based Datasets
  • corollary 1
  • definition 3: Hypothesis Distribution
  • definition 4: Inductive Bias Edges
  • definition 5: Induced Bayesian Network
  • ...and 26 more