InfoRM: Mitigating Reward Hacking in RLHF via Information-Theoretic Reward Modeling

Yuchun Miao, Sen Zhang, Liang Ding, Rong Bao, Lefei Zhang, Dacheng Tao

TL;DR

This work introduces InfoRM, an information-theoretic reward modeling framework for RLHF that uses a variational information bottleneck objective $I(\boldsymbol{S};\boldsymbol{Y}) - \beta I(\boldsymbol{X};\boldsymbol{S}|\boldsymbol{Y})$ to filter out preference-irrelevant information and improve generalization against reward misgeneralization. A key innovation is the observation that reward overoptimization correlates with outliers in the IB latent space, leading to the Cluster Separation Index (CSI) for online detection and potential mitigation such as early stopping. Through extensive simulation and real-world RLHF experiments across RM scales from 70M to 7B, InfoRM demonstrates improved stability, reduced overoptimization, and better generalization to distribution shifts, while preserving or enhancing gold-objective performance. The approach provides a practical, model-agnostic tool that complements KL-based methods and offers a data-driven signal for online mitigation, with broad implications for safer and more robust RLHF deployments.
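As a rough illustration of how an objective of the form $I(\boldsymbol{S};\boldsymbol{Y}) - \beta I(\boldsymbol{X};\boldsymbol{S}|\boldsymbol{Y})$ is commonly turned into a trainable loss, the sketch below uses the standard deep variational information bottleneck surrogate: a Bradley-Terry preference loss stands in for the $I(\boldsymbol{S};\boldsymbol{Y})$ term, and a KL divergence from a diagonal-Gaussian encoder to a standard normal prior stands in for the compression term. These choices, the linear reward head, and all function names are assumptions made for illustration; the summary above does not confirm that this is the authors' exact formulation.

```python
import math
import random


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))


def sample_latent(mu, log_var, rng):
    """Reparameterised sample s = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def reward_from_latent(s, w, b=0.0):
    """Hypothetical linear reward head applied to the latent code."""
    return sum(wi * si for wi, si in zip(w, s)) + b


def preference_nll(r_chosen, r_rejected):
    """Bradley-Terry negative log-likelihood: -log sigmoid(r_chosen - r_rejected)."""
    diff = r_chosen - r_rejected
    # Numerically stable evaluation of log(1 + exp(-diff)).
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


def vib_reward_loss(enc_chosen, enc_rejected, w, beta, rng):
    """Preference loss plus beta-weighted compression penalty for one pair.

    enc_chosen / enc_rejected are (mu, log_var) pairs that an encoder
    (not shown here) would produce for the chosen and rejected responses.
    """
    mu_c, lv_c = enc_chosen
    mu_r, lv_r = enc_rejected
    r_c = reward_from_latent(sample_latent(mu_c, lv_c, rng), w)
    r_r = reward_from_latent(sample_latent(mu_r, lv_r, rng), w)
    nll = preference_nll(r_c, r_r)
    kl = kl_to_standard_normal(mu_c, lv_c) + kl_to_standard_normal(mu_r, lv_r)
    return nll + beta * kl


if __name__ == "__main__":
    rng = random.Random(0)
    chosen = ([0.8, -0.1], [-1.0, -1.0])
    rejected = ([-0.5, 0.2], [-1.0, -1.0])
    print(vib_reward_loss(chosen, rejected, w=[1.0, 0.5], beta=0.1, rng=rng))
```

In this sketch, a larger `beta` pushes the latent distribution toward the prior and so discards more input information, while `beta = 0` reduces the loss to an ordinary preference loss on a stochastic latent.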

Abstract

Despite the success of reinforcement learning from human feedback (RLHF) in aligning language models with human values, reward hacking, also termed reward overoptimization, remains a critical challenge. This issue primarily arises from reward misgeneralization, where reward models (RMs) compute reward using spurious features that are irrelevant to human preferences. In this work, we tackle this problem from an information-theoretic perspective and propose a framework for reward modeling, namely InfoRM, by introducing a variational information bottleneck objective to filter out irrelevant information. Notably, we further identify a correlation between overoptimization and outliers in the IB latent space of InfoRM, establishing it as a promising tool for detecting reward overoptimization. Inspired by this finding, we propose the Cluster Separation Index (CSI), which quantifies deviations in the IB latent space, as an indicator of reward overoptimization to facilitate the development of online mitigation strategies. Extensive experiments on a wide range of settings and RM scales (70M, 440M, 1.4B, and 7B) demonstrate the effectiveness of InfoRM. Further analyses reveal that InfoRM's overoptimization detection mechanism is not only effective but also robust across a broad range of datasets, signifying a notable advancement in the field of RLHF. The code will be released upon acceptance.
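To make the idea of "outliers in the IB latent space" concrete, the following sketch scores how far the latents of policy-generated samples drift from a reference set of latents (for example, samples from the pre-RL model). It is a deliberately simplified stand-in and is not the paper's definition of CSI, which this summary does not spell out. The centroid-distance rule, the `quantile` threshold, and all names are assumptions made for illustration.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]


def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def outlier_fraction(reference_latents, policy_latents, quantile=0.95):
    """Fraction of policy latents lying farther from the reference centroid
    than the given quantile of the reference set's own distances."""
    center = centroid(reference_latents)
    ref_dists = sorted(euclidean(p, center) for p in reference_latents)
    idx = min(len(ref_dists) - 1, int(quantile * len(ref_dists)))
    threshold = ref_dists[idx]
    flagged = sum(1 for p in policy_latents if euclidean(p, center) > threshold)
    return flagged / len(policy_latents)


if __name__ == "__main__":
    reference = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    policy = [[0.5, 0.5], [5.0, 5.0]]
    print(outlier_fraction(reference, policy))
```

A score of this kind could be tracked across RL checkpoints, with a sustained rise treated as a possible warning sign. The cluster-based CSI described in the abstract plays that role in the paper.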

Paper Structure (40 sections, 1 theorem, 21 equations, 29 figures, 3 tables, 3 algorithms)

Key Result

Theorem 1

Let $|S|$ be the cardinality of the latent representation space of InfoRM, $l(\cdot)$ be a loss function following a sub-$\sigma$-Gaussian distribution, $X$ be the reward model input, $S$ be the latent representation of InfoRM, and $\Theta$ be the network parameters. Then we have the following upper bound [...] where $L$, $\eta$, and $n$ are the effective number of layers causing information loss, a constant ...

Figures (29)

  • Figure 1: Comparison between standard RM and our information-theoretic reward model (InfoRM). InfoRM distinguishes itself by enhancing RM generalizability through mutual information modeling. Additionally, a distinct feature of InfoRM is its overoptimization detection mechanism, which can guide parameter selection and algorithm design in subsequent RLHF. Specifically, the RM encoder is derived from the standard RM, with modification to the final layer.
  • Figure 2: Response comparison on Anthropic-Helpful between RLHF models using our InfoRM and other baselines, assessed by GPT-4, demonstrating the superior performance of our method.
  • Figure 3: An example of reward overoptimization in RLHF characterized by a declining gold score (i.e., actual human preference) and a rising proxy score (i.e., proxy RM preference).
  • Figure 4: Simulated RLHF results for different proxy RMs (1.4B). Solid and dashed lines represent the gold and proxy scores, respectively. In later RL stages, as KL divergence increases, Standard RM shows a declining gold score and a rising proxy score, indicating overoptimization. Conversely, our InfoRM maintains consistent growth in both scores, effectively mitigating overoptimization.
  • Figure 5: Final gold rewards in simulated RLHF experiments. Left: Using proxy RMs with varying parameter sizes. Right: Conducting RL on Alpaca (in-distribution) and Flan (out-of-distribution). The proxy RMs are all trained on the same simulated preference dataset with 25% label noise.
  • ...and 24 more figures
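
The pattern described in the captions of Figures 3 and 4, a declining gold score alongside a rising proxy score, can be checked mechanically when both curves are available, as in the simulated setting. The sketch below is one simple way to do so; the `patience` rule and the function name are assumptions for illustration, not a procedure taken from the paper.

```python
def overoptimization_onset(gold_scores, proxy_scores, patience=2):
    """Return the index of the last checkpoint before the gold score starts
    falling for `patience` consecutive checkpoints while the proxy score
    keeps rising. Return None if that pattern never occurs."""
    if len(gold_scores) != len(proxy_scores):
        raise ValueError("gold_scores and proxy_scores must have equal length")
    streak = 0
    for i in range(1, len(gold_scores)):
        gold_down = gold_scores[i] < gold_scores[i - 1]
        proxy_up = proxy_scores[i] > proxy_scores[i - 1]
        streak = streak + 1 if (gold_down and proxy_up) else 0
        if streak >= patience:
            return i - patience
    return None


if __name__ == "__main__":
    gold = [0.1, 0.3, 0.5, 0.4, 0.3, 0.2]
    proxy = [0.1, 0.3, 0.5, 0.7, 0.9, 1.1]
    print(overoptimization_onset(gold, proxy))
```

In real-world RLHF a gold score is usually unavailable, which is the gap a latent-space indicator such as CSI is meant to fill.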

Theorems & Definitions (1)

  • Theorem 1