Utility-inspired Reward Transformations Improve Reinforcement Learning Training of Language Models

Roberto-Rafael Maura-Rivero; Chirag Nagpal; Roma Patel; Francesco Visin

TL;DR

The paper presents the Inada-inspired Reward Transformation (IRT) as a replacement for linear reward aggregation in RLHF for large language models. By transforming individual reward signals before aggregation, IRT emphasizes very low rewards and de-emphasizes already-satisfactory high rewards, guided by the Inada conditions and constant relative risk aversion (CRRA) utility functions. Empirical results on Gemma 2B show that applying IRT, particularly as a Partial IRT for harmlessness with optimized parameters, yields higher harmlessness and comparable helpfulness, with reduced reward hacking and negligible training overhead. The approach demonstrates the value of economics-inspired utility shaping for more robust and safer alignment of language models to human preferences. The work suggests a promising direction for integrating economic theory into RLHF and highlights opportunities for learning thresholds and exploring broader utility-function forms in future research.

Abstract

Current methods that train large language models (LLMs) with reinforcement learning feedback often resort to averaging the outputs of multiple reward functions during training. This overlooks crucial aspects of individual reward dimensions and inter-reward dependencies, which can lead to sub-optimal generations. In this work, we show that linear aggregation of rewards exhibits vulnerabilities that can lead to undesired properties in generated text. We then propose a transformation of reward functions, inspired by the economic theory of utility functions (specifically the Inada conditions), that enhances sensitivity to low reward values while diminishing sensitivity to already high values. We compare our approach to existing baseline methods that linearly aggregate rewards and show that the Inada-inspired reward feedback is superior to traditional weighted averaging. We quantitatively and qualitatively analyse the differences between the methods and find that models trained with Inada transformations score as more helpful while being less harmful.
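For readers unfamiliar with the economic background, the standard textbook statements of the two concepts named above are reproduced here as a reference sketch. The symbols $u$, $x$, $c$ and $\gamma$ follow common economics notation and are not taken from the paper, whose exact parametrization may differ. The Inada conditions require a utility function whose marginal utility is unbounded near zero and vanishes for large inputs, and the CRRA family is a common concave utility with a single curvature parameter.

```latex
% Inada conditions for a differentiable utility function u on (0, \infty)
u'(x) > 0, \qquad u''(x) < 0, \qquad
\lim_{x \to 0^{+}} u'(x) = +\infty, \qquad
\lim_{x \to \infty} u'(x) = 0.

% Constant relative risk aversion (CRRA) utility with curvature \gamma \ge 0
u(c) =
\begin{cases}
  \dfrac{c^{1-\gamma} - 1}{1-\gamma}, & \gamma \neq 1, \\[6pt]
  \ln c, & \gamma = 1,
\end{cases}
\qquad\text{with}\qquad
-\frac{c\,u''(c)}{u'(c)} = \gamma .
```

Intuitively, the steep slope near zero corresponds to heightened sensitivity to low rewards, and the flattening slope at large values corresponds to reduced sensitivity to rewards that are already high.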

Paper Structure (47 sections, 6 equations, 3 figures, 13 tables)

Introduction
Background & Preliminaries
Reinforcement Learning from Human Feedback (RLHF)
1. Supervised Fine-Tuning (SFT):
2. Reward Model Training:
3. Reinforcement Learning (RL) Fine-tuning:
Economic Theory
Utility Functions
Inada Conditions & Shaping Utilities
Relative Risk Aversion Utility Functions
Inada Inspired Reward Aggregation
Limitations of Linear Aggregation
Insensitivity to critically low rewards
Over-prioritizing high rewards
An Inada-Inspired Utility Function
...and 32 more sections

Figures (3)

Figure 1: Linear reward aggregation. (a) and (b) show two different responses with different helpfulness and harmfulness ratings (green and red) but the same aggregated reward (blue). The response in (a) is rated satisfactorily helpful (above the minimum helpfulness threshold, shown as a dotted green line) but also dangerously harmful (below the maximum harmfulness threshold, shown as a dotted red line), whereas the response in (b) does not cross the harmfulness threshold while remaining satisfactorily helpful.
Figure 2: Impact of the three hyperparameters of the Inada Reward Transformation. The reward threshold ($\tau$) determines the point at which the transformations governed by the other two hyperparameters apply. A larger penalty factor ($\beta$) amplifies the negative impact of rewards below the threshold, while a higher diminishing-returns factor ($\gamma$) de-emphasizes gains in already satisfactory values.
Figure 3: The Inada Reward Transformation (IRT). Rewards above the helpfulness threshold are discounted (intuitively, once the answer is helpful there is little gain in making it more helpful), while rewards below the harmfulness threshold are penalised further. As a result, the aggregated reward in (a) is much lower than the one in (b), which makes it possible to differentiate between the two cases.
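The behaviour described in the three captions can be illustrated with a minimal Python sketch. It is not the paper's formula: the function names, the piecewise form (a linear penalty scaled by beta below the threshold, a shifted CRRA-style curve above it), the default hyperparameter values, and the example reward numbers are all assumptions chosen only to reproduce the qualitative picture. Rewards are assumed to be ordered as [helpfulness, harmlessness], with lower harmlessness meaning a more harmful response.

```python
import math


def linear_aggregate(rewards):
    """Plain average of per-dimension rewards (the linear baseline)."""
    return sum(rewards) / len(rewards)


def irt_sketch(reward, tau=0.0, beta=3.0, gamma=0.5):
    """Hypothetical threshold-based transform; NOT the paper's exact formula.

    tau   : reward threshold separating "unsatisfactory" from "satisfactory"
    beta  : penalty factor applied to the shortfall below tau (assumed linear)
    gamma : diminishing-returns strength above tau (assumed CRRA-style curve)
    """
    x = reward - tau
    if x < 0:
        # Below the threshold: amplify the shortfall.
        return beta * x
    if math.isclose(gamma, 1.0):
        # gamma == 1 is the logarithmic limit of the CRRA family.
        return math.log1p(x)
    # Above the threshold: concave growth with slope 1 at x == 0.
    return ((1.0 + x) ** (1.0 - gamma) - 1.0) / (1.0 - gamma)


def irt_aggregate(rewards, **hyperparams):
    """Transform each reward first, then average the transformed values."""
    return linear_aggregate([irt_sketch(r, **hyperparams) for r in rewards])


# Made-up example mirroring Figures 1 and 3: [helpfulness, harmlessness].
response_a = [1.0, -0.5]   # very helpful, but below the harmlessness threshold
response_b = [0.25, 0.25]  # moderately helpful and above both thresholds

print("linear:", linear_aggregate(response_a), linear_aggregate(response_b))
print("sketch:", irt_aggregate(response_a), irt_aggregate(response_b))
```

Under these assumed values the linear average cannot tell the two responses apart, while the transformed aggregate ranks response (a) below response (b), which is the distinction Figure 3 describes. A per-reward threshold, as the captions suggest, could be modelled by calling irt_sketch with a different tau for each dimension.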
