Learning Explainable Dense Reward Shapes via Bayesian Optimization

Ryan Koo, Ian Yang, Vipul Raheja, Mingyi Hong, Kwang-Sung Jun, Dongyeop Kang

TL;DR

This work tackles the sparse token-level feedback problem in RLHF for language models by proposing explainability-based token rewards and a bilevel optimization framework to learn shaping weights. A dense reward is constructed as $r'(s,a)=W^\top \boldsymbol{\mathcal{E}}\cdot r(s,a)$, where token-level attributions $\boldsymbol{\mathcal{E}}$ (e.g., SHAP/LIME) are blended with the original reward under weights $W$ that sum to 1; Bayesian Optimization (BO) guides the search over $W$ using a Gaussian Process surrogate and the log Noisy Expected Improvement acquisition function. The outer loop optimizes the reward-shaping weights, while the inner loop solves the standard RLHF objective, yielding a policy that remains optimal for the original reward due to potential-based shaping (policy invariance). Empirically, explainable reward shaping accelerates learning and improves downstream task performance, while BO helps balance conflicting token-level signals, although its search efficiency is limited when only a few trials are available. The approach offers a principled path to denser credit assignment in RLHF and highlights the importance of evolving explainability signals and hyperparameter-tuning strategies for robust LLM alignment.
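As a rough illustration of the formula $r'(s,a)=W^\top \boldsymbol{\mathcal{E}}\cdot r(s,a)$, the sketch below redistributes one sequence-level reward over tokens. It assumes that each attribution signal is a row of raw per-token scores and that rows are softmax-normalized over tokens before blending (the normalization is taken from the Figure 2 caption; how the paper actually builds $\boldsymbol{\mathcal{E}}$ is not specified here). The function names and the score values are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dense_reward(attributions, weights, scalar_reward):
    """Redistribute one sequence-level reward over T tokens.

    attributions:  (K, T) array, one row of raw token scores per signal
    weights:       (K,) array that sums to 1
    scalar_reward: reward-model score r for the whole sequence
    """
    attributions = np.asarray(attributions, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights must sum to 1")
    shares = softmax(attributions)   # assumption: each row sums to 1 over tokens
    blended = weights @ shares       # W^T E, shape (T,)
    return blended * scalar_reward   # r'_t = (W^T E)_t * r

# Hypothetical scores for a 4-token response.
E = np.array([[0.2, 1.5, -0.3, 0.1],   # made-up SHAP scores
              [0.0, 0.9, 0.4, 0.2]])   # made-up LIME scores
W = np.array([0.7, 0.3])
r_dense = dense_reward(E, W, scalar_reward=2.0)
print(r_dense, r_dense.sum())
```

Under these assumptions the per-token rewards sum back to the original scalar reward, because every normalized row and the weight vector each sum to 1.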

Abstract

Current reinforcement learning from human feedback (RLHF) pipelines for large language model (LLM) alignment typically assign scalar rewards to sequences, using the final token as a surrogate indicator for the quality of the entire sequence. However, this leads to sparse feedback and suboptimal token-level credit assignment. In this work, we frame reward shaping as an optimization problem focused on token-level credit assignment. We propose a reward-shaping function leveraging explainability methods such as SHAP and LIME to estimate per-token rewards from the reward model. To learn parameters of this shaping function, we employ a bilevel optimization framework that integrates Bayesian Optimization and policy training to handle noise from the token reward estimates. Our experiments show that achieving a better balance of token-level reward attribution leads to performance improvements over baselines on downstream tasks and reaches an optimal policy faster during training. Furthermore, we show theoretically that explainability methods that are additive feature attribution functions preserve the optimal policy of the original reward.

Paper Structure

This paper contains 23 sections, 1 theorem, 21 equations, 7 figures, 2 tables, 1 algorithm.

Key Result

Proposition 1

Given a policy $\pi_\theta$ and a Markov Decision Process $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{P}, \gamma, R)$, any reward shaping function $R'$ in the family of additive feature attribution methods follows a potential-based shaping function and then the optimal $\pi_\theta$ for $R'$ i
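For context, the classical potential-based shaping term has the form $F(s, s') = \gamma\,\Phi(s') - \Phi(s)$ for some potential function $\Phi$ over states. The sketch below is a numeric check of why such a term leaves the ranking of policies unchanged: along any trajectory the added terms telescope, so the discounted return shifts by an amount that depends only on the start and end states. It illustrates the general shaping result, not the paper's proof, and the potential values are hypothetical.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Sum of gamma**t * rewards[t] over the trajectory."""
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.dot(discounts, rewards))

def shape_rewards(rewards, potentials, gamma):
    """Add F_t = gamma * Phi(s_{t+1}) - Phi(s_t) to each step.

    potentials holds one value per visited state, so it has
    len(rewards) + 1 entries.
    """
    rewards = np.asarray(rewards, dtype=float)
    phi = np.asarray(potentials, dtype=float)
    return rewards + gamma * phi[1:] - phi[:-1]

gamma = 0.9
rewards = [0.0, 0.0, 0.0, 1.0]           # sparse: reward only at the final step
potentials = [0.1, 0.4, 0.2, 0.7, 0.0]   # hypothetical Phi values, terminal Phi = 0
shaped = shape_rewards(rewards, potentials, gamma)

gap = discounted_return(shaped, gamma) - discounted_return(rewards, gamma)
expected_gap = gamma ** len(rewards) * potentials[-1] - potentials[0]
print(gap, expected_gap)
```

The gap equals $\gamma^{T}\Phi(s_T) - \Phi(s_0)$, which is the same for every policy started from $s_0$ when the terminal potential is zero, so the policy that maximizes the shaped return also maximizes the original one.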

Figures (7)

  • Figure 1: Overview of the bilevel optimization setup to find the best reward shape and the optimal policy. The pipeline involves an outer and inner training loop, where the outer step optimizes the Bayesian optimization model and samples the weights for our reward shape. The inner step optimizes the classic RLHF objective.
  • Figure 2: Redistribution sequence of the scalar reward prediction over the explanation feature attributions after softmax normalization. A darker red highlights a much stronger positive contribution, while a deeper blue indicates a more negative contribution.
  • Figure 3: (Left) The mean training reward per timestep with increasing BO dimensionality. (Middle) The mean validation reward over each BO trial. The highlighted dots indicate the best validation reward received at trial $n$. (Right) The average dense reward attribution over each trial for SHAP + ATTN. The highlighted row indicates the shape in trial 17 that received the highest validation reward.
  • Figure 4: Helpfulness. (Left) The average training reward per timestep. (Right) The average value head loss per timestep. The shading represents the standard error ($95\%$ confidence interval) as training progresses.
  • Figure 5: (Top) The weight transition between trials for SHAP scores. (Bottom) The weight transition between trials for LIME scores. The black boxes indicate the "best weights" sampled by the BO model.
  • ...and 2 more figures
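
The outer/inner structure described in the Figure 1 caption can be sketched as a simple loop. This is a minimal sketch under strong assumptions: random search over the simplex stands in for the Bayesian Optimization model (no Gaussian Process or acquisition function is used here), and `train_and_validate` is a hypothetical toy objective standing in for RLHF policy training followed by validation. All names and numbers are made up for illustration.

```python
import numpy as np

def train_and_validate(weights, rng):
    """Hypothetical stand-in for the inner loop.

    A real inner step would train a policy on the RLHF objective with the
    shaped reward and return its validation reward. A noisy toy quadratic
    plays that role here so the loop runs end to end.
    """
    target = np.array([0.6, 0.3, 0.1])   # made-up "good" reward shape
    noise = rng.normal(scale=0.01)       # mimics noisy validation reward
    return -float(np.sum((weights - target) ** 2)) + noise

def outer_loop(n_trials, n_signals, seed=0):
    """Random search over the simplex (NOT Bayesian Optimization)."""
    rng = np.random.default_rng(seed)
    best_w, best_score, history = None, -np.inf, []
    for trial in range(n_trials):
        w = rng.dirichlet(np.ones(n_signals))   # non-negative, sums to 1
        score = train_and_validate(w, rng)      # inner step
        history.append((trial, w, score))
        if score > best_score:
            best_w, best_score = w, score
    return best_w, best_score, history

best_w, best_score, history = outer_loop(n_trials=25, n_signals=3)
print(best_w, best_score)
```

In a BO version, the Dirichlet draw would be replaced by a proposal from the surrogate model fitted to `history`, which is where a noise-aware acquisition function becomes relevant.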

Theorems & Definitions (3)

  • Example 1
  • Proposition 1
  • proof