Learning Explainable Dense Reward Shapes via Bayesian Optimization
Ryan Koo, Ian Yang, Vipul Raheja, Mingyi Hong, Kwang-Sung Jun, Dongyeop Kang
TL;DR
This work tackles the problem of sparse token-level feedback in RLHF for language models by proposing explainability-based token rewards and a bi-level optimization framework that learns the shaping weights. A dense reward is constructed as $r'(s,a)=W^\top \boldsymbol{\mathcal{E}}\cdot r(s,a)$, where token-level attributions $\boldsymbol{\mathcal{E}}$ (e.g., SHAP/LIME) are blended with the original reward under weights $W$ that sum to 1; Bayesian Optimization (BO) guides the search over $W$ using a Gaussian Process and log Noisy Expected Improvement. The outer loop optimizes the reward-shaping weights, while the inner loop solves the standard RLHF objective, yielding a policy that remains optimal for the original reward due to potential-based shaping (policy invariance). Empirically, explainable reward shaping accelerates learning and improves downstream task performance, and BO helps balance conflicting token-level signals, although its search is less efficient when only a limited number of trials is available. The approach offers a principled path to denser credit assignment in RLHF and highlights the importance of explainability signals and hyperparameter-tuning strategies for robust LLM alignment.
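The dense reward $r'(s,a)=W^\top \boldsymbol{\mathcal{E}}\cdot r(s,a)$ can be illustrated with a minimal NumPy sketch. Several choices here are assumptions made for illustration and are not taken from the paper: the function name `shape_rewards`, the layout of $\boldsymbol{\mathcal{E}}$ as a (sources x tokens) matrix, the use of a one-hot last-token row to represent the original sparse reward, and the made-up attribution values, each row of which is assumed to be already normalized to sum to 1.

```python
import numpy as np

def shape_rewards(attributions, weights, sequence_reward):
    """Blend per-token attribution rows into one dense reward vector.

    attributions: (K, T) array, one row per attribution source (assumed layout).
    weights: (K,) array of shaping weights W that must sum to 1.
    sequence_reward: scalar reward r assigned to the whole sequence.
    """
    E = np.asarray(attributions, dtype=float)
    W = np.asarray(weights, dtype=float)
    if E.ndim != 2 or W.shape != (E.shape[0],):
        raise ValueError("expected attributions of shape (K, T) and weights of shape (K,)")
    if not np.isclose(W.sum(), 1.0):
        raise ValueError("weights must sum to 1")
    # r'(token t) = (W^T E)[t] * r
    return (W @ E) * sequence_reward

# Toy 4-token response with made-up attribution values (not real SHAP/LIME output).
shap_row = np.array([0.1, 0.4, 0.3, 0.2])
lime_row = np.array([0.2, 0.2, 0.5, 0.1])
sparse_row = np.array([0.0, 0.0, 0.0, 1.0])  # assumption: original reward sits on the last token

E = np.stack([shap_row, lime_row, sparse_row])
W = np.array([0.5, 0.3, 0.2])

dense = shape_rewards(E, W, sequence_reward=2.0)
print(dense)        # per-token rewards
print(dense.sum())  # total reward across tokens
```

Because every row in this toy example sums to 1 and the weights sum to 1, the per-token rewards add up to the original sequence reward, so the shaping only redistributes credit across tokens.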
Abstract
Current reinforcement learning from human feedback (RLHF) pipelines for large language model (LLM) alignment typically assign scalar rewards to sequences, using the final token as a surrogate indicator for the quality of the entire sequence. However, this leads to sparse feedback and suboptimal token-level credit assignment. In this work, we frame reward shaping as an optimization problem focused on token-level credit assignment. We propose a reward-shaping function that leverages explainability methods such as SHAP and LIME to estimate per-token rewards from the reward model. To learn the parameters of this shaping function, we employ a bilevel optimization framework that integrates Bayesian Optimization and policy training to handle noise from the token reward estimates. Our experiments show that a better balance of token-level reward attribution improves performance over baselines on downstream tasks and reaches an optimal policy faster during training. Furthermore, we show theoretically that explainability methods that are additive feature attribution functions preserve the optimal policy of the original reward.
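The bilevel structure described above can be sketched as a toy loop in which an outer Bayesian Optimization step proposes shaping weights and an inner step scores them. This is a rough illustration under strong assumptions, not the authors' implementation: the inner loop is replaced by a hypothetical noisy quadratic (`inner_loop_score`) instead of RLHF policy training, the acquisition function is plain Expected Improvement rather than log Noisy Expected Improvement, the Gaussian Process is a small NumPy-only version with a fixed RBF kernel, and candidate weights are drawn from a Dirichlet distribution so that they sum to 1.

```python
import math
import numpy as np

def rbf_kernel(A, B, length_scale=0.3):
    # Squared-exponential kernel with unit prior variance.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale**2)

def gp_posterior(X, y, Xq, noise=1e-2):
    # GP regression with a constant mean set to the average observed score.
    K = rbf_kernel(X, X) + noise * np.eye(len(X))
    Ks = rbf_kernel(X, Xq)
    mu = y.mean() + Ks.T @ np.linalg.solve(K, y - y.mean())
    var = 1.0 - (Ks * np.linalg.solve(K, Ks)).sum(0)
    return mu, np.sqrt(np.clip(var, 1e-12, None))

def expected_improvement(mu, sigma, best):
    # Plain EI for maximization (the paper names log Noisy EI instead).
    z = (mu - best) / sigma
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z**2) / math.sqrt(2.0 * math.pi)
    return (mu - best) * cdf + sigma * pdf

def inner_loop_score(W, rng):
    # Hypothetical stand-in for "train a policy with weights W, then evaluate it".
    target = np.array([0.6, 0.3, 0.1])  # made-up optimum for the toy objective
    return -np.sum((W - target) ** 2) + rng.normal(scale=0.01)

def optimize_weights(n_init=5, n_iter=15, n_candidates=256, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.dirichlet(np.ones(3), size=n_init)           # initial weights on the simplex
    y = np.array([inner_loop_score(w, rng) for w in X])  # inner-loop evaluations
    for _ in range(n_iter):
        cand = rng.dirichlet(np.ones(3), size=n_candidates)
        mu, sigma = gp_posterior(X, y, cand)
        w_next = cand[np.argmax(expected_improvement(mu, sigma, y.max()))]
        X = np.vstack([X, w_next])
        y = np.append(y, inner_loop_score(w_next, rng))
    return X[np.argmax(y)], y.max()

best_W, best_score = optimize_weights()
print(best_W, best_score)
```

In a real pipeline each call to the inner loop would be a full policy-training run, which is why the number of outer trials is small and a sample-efficient search such as BO is attractive.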
