SCAR: Shapley Credit Assignment for More Efficient RLHF
Meng Cao, Shuyuan Zhang, Xiao-Wen Chang, Doina Precup
TL;DR
SCAR addresses the sparse reward problem in RLHF by distributing each sequence-level reward across text units using Shapley values, creating dense, principled credit signals without extra annotations. The method frames generation as a cooperative game in which units (tokens or spans) are the players, and uses a Shapley-based decomposition to produce unit-level rewards that sum to the original final reward, preserving the optimal policy via potential-based reward shaping. To stay practical, SCAR employs adaptive segmentation (tokens, spans, or sentences) and Owen-value approximations to reduce the computational burden, enabling scalable credit attribution. Empirically, SCAR accelerates learning and yields higher final rewards across sentiment control, summarization, and instruction tuning, with span-level segmentation offering a favorable balance of efficiency and performance.
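For readers unfamiliar with the decomposition, the standard Shapley value and its efficiency property can be written as follows. This is the textbook definition, not a formula quoted from the paper: N is assumed to be the set of n text units, and v(S) is assumed to be the reward assigned to the units in coalition S. How the paper scores partial text is not stated in this summary.

```latex
\phi_i(v) \;=\; \sum_{S \subseteq N \setminus \{i\}}
  \frac{|S|!\,(n - |S| - 1)!}{n!}\,
  \bigl( v(S \cup \{i\}) - v(S) \bigr),
\qquad
\sum_{i \in N} \phi_i(v) \;=\; v(N) - v(\emptyset).
```

If the empty coalition is assigned zero reward, the unit-level credits sum exactly to the sequence-level reward v(N), which is the "sum to the original final reward" property described above.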
Abstract
Reinforcement Learning from Human Feedback (RLHF) is a widely used technique for aligning Large Language Models (LLMs) with human preferences, yet it often suffers from sparse reward signals, which makes effective credit assignment challenging. In typical setups, the reward model provides a single scalar score for an entire generated sequence, offering little insight into which token-level or span-level decisions were responsible for the outcome. To address this, we propose Shapley Credit Assignment Rewards (SCAR), a novel method that leverages Shapley values from cooperative game theory. SCAR distributes the total sequence-level reward among constituent tokens or text spans according to their marginal contributions. Crucially, this creates dense reward signals without training auxiliary critique models or collecting fine-grained human annotations at intermediate generation stages. Unlike prior dense reward methods, SCAR offers a game-theoretic foundation for fair credit attribution. Theoretically, we demonstrate that SCAR preserves the original optimal policy. Empirically, across diverse tasks including sentiment control, text summarization, and instruction tuning, we show that SCAR converges significantly faster and achieves higher final reward scores than standard RLHF and attention-based dense reward baselines. Our findings suggest that SCAR provides a more effective and theoretically sound method for credit assignment in RLHF, leading to more efficient alignment of LLMs.
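The idea of distributing a sequence-level reward by marginal contribution can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_reward` is a hypothetical stand-in for a reward model, the function names are invented for illustration, and the way coalitions of units are scored is an assumption. The sketch enumerates every ordering exactly, which is only feasible for a handful of units. The paper instead relies on segmentation and Owen-value approximations to keep the cost manageable.

```python
from itertools import permutations
from math import isclose


def toy_reward(units):
    """Hypothetical reward model: scores a (possibly partial) tuple of text units."""
    score = 0.0
    if "great" in units:
        score += 1.0
    if "not" in units and "great" in units:
        score -= 1.5  # negation outweighs the positive word
    if "movie" in units:
        score += 0.2
    return score


def shapley_credits(units, value_fn):
    """Exact Shapley values: average each unit's marginal gain over all orderings."""
    n = len(units)
    credits = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        included = set()
        prev = value_fn(())  # value of the empty coalition
        for i in order:
            included.add(i)
            # Units in the coalition are kept in their original text order.
            coalition = tuple(units[j] for j in sorted(included))
            cur = value_fn(coalition)
            credits[i] += cur - prev
            prev = cur
    return [c / len(orderings) for c in credits]


units = ["not", "a", "great", "movie"]
credits = shapley_credits(units, toy_reward)
total = toy_reward(tuple(units))

for unit, credit in zip(units, credits):
    print(f"{unit:>6}: {credit:+.3f}")
print(f"sum of credits = {sum(credits):+.3f}, sequence reward = {total:+.3f}")
```

In this toy game the unit "not" receives negative credit, the filler "a" receives none, and the credits add up to the single sequence-level score, so the dense signal redistributes the original reward without changing its total.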
