SCAR: Shapley Credit Assignment for More Efficient RLHF

Meng Cao, Shuyuan Zhang, Xiao-Wen Chang, Doina Precup

TL;DR

SCAR addresses the sparse reward problem in RLHF by assigning sequence-level rewards to text units using Shapley values, creating dense, principled credit signals without extra annotations. The method frames generation as a cooperative game where units (tokens or spans) are players and uses a Shapley-based decomposition to produce unit-level rewards that sum to the original final reward, preserving the optimal policy via potential-based reward shaping. To stay practical, SCAR employs adaptive segmentation (tokens, spans, or sentences) and Owen-value approximations to reduce computational burden, enabling scalable credit attribution. Empirically, SCAR accelerates learning and yields higher final rewards across sentiment control, summarization, and instruction tuning, with span-level segmentation offering a favorable balance of efficiency and performance.
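For reference, the standard Shapley value from cooperative game theory can be written as below. The notation is an assumption made here for illustration, not taken from the paper: $N$ is the set of $n$ text units, $v(S)$ is the value assigned to a coalition $S \subseteq N$ of units, and $\phi_i$ is the credit given to unit $i$.

$$\phi_i(v) = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(n - |S| - 1)!}{n!} \left[ v(S \cup \{i\}) - v(S) \right]$$

The efficiency property, $\sum_{i \in N} \phi_i(v) = v(N) - v(\emptyset)$, is what lets the unit-level credits add up to the sequence-level reward when the empty coalition has value zero.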

Abstract

Reinforcement Learning from Human Feedback (RLHF) is a widely used technique for aligning Large Language Models (LLMs) with human preferences, yet it often suffers from sparse reward signals, making effective credit assignment challenging. In typical setups, the reward model provides a single scalar score for an entire generated sequence, offering little insight into which token-level or span-level decisions were responsible for the outcome. To address this, we propose Shapley Credit Assignment Rewards (SCAR), a novel method that leverages Shapley values from cooperative game theory. SCAR distributes the total sequence-level reward among constituent tokens or text spans based on their principled marginal contributions. Crucially, this creates dense reward signals without requiring the training of auxiliary critique models or fine-grained human annotations at intermediate generation stages. Unlike prior dense reward methods, SCAR offers a game-theoretic foundation for fair credit attribution. Theoretically, we demonstrate that SCAR preserves the original optimal policy. Empirically, across diverse tasks including sentiment control, text summarization, and instruction tuning, we show that SCAR converges significantly faster and achieves higher final reward scores than standard RLHF and attention-based dense reward baselines. Our findings suggest that SCAR provides a more effective and theoretically sound method for credit assignment in RLHF, leading to more efficient alignment of LLMs.
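To make the idea of distributing a sequence-level reward by marginal contribution concrete, here is a minimal sketch that computes exact Shapley values over three text spans. Everything in it is an assumption for illustration: `toy_reward` is a hypothetical hand-written stand-in for a learned reward model, a coalition is scored by joining only its spans in their original order, and the enumeration is exponential in the number of units, so it is not the paper's implementation or its Owen-value approximation.

```python
from itertools import combinations
from math import factorial


def toy_reward(units):
    """Hypothetical reward model: scores the text formed by the given units."""
    text = " ".join(units)
    score = 0.0
    if "great" in text:
        score += 1.0   # positive sentiment
    if "too long" in text:
        score -= 0.4   # complaint
    if "movie" in text and "great" in text:
        score += 0.2   # interaction: praise is attached to a subject
    return score


def shapley_values(units, reward_fn):
    """Exact Shapley value of each unit by enumerating all coalitions."""
    n = len(units)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):  # k = size of the coalition that excludes unit i
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                without_i = [units[j] for j in subset]
                # Re-insert unit i at its original position in the sequence.
                with_i = [units[j] for j in sorted(subset + (i,))]
                phi[i] += weight * (reward_fn(with_i) - reward_fn(without_i))
    return phi


units = ["The movie", "was great", "but too long"]
credits = shapley_values(units, toy_reward)
for unit, credit in zip(units, credits):
    print(f"{unit!r}: {credit:+.3f}")
print("sum of credits:", round(sum(credits), 6))
print("sequence reward:", toy_reward(units))
```

In this toy game the per-span credits sum to the full-sequence reward, and the interaction bonus is split equally between the two spans that jointly trigger it. With a real reward model, each coalition evaluation is a forward pass, which is why exact enumeration is only feasible for a handful of units.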

Paper Structure

This paper contains 30 sections, 1 theorem, 10 equations, 5 figures, and 5 tables.

Key Result

Theorem 3.1

Consider a parameterized language model $\pi_\theta$ with a learned reward model $R_\phi$. Let $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R^{\text{orig}}_t, \gamma)$ be the original MDP with its reward from the reward model and $\widehat{\mathcal{M}} = (\mathcal{S}, \mathcal{A}, P, R_t(\alpha), \g

Figures (5)

  • Figure 1: Comparison of reward distribution strategies for an example generated sequence. Sparse RLHF assigns the total reward at the end. SCAR and ABC distribute this reward across tokens/spans based on their respective methodologies, shown with background highlights (color hue for sign, intensity for magnitude; more intense/saturated means higher absolute contribution) and numerical scores.
  • Figure 2: Average reward per timestep during RLHF training for sentiment control (left), text summarization (center), and instruction tuning (right). Curves show the mean reward across five random seeds, with shaded regions representing the standard deviation. SCAR consistently demonstrates faster convergence and achieves higher or comparable final reward levels compared to sparse RLHF, Uniform reward distribution, and Attention-Based Credit (ABC) baselines.
  • Figure 3: Reward-KL tradeoff on the sentiment control (IMDB) task. The y-axis represents the average per-batch reward during training and the x-axis shows the square root of the KL divergence between the learned policy ($\pi$) and the reference policy ($\pi_{\text{ref}}$).
  • Figure 4: Reward versus GPU hours on the TL;DR dataset, using one sampled run per method. The y-axis shows the reward and the x-axis shows the GPU hours used for training.
  • Figure 5: Comparison between token-level and span-level SCAR on the text summarization (TL;DR) task. The y-axis represents the reward during training, and the x-axis shows the training timestep.

Theorems & Definitions (2)

  • Theorem 3.1: Policy Invariance under SCAR Reward Shaping
  • proof