Rubrics to Tokens: Bridging Response-level Rubrics and Token-level Rewards in Instruction Following Tasks

Tianze Xu, Yanzhao Zheng, Pengrui Lu, Lyumanshan Ye, Yong Wu, Zhentao Zhang, Yuanqiang Yu, Chao Ma, Jihuai Zhu, Pengfei Liu, Baohua Dong, Hangcheng Zhu, Ruohui Huang, Gang Yu

Abstract

Rubric-based Reinforcement Learning (RL) has emerged as a promising approach for aligning Large Language Models (LLMs) with complex, open-domain instruction following tasks. However, existing methods predominantly rely on response-level rewards, which introduce severe reward sparsity and reward ambiguity. To address these issues, we propose Rubrics to Tokens (RTT), a novel rubric-based RL framework that bridges coarse response-level scores and fine-grained token-level credit assignment. RTT introduces a Token-Level Relevance Discriminator to predict which tokens in a response are responsible for satisfying a specific constraint, and optimizes the policy model via RTT-GRPO, which integrates response-level and token-level advantages within a unified framework. Furthermore, because token-level rubric-based RL shifts from a one-dimensional, outcome-level reward to a three-dimensional reward space, we propose a novel group normalization method, Intra-sample Token Group Normalization, to accommodate this shift. Extensive experiments on multiple benchmarks demonstrate that RTT consistently outperforms other baselines in both instruction- and rubric-level accuracy across different models.
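
To make the aggregation described above concrete, here is a minimal sketch of how response-level and token-level signals could be combined into an aggregated advantage, following the abstract and Figure 1. The function names, the tensor shapes (G responses, R rubrics, T tokens), the relevance mask produced by the discriminator, and the weighting coefficient beta are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of RTT-GRPO-style advantage aggregation, assuming:
#  rewards:        shape (G,)      -- one scalar reward per sampled response
#  rubric_rewards: shape (G, R)    -- reward of each rubric for each response
#  relevance_mask: shape (G, R, T) -- 1 if the discriminator marks token t
#                                     as relevant to rubric r (assumed format)
import numpy as np

def response_level_advantage(rewards, eps=1e-8):
    """Standard GRPO-style normalization across the G sampled responses."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def token_level_rewards(rubric_rewards, relevance_mask):
    """Distribute per-rubric rewards onto the tokens flagged as relevant,
    yielding a (G, T) token-level reward map."""
    return np.einsum("gr,grt->gt", rubric_rewards, relevance_mask)

def intra_sample_normalize(token_rewards, eps=1e-8):
    """Assumed form of Intra-sample Token Group Normalization:
    normalize each response's token rewards within that response only."""
    mean = token_rewards.mean(axis=-1, keepdims=True)
    std = token_rewards.std(axis=-1, keepdims=True)
    return (token_rewards - mean) / (std + eps)

def rtt_grpo_advantage(rewards, rubric_rewards, relevance_mask, beta=0.1):
    """Combine both signals into an aggregated advantage A_sum.
    beta is a hypothetical weighting coefficient (cf. Figure 3)."""
    a_res = response_level_advantage(rewards)[:, None]            # (G, 1)
    a_tok = intra_sample_normalize(
        token_level_rewards(rubric_rewards, relevance_mask))      # (G, T)
    return a_res + beta * a_tok                                   # (G, T)
```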

Paper Structure

This paper contains 61 sections, 18 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Architecture of the RTT framework. The policy model first generates a group of responses for each instruction. The top branch computes $\hat{A}_{\text{res}}^{(i,t)}$ by normalizing rewards across the sampled responses. The bottom branch uses a token-level discriminator to map rewards to specific tokens based on rubrics, applying Intra-sample Token Group Normalization to calculate $\hat{A}_{\text{tok}}^{(i,t)}$. These two components are merged into the final aggregated advantage $\hat{A}_{\text{sum}}^{(i,t)}$, which drives the policy update.
  • Figure 2: Reward structure under standard GRPO and RTT-GRPO. In GRPO (left), each of $G$ sampled completions receives one outcome-level reward, and advantages are normalized along the sample axis only. RTT-GRPO (right) assigns rewards over three dimensions, motivating two group normalization strategies: inter-sample and intra-sample group normalization (a toy sketch of both appears after this list).
  • Figure 3: Sensitivity of RTT to $\beta$. Baseline $\beta=0$ is shown as a horizontal line.
  • Figure 4: Training dynamics of RTT-CSR vs. the RL-CSR baseline. The left panel shows model performance on the IFEval, IFBench, and MulDimIF benchmarks, while the right panel shows response entropy.
  • Figure 5: Training stability comparison between RTT-CSR and RL-CSR on Llama3.2-3B-Instruct. We report rollout accuracy, entropy, KL loss, and clip fraction over training. RL-CSR exhibits severe instability, while RTT-CSR remains stable.
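
As referenced in the Figure 2 caption, the following toy sketch contrasts the two group normalization strategies over a three-dimensional reward tensor. The shape (G, R, T) (responses, rubrics, tokens), the axis choices, and the function names are assumptions made for illustration; they are not the paper's exact definitions.

```python
# Toy contrast of inter-sample vs. intra-sample group normalization,
# assuming a reward tensor of shape (G, R, T):
# G sampled responses, R rubrics, T tokens.
import numpy as np

def inter_sample_group_norm(rewards, eps=1e-8):
    """Normalize each (rubric, token) slot across the G sampled responses,
    mirroring how standard GRPO normalizes along the sample axis."""
    mean = rewards.mean(axis=0, keepdims=True)
    std = rewards.std(axis=0, keepdims=True)
    return (rewards - mean) / (std + eps)

def intra_sample_group_norm(rewards, eps=1e-8):
    """Normalize rewards within each individual response, pooling its
    rubric and token dimensions (the assumed 'intra-sample' grouping)."""
    flat = rewards.reshape(rewards.shape[0], -1)
    mean = flat.mean(axis=1, keepdims=True)
    std = flat.std(axis=1, keepdims=True)
    return ((flat - mean) / (std + eps)).reshape(rewards.shape)

# Example: G=4 responses, R=3 rubrics, T=5 tokens of toy rewards.
rewards = np.random.rand(4, 3, 5)
a_inter = inter_sample_group_norm(rewards)
a_intra = intra_sample_group_norm(rewards)
```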