Alternating Reinforcement Learning with Contextual Rubric Rewards

Guangchen Lan

Abstract

Reinforcement Learning with Rubric Rewards (RLRR) is a framework that extends conventional reinforcement learning from human feedback (RLHF) and reinforcement learning with verifiable rewards (RLVR) by replacing scalar preference signals with structured, multi-dimensional, contextual rubric-based evaluations. However, existing RLRR approaches are limited to linearly compressing vector rewards into a scalar reward with fixed weightings, which is sensitive to artificial score design and fails to capture correlations among reward dimensions. To overcome the limitations of reward aggregation, this work proposes Alternating Reinforcement Learning with Rubric Rewards (ARL-RR), a framework that eliminates the need for a fixed scalarization by optimizing one semantic rubric meta-class at a time. Theoretically, we show that reward aggregation induces a variance contraction effect, which helps explain the performance gains. We further introduce a lightweight, search-based adaptation procedure that selects the next meta-class dynamically based on task performance, enabling the policy to emphasize critical objectives and thereby improve model performance. Empirically, our experiments on the HealthBench dataset with expert annotations demonstrate that ARL-RR uniformly outperforms scalarized methods in both model performance and training efficiency across different model scales (1.7B, 4B, 8B, and 14B).
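
The difference between a fixed scalarization and optimizing one meta-class at a time can be illustrated with a minimal sketch. The function names, the round-robin schedule, and the `steps_per_phase` parameter are assumptions made for illustration; the abstract does not specify how ARL-RR schedules its meta-classes.

```python
from typing import Sequence


def scalarized_reward(rewards: Sequence[float], weights: Sequence[float]) -> float:
    """Fixed linear scalarization: compress the reward vector into one scalar."""
    if len(rewards) != len(weights):
        raise ValueError("rewards and weights must have the same length")
    return sum(w * r for w, r in zip(weights, rewards))


def alternating_reward(rewards: Sequence[float], step: int, steps_per_phase: int) -> float:
    """Alternating scheme: expose only one meta-class reward per training phase.

    The round-robin order used here is a hypothetical choice, not the paper's.
    """
    if steps_per_phase <= 0:
        raise ValueError("steps_per_phase must be positive")
    active = (step // steps_per_phase) % len(rewards)  # index of the active meta-class
    return rewards[active]
```

In the scalarized case every update sees the same weighted blend, whereas in the alternating case the training signal at a given step comes from a single meta-class.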

Paper Structure (45 sections, 3 theorems, 21 equations, 7 figures, 10 tables)


Key Result

Theorem 3.1

The variance of the scalarized reward $R$ is strictly less than the variance of any individual meta-class reward $R_m$, provided that $\rho < 1$, where $\sigma^2$ is the variance of a meta-class reward and $\rho$ is the correlation coefficient between meta-class rewards.
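
A small numerical check can make the statement concrete. The sketch below assumes a setting that the summary does not spell out: every meta-class reward has the same variance $\sigma^2$, every pair has the same correlation $\rho$, and the weights $w_m$ are non-negative and sum to one. Under those assumptions, $\mathrm{Var}(R) = \sigma^2\left[\rho + (1-\rho)\sum_m w_m^2\right]$, which is below $\sigma^2$ whenever $\rho < 1$ and at least two weights are nonzero. The function name `scalarized_variance` is hypothetical.

```python
from typing import Sequence


def scalarized_variance(weights: Sequence[float], sigma2: float, rho: float) -> float:
    """Variance of sum_m w_m * R_m under an assumed equicorrelated model.

    Assumption: Var(R_m) = sigma2 for all m, and Corr(R_i, R_j) = rho for i != j.
    """
    total = 0.0
    for i, wi in enumerate(weights):
        for j, wj in enumerate(weights):
            cov = sigma2 if i == j else rho * sigma2  # covariance matrix entry (i, j)
            total += wi * wj * cov
    return total
```

For example, four equally weighted meta-classes with $\sigma^2 = 1$ and $\rho = 0.5$ give a scalarized variance of $0.625$, while $\rho = 1$ gives back $\sigma^2$ with no contraction.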

Figures (7)

  • Figure 1: Evaluation score comparison of Alternating RL and Scalarized RL across different actor model sizes.
  • Figure 2: Evaluation score comparison of ARL and SRL across different reward models. The actor model is Qwen3-4B in all evaluations. The light red and light blue lines are evaluated by the same RM used in training, while the dark red and dark blue lines are evaluated by the large Qwen3-32B model.
  • Figure 3: Evaluation results of scalarized RL and alternating RL with three different meta-class orders (Order 0, 1, 2).
  • Figure 4: Schematic of the meta-class search. Starting from the initial policy $\pi_{0}$, the orange nodes are search runs that use a percentage $p$ of the data, and the green nodes are training runs that use the full data.
  • Figure 5: Evaluation score comparison on the Qwen3-4B actor model with different searching percentages. w/o denotes the performance without the searching method.
  • ...and 2 more figures
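
The search procedure outlined in the Figure 4 caption can be sketched roughly as follows. This is a hedged illustration, not the paper's algorithm: `train_fn`, `eval_fn`, the greedy loop, the exclusion of already-trained meta-classes, and the treatment of $p$ as a fraction in $(0, 1]$ are all assumptions.

```python
import random
from typing import Callable, List, Sequence, Tuple, TypeVar

Policy = TypeVar("Policy")


def search_next_meta_class(
    policy: Policy,
    candidates: Sequence[str],
    data: Sequence,
    p: float,
    train_fn: Callable[[Policy, str, Sequence], Policy],
    eval_fn: Callable[[Policy], float],
    seed: int = 0,
) -> str:
    """Probe each candidate on a p-fraction of the data; return the best scorer."""
    if not candidates:
        raise ValueError("candidates must be non-empty")
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    rng = random.Random(seed)
    subset = rng.sample(list(data), max(1, int(len(data) * p)))
    scores = {}
    for m in candidates:
        probe = train_fn(policy, m, subset)  # cheap probe run (an "orange" node)
        scores[m] = eval_fn(probe)
    return max(candidates, key=lambda m: scores[m])


def alternating_with_search(
    policy: Policy,
    meta_classes: Sequence[str],
    data: Sequence,
    p: float,
    train_fn: Callable[[Policy, str, Sequence], Policy],
    eval_fn: Callable[[Policy], float],
) -> Tuple[Policy, List[str]]:
    """Greedy loop: pick the next meta-class by search, then train on the full data."""
    remaining = list(meta_classes)
    order: List[str] = []
    while remaining:
        best = search_next_meta_class(policy, remaining, data, p, train_fn, eval_fn)
        policy = train_fn(policy, best, data)  # full-data run (a "green" node)
        order.append(best)
        remaining.remove(best)
    return policy, order
```

The probe runs cost roughly a $p$-fraction of a full training phase per candidate, which is presumably why the search percentage is varied in Figure 5.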

Theorems & Definitions (4)

  • Theorem 3.1: Variance contraction
  • Theorem 3.2: Variance contraction
  • proof
  • Corollary 3.3: Variance contraction with equal weights