REAL: Regression-Aware Reinforcement Learning for LLM-as-a-Judge

Yasi Zhang, Tianyu Chen, Mingyuan Zhou, Oscar Leong, Ying Nian Wu, Michal Lukasik

Abstract

Large language models (LLMs) are increasingly deployed as automated evaluators that assign numeric scores to model outputs, a paradigm known as LLM-as-a-Judge. However, standard Reinforcement Learning (RL) methods typically rely on binary rewards (e.g., 0-1 accuracy), thereby ignoring the ordinal structure inherent in regression tasks; for instance, they fail to recognize that predicting 4 is significantly better than predicting 1 when the ground truth is 5. Conversely, existing regression-aware approaches are often confined to Supervised Fine-Tuning (SFT), limiting their ability to explore optimal reasoning paths. To bridge this gap, we propose REAL (REgression-Aware Reinforcement Learning), a principled RL framework designed to optimize regression rewards, which we also prove to be optimal for correlation metrics. A key technical challenge is that the regression objective is explicitly policy-dependent, which invalidates standard policy gradient methods. To address this, we employ the generalized policy gradient estimator, which naturally decomposes optimization into two complementary components: (1) exploration over Chain-of-Thought (CoT) trajectories, and (2) regression-aware prediction refinement of the final score. Extensive experiments across model scales (8B to 32B) demonstrate that REAL consistently outperforms both regression-aware SFT baselines and standard RL methods, exhibiting significantly better generalization on out-of-domain benchmarks. On Qwen3-32B specifically, we achieve gains of +8.40 Pearson and +7.20 Spearman correlation over the SFT baseline, and +18.30/+11.20 over the base model. These findings highlight the critical value of integrating regression objectives into RL exploration for accurate LLM evaluation.
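
The contrast between binary and regression-aware rewards in the abstract's example (ground truth 5, predictions 4 and 1) can be illustrated with a small sketch. The function names are hypothetical, and the negative squared error used here is an assumed stand-in for a regression reward, not necessarily the exact reward defined in the paper.

```python
def binary_reward(pred: float, target: float) -> float:
    """0-1 accuracy reward: 1 only on an exact match."""
    return 1.0 if pred == target else 0.0


def squared_error_reward(pred: float, target: float) -> float:
    """Assumed regression-style reward: negative squared error."""
    diff = float(pred) - float(target)
    return -(diff * diff)


target = 5
for pred in (5, 4, 1):
    # The binary reward cannot tell 4 apart from 1; the squared-error reward can.
    print(pred, binary_reward(pred, target), squared_error_reward(pred, target))
```

Under the binary reward, predictions 4 and 1 both receive 0, whereas the squared-error reward assigns -1 and -16 respectively, so the near miss is penalized far less than the distant one.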

Paper Structure

This paper contains 56 sections, 4 theorems, 42 equations, 4 figures, 11 tables, 2 algorithms.

Key Result

Lemma 3.1

Consider the following distributions: the distribution over input prompts $x \sim P_{\mathcal{D}}(.)$, the distribution over chains of thought (CoTs) $c \sim \pi_{\theta}(. \mid x)$ generated by the LLM policy $\pi_{\theta}$, and the distribution over targets conditioned on the inputs, $y^* \sim P(.

Figures (4)

  • Figure 1: Overview of the REAL framework. REAL addresses the limitations of standard RL in LLM-as-a-Judge tasks by optimizing a policy-dependent regression reward. The framework employs a generalized policy gradient whose update decomposes into two terms: (1) Exploration Over Reasoning Trajectory; and (2) Regression-Aware Prediction Refinement. This enables principled optimization of ordinal structures that standard RL with binary rewards typically ignores. The full algorithm is in Alg. \ref{alg:real}.
  • Figure 2: Evaluation performance during RL training. Both standard RL with binary reward (i.e., $r_{\text{acc}} = \mathbf{1}(y = y^*)$) and REAL with regression-aware reward (i.e., Eq. \ref{eq:regression-reward}) were initialized from the SOTA SFT checkpoint, TRACT (Chiang et al., 2025). Standard RL yields lower correlation metrics than our proposed approach, REAL.
  • Figure 3: Performance gains on Qwen3-32B. Our method achieves an average increase of +8.40 Pearson and +7.20 Spearman correlation over the SFT baseline, and +18.30/+11.20 over the base model.
  • Figure 4: Response length and entropy during REAL training. Response length increases, and the per-token entropy of the policy model decreases steadily.
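
The two-term decomposition described in the Figure 1 caption can be checked numerically on a toy problem. The sketch below is an illustration under stated assumptions, not the paper's algorithm: the policy is a softmax over a handful of fixed CoTs, the score prediction is the expected value of a softmax over the scores 1 to 5, and the reward is the negative squared error of that prediction. All names (`objective`, `two_term_gradient`, and so on) are hypothetical. Because the reward depends on the same parameters as the policy, the gradient of the expected reward is the sum of a score-function term (exploration over CoTs) and a direct reward-gradient term (prediction refinement).

```python
import math

SCORES = [1.0, 2.0, 3.0, 4.0, 5.0]  # assumed rating scale


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_score(score_logits):
    """Assumed regression-aware prediction: mean of the score distribution."""
    return sum(s * p for s, p in zip(SCORES, softmax(score_logits)))


def objective(cot_logits, score_logits, y_star):
    """Expected reward J = sum_c pi(c) * r(c), with r(c) = -(y_hat(c) - y*)^2."""
    pi = softmax(cot_logits)
    rewards = [-(expected_score(row) - y_star) ** 2 for row in score_logits]
    return sum(p * r for p, r in zip(pi, rewards))


def two_term_gradient(cot_logits, score_logits, y_star):
    pi = softmax(cot_logits)
    y_hat = [expected_score(row) for row in score_logits]
    rewards = [-(y - y_star) ** 2 for y in y_hat]
    mean_reward = sum(p * r for p, r in zip(pi, rewards))
    # Term 1 (exploration): E_c[r(c) * grad log pi(c)] w.r.t. the CoT logits.
    grad_cot = [p * (r - mean_reward) for p, r in zip(pi, rewards)]
    # Term 2 (refinement): E_c[grad r(c)] w.r.t. the score logits.
    grad_score = []
    for p_c, row, y in zip(pi, score_logits, y_hat):
        probs = softmax(row)
        grad_score.append(
            [p_c * (-2.0 * (y - y_star)) * q * (s - y) for q, s in zip(probs, SCORES)]
        )
    return grad_cot, grad_score
```

In this toy setting the two parameter groups are disjoint, so each term shows up as the gradient for one group; comparing both against finite differences of `objective` confirms that together they give the full gradient. In an LLM the CoT and score parameters are shared, so the two terms would be summed into one update.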

Theorems & Definitions (4)

  • Lemma 3.1: Optimality of Squared Error for Pearson Correlation
  • Lemma 4.1: Generalized Policy Gradient with Policy-Dependent Rewards for Regression
  • Lemma 2.1: Optimality of the Posterior Mean for Pearson Correlation
  • Lemma 2.2: Optimality of Regression Objective