Reinforcement Learning with Conditional Expectation Reward

Changyi Xiao; Caijun Xu; Yixin Cao

TL;DR

Experimental results demonstrate that CER is effective across a wide range of reasoning tasks, spanning both mathematical and general domains, indicating that CER serves as a flexible and general verification mechanism.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective in enhancing the reasoning capabilities of large language models, particularly in domains such as mathematics where reliable rule-based verifiers can be constructed. However, the reliance on handcrafted, domain-specific verification rules substantially limits the applicability of RLVR to general reasoning domains with free-form answers, where valid answers often exhibit significant variability, making it difficult to establish complete and accurate rules. To address this limitation, we propose Conditional Expectation Reward (CER), which leverages the large language model itself as an implicit verifier, and is therefore applicable to general domains and eliminates the need for external verifiers or auxiliary models. CER is defined as the expected likelihood of generating the reference answer conditioned on the generated answer. In contrast to rule-based verifiers that yield binary feedback, CER provides a soft, graded reward signal that reflects varying degrees of correctness, making it better suited to tasks where answers vary in correctness. Experimental results demonstrate that CER is effective across a wide range of reasoning tasks, spanning both mathematical and general domains, indicating that CER serves as a flexible and general verification mechanism. The code is available at https://github.com/changyi7231/CER.

Paper Structure (23 sections, 2 theorems, 19 equations, 2 figures, 3 tables)

Introduction
Conditional Expectation Reward
RLVR
Definition
Properties
Empirical CER
Objective
Efficiency
Experiments
Settings
Datasets
Evaluation
Baselines
Hyperparameter settings
Results
...and 8 more sections

Key Result

Theorem 1

If $a=a^*$, then with equality if and only if $\pi_\theta(a^* | s, q)$ is constant over all $(q,s)$ such that $\pi_\theta(s | q)>0$.

Figures (2)

Figure 1: An illustration of CER computation, where RN($\cdot$) denotes row normalization. The left panel depicts the generation process of the quadruple $(q, s_i, a_i, a^*)$, while the right panel shows the CER computation for the quadruple, corresponding to Eq. (8).
Figure 2: This figure illustrates the computation of CER as defined in Eq. (8). The left panel shows the question, the reference answer, and the 16 generated answers. The right panel depicts the components: the reward vector $\bm{R}$ (left column), the row-normalized matrix $\bm{D}^{-1}\bm{W}$ (central block), and the reference-likelihood vector $\bm{P}$ (right column).
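
The computation described in the two captions can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: Eq. (8) is not reproduced here, so the sketch assumes that entry `W[i][j]` is the model's likelihood of generated answer $a_i$ given sampled reasoning $s_j$, that $\bm{D}$ is the diagonal matrix of row sums of $\bm{W}$, and that `P[j]` is the likelihood of the reference answer $a^*$ given $s_j$. The function name `empirical_cer` and the toy numbers are hypothetical.

```python
def empirical_cer(W, P):
    """Compute R = D^{-1} W P with plain Python lists.

    Assumptions (not taken from the paper's Eq. (8)):
      W[i][j] -- likelihood of generated answer a_i given reasoning s_j
      P[j]    -- likelihood of the reference answer a* given reasoning s_j
    Each reward R[i] is a weighted average of P, with weights given by
    row i of W after row normalization (the RN operator in Figure 1).
    """
    rewards = []
    for row in W:
        row_sum = sum(row)
        if row_sum == 0.0:
            # Degenerate row: no reasoning supports this answer, so fall back to 0.
            rewards.append(0.0)
            continue
        weighted = sum(w * p for w, p in zip(row, P))
        rewards.append(weighted / row_sum)
    return rewards


# Toy example with two generated answers and two sampled reasonings.
W = [[0.8, 0.2],
     [0.1, 0.9]]
P = [0.5, 1.0]
R = empirical_cer(W, P)
print(R)
```

Because each row of $\bm{D}^{-1}\bm{W}$ sums to one, every reward in this sketch lies between the smallest and largest entry of `P`, which is consistent with the soft, graded signal described in the abstract.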

Theorems & Definitions (4)

Theorem 1: Exact-Match Case
proof
Theorem 2: Value Equivalence
proof
