Blockwise Advantage Estimation for Multi-Objective RL with Verifiable Rewards

Kirill Pavlenko, Alexander Golubev, Simon Karasik, Boris Yangel

TL;DR

This paper introduces Blockwise Advantage Estimation (BAE), a GRPO-compatible framework for multi-objective RL in structured generations that assigns each objective's advantage only to the tokens of its corresponding text block. The key challenge is choosing baselines for later blocks, which BAE addresses with Outcome-Conditioned Baselines (OCB): these approximate conditional state values from within-group statistics and require no additional rollouts. Empirically, BAE with OCB is competitive in accuracy and calibration with reward-design approaches such as RLCR, while preserving test-time gains from confidence-weighted ensembling; the paper also demonstrates applicability to multi-attempt refinement and long-horizon generation. The approach offers a modular recipe for scalable, multi-objective credit assignment in long-context generation. Stated limitations concern strata population and the need for known segment boundaries; future work includes richer conditioning and broader evaluations.

Abstract

Group Relative Policy Optimization (GRPO) assigns a single scalar advantage to all tokens in a completion. For structured generations with explicit segments and objectives, this couples unrelated reward signals across segments, leading to objective interference and misattributed credit. We propose Blockwise Advantage Estimation, a family of GRPO-compatible methods that assigns each objective its own advantage and applies it only to the tokens in the corresponding text block, reducing reliance on hand-designed scalar rewards and scaling naturally to additional objectives. A key challenge is estimating advantages for later blocks whose rewards are conditioned on sampled prefixes; standard unbiased approaches require expensive nested rollouts from intermediate states. Concretely, we introduce an Outcome-Conditioned Baseline that approximates intermediate state values using only within-group statistics by stratifying samples according to a prefix-derived intermediate outcome. On math tasks with uncertainty estimation, our method mitigates reward interference, is competitive with a state-of-the-art reward-designed approach, and preserves test-time gains from confidence-weighted ensembling. More broadly, it provides a modular recipe for optimizing sequential objectives in structured generations without additional rollouts.
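The two ideas in the abstract, per-block advantages and an outcome-conditioned baseline for the later block, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a two-block completion (a solution followed by a confidence statement), uses answer correctness as the prefix-derived intermediate outcome, and uses a Brier-style score as the second-block reward. The function names, the `min_stratum` fallback to the group mean for under-populated strata, and the omission of standard-deviation normalization are all assumptions made for illustration.

```python
import numpy as np


def blockwise_advantages(r_first, r_second, outcome, min_stratum=2):
    """Per-block advantages for one group of G sampled completions.

    r_first  : reward for block 1 (e.g. answer correctness), shape (G,)
    r_second : reward for block 2 (e.g. a calibration score), shape (G,)
    outcome  : discrete intermediate outcome derived from the prefix, shape (G,)
    """
    r_first = np.asarray(r_first, dtype=float)
    r_second = np.asarray(r_second, dtype=float)
    outcome = np.asarray(outcome)

    # Block 1: group-mean baseline, as in GRPO (std normalization omitted).
    adv_first = r_first - r_first.mean()

    # Block 2: the baseline is the mean block-2 reward over samples in the
    # same group that share the same intermediate outcome (the stratum).
    # Assumption: strata with fewer than `min_stratum` members fall back
    # to the plain group mean.
    baseline = np.full_like(r_second, r_second.mean())
    for value in np.unique(outcome):
        members = outcome == value
        if members.sum() >= min_stratum:
            baseline[members] = r_second[members].mean()
    adv_second = r_second - baseline
    return adv_first, adv_second


def token_advantages(adv_first, adv_second, block_ids):
    """Broadcast block advantages to tokens.

    block_ids : shape (G, T); 0 marks block-1 tokens, 1 marks block-2 tokens,
                any other value (e.g. -1 for padding) receives zero advantage.
    """
    block_ids = np.asarray(block_ids)
    out = np.zeros(block_ids.shape, dtype=float)
    out = np.where(block_ids == 0, np.asarray(adv_first)[:, None], out)
    out = np.where(block_ids == 1, np.asarray(adv_second)[:, None], out)
    return out


# Hypothetical group of four completions for a single prompt.
correct = np.array([1, 1, 0, 0])
confidence = np.array([0.9, 0.5, 0.2, 0.6])
r_conf = 1.0 - (confidence - correct) ** 2  # Brier-style reward (assumed)

adv_sol, adv_conf = blockwise_advantages(correct, r_conf, outcome=correct)
```

In this toy group the confidence tokens of each sample are compared only against samples with the same correctness outcome, so a well-calibrated confidence on an incorrect solution is not penalized merely because the solution was wrong.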

Paper Structure

This paper contains 43 sections, 27 equations, 4 figures, 7 tables, 1 algorithm.

Figures (4)

  • Figure 1: Error distributions relative to Monte Carlo (MC) estimates of the true advantage for three methods (Batch Mean, Group Mean, and OCB), shown separately for correct and incorrect solutions. Batch Mean gives a skewed approximation, unlike Group Mean, while OCB achieves the lowest RMSE across all groups.
  • Figure 2: TTS evaluation of the RLCR and OCB methods on three datasets: (a) MATH500, (b) GSM8K, (c) AIME23–25.
  • Figure 3: TTS evaluation within Two-Attempt Refinement on the MATH500 dataset. The second attempt achieves higher performance across all aggregation methods.
  • Figure 4: Expected Calibration Error (ECE) computed with fixed-width bins on three datasets: (a) MATH500, (b) GSM8K, (c) AIME23–25.
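
For readers unfamiliar with the metric in Figure 4, the following is a minimal sketch of the standard Expected Calibration Error with fixed-width bins: the sample-weighted average of the gap between mean confidence and accuracy within each bin. The number of bins (`n_bins=10`) and the handling of a confidence of exactly 1.0 are assumptions; the paper's exact binning settings are not given in this excerpt.

```python
import numpy as np


def expected_calibration_error(confidence, correct, n_bins=10):
    """ECE with fixed-width bins over [0, 1].

    confidence : predicted probability that the answer is correct, shape (N,)
    correct    : 1 if the answer was correct, else 0, shape (N,)
    """
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)

    # Bin b covers [b / n_bins, (b + 1) / n_bins); a confidence of exactly
    # 1.0 is placed in the last bin (assumed convention).
    bin_index = np.minimum((confidence * n_bins).astype(int), n_bins - 1)

    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_index == b
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidence[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by the fraction of samples in the bin
    return float(ece)


# Hypothetical predictions: two overconfident samples and two low-confidence ones.
ece = expected_calibration_error([0.95, 0.95, 0.15, 0.15], [1, 0, 0, 0])
```

Empty bins contribute nothing to the sum, so the result depends on the bin count when samples are few; comparisons between methods should use the same binning.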