Execution-Grounded Credit Assignment for GRPO in Code Generation

Abhijit Kumar; Natalya Kumar; Shikhar Gupta

Abstract

Critic-free reinforcement learning with verifiable rewards (RLVR) improves code generation by optimizing unit-test pass rates, but GRPO-style updates suffer from coarse credit assignment: a single outcome signal is spread uniformly across long programs even when failure stems from a localized semantic error. We propose Execution-Grounded Credit Assignment (EGCA), which localizes GRPO updates using execution traces. For programs that satisfy algorithmic constraints but fail tests, EGCA executes the candidate and a canonical reference solution (curated once offline; used for analysis, not supervision) under identical instrumentation, identifies the earliest semantic divergence, and assigns advantage only to the corresponding token span while masking downstream tokens. EGCA is a drop-in modification requiring no critic, auxiliary loss, or learned verifier, yielding 82.1% pass@1 on HumanEval (+3.1 over GRPO) and 68.9% on MBPP (+1.5) with 18% wall-clock overhead.
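The localization step described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the trace format (one state snapshot per executed step, tagged with the candidate's source line), the helper names `earliest_divergence` and `localized_advantages`, the per-token line map, zeroing tokens before the divergent span, and the uniform fallback when no divergence is found are all hypothetical choices made for the example.

```python
from typing import List, Optional, Tuple

# Hypothetical trace step: (source line in the candidate, state snapshot).
Step = Tuple[int, tuple]


def earliest_divergence(cand: List[Step], ref_states: List[tuple]) -> Optional[int]:
    """Return the candidate source line of the first step whose state
    differs from the reference trace, or None if the shared prefix agrees."""
    for (line, state), ref_state in zip(cand, ref_states):
        if state != ref_state:
            return line
    return None


def localized_advantages(
    advantage: float, token_lines: List[int], div_line: Optional[int]
) -> List[float]:
    """Build per-token advantages from one sequence-level advantage.

    token_lines[i] is the source line that token i belongs to.
    Tokens on the divergent line keep the advantage; all others are
    masked to zero (assumption: upstream tokens are masked as well).
    """
    if div_line is None:
        # Assumed fallback: uniform credit, as in sequence-level GRPO.
        return [advantage] * len(token_lines)
    return [advantage if ln == div_line else 0.0 for ln in token_lines]


# Toy traces: the candidate first disagrees with the reference on line 3.
cand_trace = [(1, (0,)), (2, (1,)), (3, (5,)), (4, (9,))]
ref_trace = [(0,), (1,), (3,), (6,)]
div = earliest_divergence(cand_trace, ref_trace)
per_token = localized_advantages(-1.0, [1, 1, 2, 2, 3, 3, 3, 4], div)
```

In this toy case only the three tokens on line 3 receive the negative advantage. The state on line 4 also differs from the reference, but that line is treated as a downstream consequence and masked.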

Paper Structure (56 sections, 10 equations, 2 figures, 8 tables)

Introduction
Related Work
Method
Problem Setting
Canonical Solutions as a Structural Pivot
Constraint-Guided Sampling
Constraint extraction.
Sampling with constraints.
Constraint satisfaction.
Structural Validation (Comparability Gate)
Failure Modes and Credit Operator
Syntax span.
Divergence span.
Token-level advantage operator.
Execution-Grounded Divergence Localization
...and 41 more sections

Figures (2)

Figure 1: Motivation: credit smear vs. localized updates. Sequence-level RLVR objectives apply unit-test outcomes uniformly across long programs, penalizing large spans of correct code for a localized semantic bug. EGCA concentrates gradient mass on the earliest semantically divergent span (identified via execution) while masking downstream tokens, improving credit assignment in the near-correct regime.
Figure 2: EGCA pipeline. We extract constraints from a canonical reference, sample and execute a group of programs, route each into syntax/constraint/logic/correct via deterministic gates, and apply token-level GRPO by localizing advantage (compiler span for syntax, earliest reference-trace divergence for logic) while masking downstream tokens.
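The routing stage in the Figure 2 caption (syntax / constraint / logic / correct) can be illustrated with a small sketch. The gate order, the function name `route`, and the two callbacks are assumptions made for the example. The syntax gate here uses Python's standard `ast.parse`, and the constraint check (the program must contain a loop) is a hypothetical stand-in for constraints extracted from the canonical reference.

```python
import ast
from typing import Callable


def route(
    source: str,
    satisfies_constraints: Callable[[ast.AST], bool],
    passes_tests: Callable[[str], bool],
) -> str:
    """Deterministically assign a sampled program to one of four buckets."""
    try:
        tree = ast.parse(source)  # gate 1: does the program parse?
    except SyntaxError:
        return "syntax"
    if not satisfies_constraints(tree):  # gate 2: structural constraints
        return "constraint"
    if not passes_tests(source):  # gate 3: unit tests
        return "logic"
    return "correct"


def has_loop(tree: ast.AST) -> bool:
    """Hypothetical constraint: the program must contain a loop."""
    return any(isinstance(n, (ast.For, ast.While)) for n in ast.walk(tree))


def sum_test(source: str) -> bool:
    """Toy unit test: the program must define total(xs) returning the sum."""
    namespace: dict = {}
    exec(source, namespace)
    return namespace["total"]([1, 2, 3]) == 6


broken = "def total(xs:\n    pass"
no_loop = "def total(xs):\n    return sum(xs)"
buggy = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s - 1"
good = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s"
labels = [route(p, has_loop, sum_test) for p in (broken, no_loop, buggy, good)]
```

Only programs routed to `logic` (constraints satisfied, tests failed) would go on to trace-based divergence localization. Programs routed to `syntax` would instead use the compiler-reported span, as the caption describes.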
