Constrained Group Relative Policy Optimization

Roger Girgis, Rodrigue de Schaetzen, Luke Rowe, Azalée Robitaille, Christopher Pal, Liam Paull

TL;DR

Constrained GRPO addresses constrained policy optimization for large multimodal models by integrating indicator-cost constraints with a Lagrangian multiplier framework atop GRPO. A key finding is that scalarizing rewards before group normalization can implicitly reweight terms due to within-group variance and covariance, misaligning multipliers with intended trade-offs; the authors prove this effect and propose scalarizing advantages instead, i.e., $A_{\text{ScAdv}} = \lambda_R Z_R - \sum_{k=1}^K \lambda_k Z_{C_k}$, to preserve multiplier semantics. Empirical results in a gridworld and in NAVSIM v2 driving benchmarks show that scalarizing advantages yields more stable constraint enforcement and higher task performance under constrained GRPO, establishing a practical, scalable approach for constrained policy optimization in embodied AI with large foundation-model backbones. The work provides a concrete recipe for balancing behavior constraints with task objectives in critic-free policy optimization, with broad relevance to safety-sensitive multimodal systems.
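To make the difference between the two constructions concrete, here is a minimal sketch for a single rollout group. It assumes population standard deviation for group normalization and a small `eps` for numerical safety; the function names, the `eps` value, and the toy data are illustrative assumptions, not the authors' implementation.

```python
from statistics import fmean, pstdev


def standardize(values, eps=1e-8):
    """Group-normalize: subtract the within-group mean, divide by the within-group std."""
    mu = fmean(values)
    sigma = pstdev(values)  # assumption: population std over the group
    return [(v - mu) / (sigma + eps) for v in values]


def scalarized_reward_advantages(rewards, costs, lam_r, lams):
    """Combine reward and costs into one scalar per rollout, then normalize once."""
    scalar = [
        lam_r * r - sum(lam * c[i] for lam, c in zip(lams, costs))
        for i, r in enumerate(rewards)
    ]
    return standardize(scalar)


def scalarized_advantages(rewards, costs, lam_r, lams):
    """Normalize each component separately, then combine: lam_R * Z_R - sum_k lam_k * Z_Ck."""
    z_r = standardize(rewards)
    z_cs = [standardize(c) for c in costs]
    return [
        lam_r * z_r[i] - sum(lam * z[i] for lam, z in zip(lams, z_cs))
        for i in range(len(rewards))
    ]


# Hypothetical group of 8 rollouts: task reward and one rare indicator cost.
rewards = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0]
lava = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

print(scalarized_reward_advantages(rewards, [lava], lam_r=1.0, lams=[0.5]))
print(scalarized_advantages(rewards, [lava], lam_r=1.0, lams=[0.5]))
```

In the first function the weight each component ends up with depends on the group's variances and covariances; in the second, each standardized component is multiplied by exactly the multiplier that was passed in.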

Abstract

While Group Relative Policy Optimization (GRPO) has emerged as a scalable framework for critic-free policy learning, extending it to settings with explicit behavioral constraints remains underexplored. We introduce Constrained GRPO, a Lagrangian-based extension of GRPO for constrained policy optimization. Constraints are specified via indicator cost functions, enabling direct optimization of violation rates through a Lagrangian relaxation. We show that a naive multi-component treatment in advantage estimation can break constrained learning: mismatched component-wise standard deviations distort the relative importance of the different objective terms, which in turn corrupts the Lagrangian signal and prevents meaningful constraint enforcement. We formally derive this effect to motivate our scalarized advantage construction that preserves the intended trade-off between reward and constraint terms. Experiments in a toy gridworld confirm the predicted optimization pathology and demonstrate that scalarizing advantages restores stable constraint control. In addition, we evaluate Constrained GRPO on robotics tasks, where it improves constraint satisfaction while increasing task success, establishing a simple and effective recipe for constrained policy optimization in embodied AI domains that increasingly rely on large multimodal foundation models.
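The Lagrangian relaxation over indicator costs can be sketched as a projected dual-ascent step on each multiplier. This is the textbook form of the update, shown under the assumption that each constraint bounds a violation rate by a threshold; the learning rate, function names, and example numbers are hypothetical, and the paper's exact update rule is not given in this section.

```python
def violation_rate(indicator_costs):
    """Fraction of rollouts that violated the constraint (costs are 0/1 indicators)."""
    return sum(indicator_costs) / len(indicator_costs)


def update_multipliers(lams, violation_rates, thresholds, lr=0.05):
    """Projected dual ascent: raise a multiplier when its rate exceeds the threshold,
    lower it otherwise, and clip at zero so multipliers stay non-negative."""
    return [
        max(0.0, lam + lr * (rate - d))
        for lam, rate, d in zip(lams, violation_rates, thresholds)
    ]


# Hypothetical batch: lava violated in 1 of 20 rollouts, battery in 1 of 20.
lava_costs = [1] + [0] * 19
battery_costs = [0] * 19 + [1]

rates = [violation_rate(lava_costs), violation_rate(battery_costs)]
thresholds = [0.01, 0.1]  # threshold values taken from the gridworld figure captions

lams = update_multipliers([0.0, 0.0], rates, thresholds)
print(lams)
```

Because the costs are indicators, the batch mean of a cost is directly an estimate of the violation rate, which is what lets the multiplier update target the rate itself.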

Paper Structure (26 sections, 2 theorems, 30 equations, 6 figures, 2 tables)

This paper contains 26 sections, 2 theorems, 30 equations, 6 figures, and 2 tables.

Key Result

Theorem 4.1

Let and define the scalarized reward $R_S := \boldsymbol{\lambda}^\top \mathbf{x}$. Denote the within-group mean and covariance of $\mathbf{x}$ by $\boldsymbol{\mu}:=\mathbb{E}[\mathbf{x}]$ and $\mathbf{\Sigma}:=\mathrm{Cov}(\mathbf{x})$, respectively, and let Further, define per-component standardizations so that where $\mu_{(\cdot)}$ and $\sigma_{(\cdot)}$ denote the within-group mean and st
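The reweighting named by this theorem can be sketched from the definitions of $R_S$, $\boldsymbol{\mu}$, and $\mathbf{\Sigma}$ alone. The sketch assumes that $\sigma_S$ denotes the within-group standard deviation of $R_S$, that $Z_i := (x_i - \mu_i)/\sigma_i$ is the per-component standardization with $\sigma_i > 0$ and $\sigma_S > 0$, and that $A_{\text{ScRew}}$ names the advantage obtained by normalizing the scalarized reward. The symbols $\sigma_S$, $A_{\text{ScRew}}$, and $w_i$ are illustrative notation and may differ from the paper's.

```latex
\begin{aligned}
\sigma_S^2 &= \mathrm{Var}(R_S) = \boldsymbol{\lambda}^\top \mathbf{\Sigma}\, \boldsymbol{\lambda}, \\
A_{\text{ScRew}} &= \frac{R_S - \boldsymbol{\lambda}^\top \boldsymbol{\mu}}{\sigma_S}
  = \sum_i \frac{\lambda_i \,(x_i - \mu_i)}{\sigma_S}
  = \sum_i \underbrace{\frac{\lambda_i \,\sigma_i}{\sigma_S}}_{w_i} \, Z_i .
\end{aligned}
```

Under these assumptions the weight actually applied to each standardized component is $w_i = \lambda_i \sigma_i / \sigma_S$ rather than $\lambda_i$, so it changes with each component's within-group standard deviation and, through $\sigma_S$, with the covariances between components.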

Figures (6)

  • Figure 1: Overview of our proposed approach. We extend GRPO to constrained policy optimization by enforcing user-specified behavioral constraints with thresholds (left). Naively scalarizing reward distorts reward–constraint trade-offs, producing inconsistent effective weights (middle; e.g., red/green swap). Scalarized advantages preserves multiplier semantics, yielding stable learning and effective constraint enforcement (right).
  • Figure 2: Final performance of GRPO policies evaluated over 1,000 episodes in the gridworld with fixed penalty weights, comparing scalarized rewards (blue) and scalarized advantages (red). Scalarized advantages consistently achieves higher goal-reaching rates while tracking the intended trade-off induced by the lava-weight sweep more faithfully, whereas scalarized rewards suppresses lava visitation disproportionately.
  • Figure 3: Learning dynamics on the gridworld under Constrained GRPO, comparing scalarized rewards (left) with scalarized advantages (right). Scalarized advantages yields more stable training and uses the available cost budget, with behavior rates adapting toward the specified thresholds as the Lagrangian multipliers adjust. In contrast, scalarized rewards collapses constraint rates toward near-zero early in training with the effective weights remaining larger and noisier despite small learned multipliers.
  • Figure 4: Learning dynamics on the gridworld under Constrained GRPO, comparing scalarized advantages (left) and scalarized rewards (right). In this experiment, we set $\tilde{d}_{\text{lava}}=0.01$ and do not enforce the battery constraint. Scalarized advantages again produces a stronger final policy and makes effective use of the allowable cost budget, with the lava rate touching the specified threshold in order to allow for success on the main task. In contrast, scalarized rewards drives the constraint rate to near-zero early in training; despite small multiplier values, the corresponding effective weights remain larger and noisier, leading to less stable learning and lower final performance.
  • Figure 5: Learning dynamics on the gridworld under Constrained GRPO, comparing scalarized advantages (left) and scalarized rewards (right). In this experiment, we set $\tilde{d}_{\text{lava}}=0.01$ and $\tilde{d}_{\text{battery}}=0.1$. Scalarized advantages again produces a stronger final policy and makes effective use of its allowable cost budget, with both the lava rate and the battery rate stabilizing near their specified thresholds in order to allow for success on the main task. In contrast, scalarized rewards drives both constraints' cost rates to near-zero early in training. Even with small multiplier values, the corresponding effective weights remain larger and noisier, leading to less stable learning and lower final performance. Interestingly, when a cost rate tries to creep higher, the effective multipliers react excessively, and the agent fails to navigate the environment adequately, as shown by the poor goal success rate.
  • ...and 1 more figure

Theorems & Definitions (3)

  • Theorem 4.1: Implicit reweighting under scalarized rewards
  • Theorem 1.1: Theorem 4.1 (restated)
  • proof