Group Orthogonalized Policy Optimization: Group Policy Optimization as Orthogonal Projection in Hilbert Space

Wang Zixian

TL;DR

Experiments show that GOPO achieves competitive generalization while maintaining stable gradient dynamics and entropy preservation in regimes where clipping-based methods plateau.

Abstract

We present Group Orthogonalized Policy Optimization (GOPO), a new alignment algorithm for large language models derived from the geometry of Hilbert function spaces. Instead of optimizing on the probability simplex and inheriting the exponential curvature of Kullback-Leibler divergence, GOPO lifts alignment into the Hilbert space $L^2(\pi_k)$ of square-integrable functions with respect to the reference policy. Within this space, the simplex constraint reduces to a linear orthogonality condition $\langle v, 1 \rangle = 0$, defining a codimension-one subspace $\mathcal{H}_0$. Minimizing distance to an unconstrained target $u^*$ yields the work-dissipation functional $J(v) = \langle g, v \rangle - \frac{\mu}{2} \|v\|^2$, whose maximizer follows directly from the Hilbert projection theorem. Enforcing the boundary $v \ge -1$ produces a bounded Hilbert projection that induces exact sparsity, assigning zero probability to catastrophically poor actions through a closed-form threshold. To connect this functional theory with practice, GOPO projects from infinite-dimensional $L^2(\pi_k)$ to a finite empirical subspace induced by group sampling. Because group-normalized advantages sum to zero, the Lagrange multiplier enforcing probability conservation vanishes exactly, reducing the constrained projection to an unconstrained empirical loss. The resulting objective has constant Hessian curvature $\mu I$, non-saturating linear gradients, and an intrinsic dead-zone mechanism without heuristic clipping. Experiments on mathematical reasoning benchmarks show that GOPO achieves competitive generalization while maintaining stable gradient dynamics and entropy preservation in regimes where clipping-based methods plateau.
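The claim that the multiplier vanishes for group-normalized advantages can be checked numerically. The following is a minimal sketch, not the paper's implementation: it assumes uniform empirical weights $1/G$ over a sampled group, and the names `group_normalize`, `project_zero_mean`, and `work_dissipation` are hypothetical helpers introduced only for illustration.

```python
import numpy as np

def group_normalize(rewards, eps=1e-8):
    """Zero-mean, roughly unit-variance advantages within one sampled group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def project_zero_mean(u):
    """Orthogonal projection onto the zero-mean subspace (uniform weights assumed).

    Returns the projected vector and the subtracted constant, which plays the
    role of the Lagrange multiplier for probability conservation.
    """
    lam = u.mean()
    return u - lam, lam

def work_dissipation(v, g, mu):
    """Empirical J(v) = <g, v> - (mu / 2) ||v||^2 with a 1/G-weighted inner product."""
    return np.mean(g * v) - 0.5 * mu * np.mean(v ** 2)

mu = 2.0                                   # assumed dissipation strength
rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
g = group_normalize(rewards)               # advantages sum to zero by construction
u = g / mu                                 # unconstrained target
v, lam = project_zero_mean(u)              # lam is zero up to rounding, so v equals u

# Any perturbation of v lowers J by (mu / 2) * mean(delta^2): constant curvature.
rng = np.random.default_rng(0)
delta = rng.normal(size=g.shape)
gap = work_dissipation(v, g, mu) - work_dissipation(v + delta, g, mu)
```

In this sketch the projection is a no-op because the input is already zero-mean, which is the sense in which the constrained problem reduces to an unconstrained one. The bound $v \ge -1$ is deliberately left out here.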

Paper Structure (52 sections, 10 theorems, 33 equations, 4 figures, 1 table)

Introduction
A Hilbert Space Perspective.
Contributions.
Related Work
Preference Optimization and RLHF.
$f$-Divergences and Alternative Geometries.
Trust-Region Methods.
Theoretical Framework: Policy Optimization in Hilbert Space
The Hilbert Space of Density Fluctuations
Probability Conservation as Orthogonality
The Driving Force: Metric Modulation
Methodology: Group Orthogonalized Policy Optimization
The Geometric Principle: Minimum Distance to the Unconstrained Target
Emergence of the Work-Dissipation Functional
The Orthogonal Projection Solution
...and 37 more sections

Key Result

Theorem 4.1

The optimal probability-conserving fluctuation $v^*$ is the orthogonal projection of the unconstrained target $u^*$ onto $\mathcal{H}_0$:
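The displayed equation of the theorem is not present in this listing. As a hedged reconstruction rather than the paper's exact statement, assume the inner product $\langle a, b \rangle = \mathbb{E}_{\pi_k}[a\,b]$ and $\mathcal{H}_0 = \{ v : \langle v, \mathbf{1} \rangle = 0 \}$, as the abstract describes. The standard formula for projecting onto the orthogonal complement of a unit vector then gives:

```latex
% Assumption: \langle a, b \rangle = \mathbb{E}_{\pi_k}[a b], so \|\mathbf{1}\|^2 = 1
% because \pi_k is a probability measure.
v^* \;=\; P_{\mathcal{H}_0} u^*
    \;=\; u^* - \langle u^*, \mathbf{1} \rangle \, \mathbf{1}
    \;=\; u^* - \mathbb{E}_{\pi_k}[u^*]
```

The subtracted constant corresponds to what the Figure 1 caption calls the chemical potential $\lambda^*$, possibly up to a factor of $\mu$ depending on the paper's normalization, which this listing does not show.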

Figures (4)

Figure 1: Geometric interpretation of GOPO. The theory operates in $L^2(\pi_k)$. The reference policy $\pi_k$ sits at the origin ($v=0$). Valid policies must reside in $\mathcal{H}_0$ (gray plane) and satisfy non-negativity, restricting them to the feasible polytope $\mathcal{K}$ (green). The unconstrained target $u^* = g_\alpha/\mu$ (red) is first projected vertically onto $\mathcal{H}_0$ by subtracting the chemical potential $\lambda^*$, then truncated along the plane onto the $\mathcal{K}$ boundary (dead zone), yielding the final bounded GOPO update $v^*$ (blue).
Figure 2: GOPO algorithm. Step 3 guarantees that the advantage vector lies in the zero-mean subspace $\hat{\mathcal{H}}_0$, eliminating the chemical potential. Steps 4--5 implement the empirical orthogonal projection with quadratic dissipation.
Figure 3: GOPO flowchart. The core operation (green) implements the empirical orthogonal projection: the advantage signal provides the linear driving force, while the quadratic dissipation term provides the constant-curvature regularization.
Figure 4: Training dynamics comparison. (a) Training reward: OPO and GOPO achieve the highest mean rewards over the training process. (b) Validation accuracy on MATH Level 4: OPO and GSPO lead at 48%, with GOPO at 47% and monotonically improving. (c) Gradient norm: OPO/GOPO maintain healthy norms throughout, while DAPO exhibits severe gradient saturation. (d) Policy entropy: GOPO preserves the most diversity, preventing premature mode collapse.

Theorems & Definitions (18)

Remark 3.1: Geometric Interpretation
Theorem 4.1: Optimal Fluctuation via Orthogonal Projection
proof
Remark 4.2: The Chemical Potential
Theorem 4.3: Bounded Projection Solution
Corollary 4.4: Exact Sparsity
Remark 4.5: Structural Simplification
Theorem 5.1: Constant Curvature Optimization
proof
Corollary 5.2: Structural Decoupling
...and 8 more
