Optimize Wider, Not Deeper: Consensus Aggregation for Policy Optimization

Zelal Su Mustafaoglu, Sungyoung Lee, Eshan Balachandar, Risto Miikkulainen, Keshav Pingali

Abstract

Proximal policy optimization (PPO) approximates the trust region update using multiple epochs of clipped SGD. Each epoch may drift further from the natural gradient direction, creating path-dependent noise. To understand this drift, we can use Fisher information geometry to decompose policy updates into signal (the natural gradient projection) and waste (the Fisher-orthogonal residual that consumes trust region budget without first-order surrogate improvement). Empirically, signal saturates but waste grows with additional epochs, creating an optimization-depth dilemma. We propose Consensus Aggregation for Policy Optimization (CAPO), which redirects compute from depth to width: $K$ PPO replicates are optimized on the same batch, differing only in minibatch shuffling order, and then aggregated into a consensus. We study aggregation in two spaces: Euclidean parameter space, and the natural parameter space of the policy distribution via the logarithmic opinion pool. In natural parameter space, the consensus provably achieves higher KL-penalized surrogate and tighter trust region compliance than the mean expert; parameter averaging inherits these guarantees approximately. On continuous control tasks, CAPO outperforms PPO and compute-matched deeper baselines under fixed sample budgets by up to 8.6x. CAPO demonstrates that policy optimization can be improved by optimizing wider, rather than deeper, without additional environment interactions.
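
The width-over-depth idea can be illustrated on a toy problem. The sketch below is not PPO and not the authors' implementation: it assumes a least-squares loss in place of the clipped surrogate, and the names `sgd_replicate`, `loss`, `K`, and all hyperparameters are hypothetical. It shows only the structure described above: $K$ replicates start from the same parameters, train on the same batch, differ only in minibatch shuffle order, and are then averaged in parameter space.

```python
import numpy as np

def sgd_replicate(theta0, X, y, seed, epochs=2, batch_size=8, lr=0.05):
    """One replicate: minibatch SGD on least squares; `seed` only sets the shuffle order."""
    rng = np.random.default_rng(seed)
    theta = theta0.copy()
    n = X.shape[0]
    for _ in range(epochs):
        order = rng.permutation(n)  # the only source of difference between replicates
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ theta - y[idx]
            grad = X[idx].T @ residual / len(idx)
            theta -= lr * grad
    return theta

def loss(theta, X, y):
    """Full-batch least-squares loss (stand-in for the negative surrogate)."""
    return 0.5 * np.mean((X @ theta - y) ** 2)

# One fixed batch shared by every replicate (toy data, assumed for illustration).
data_rng = np.random.default_rng(0)
X = data_rng.normal(size=(64, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.5 * data_rng.normal(size=64)
theta0 = np.zeros(4)

K = 4
experts = [sgd_replicate(theta0, X, y, seed=k) for k in range(K)]
consensus = np.mean(experts, axis=0)  # aggregation in Euclidean parameter space

mean_expert_loss = float(np.mean([loss(th, X, y) for th in experts]))
consensus_loss = float(loss(consensus, X, y))
print(f"mean expert loss: {mean_expert_loss:.4f}, consensus loss: {consensus_loss:.4f}")
```

Because this toy loss is convex, Jensen's inequality guarantees that the averaged parameters do no worse than the average expert. For a neural policy that guarantee does not hold exactly, which is consistent with the abstract's statement that parameter averaging inherits the guarantees only approximately.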

Paper Structure (36 sections, 3 theorems, 6 equations, 4 figures, 12 tables, 1 algorithm)

Key Result

Proposition 1

Under decomposition eq:decomp:

Figures (4)

  • Figure 1: Geometry of CAPO updates. (a) REINFORCE moves approximately along the gradient $g$ (solid), while TRPO moves approximately along the natural gradient $F^{-1}g$. Each PPO expert's move $\Delta^k$ decomposes into signal ($\bullet$, the projection onto $F^{-1}g$) and waste (dashed, the Fisher-orthogonal residual). In this diagram, waste points left for expert $\theta^1$ and right for expert $\theta^2$, so averaging reduces waste, and the consensus $\theta^{agg}$ lies closer to $F^{-1}g$ with lower KL than either expert (Theorem 2). (b) Return and signal--waste KL decomposition of the final PPO update for varying epoch counts $E$ on Hopper, mean over 3 seeds. Return peaks at $E\!=\!10$, then collapses as waste overwhelms the trust region budget.
  • Figure 2: CAPO pipeline. One on-policy batch is collected from the incumbent $\pi_t$ and fed to $K$ independent PPO copies that differ only in minibatch shuffle order. Expert policies are aggregated into a consensus $\pi_{t+1}$ in parameter space (avg) or distribution space (LogOP).
  • Figure 3: Gymnasium learning curves (8 seeds, shaded $\pm 1$ SE). CAPO leads on all tasks except Hopper, which is dominated by CAPO-Avg. PPO-$K\!\times$ collapses on Ant ($7\!\times$ below PPO) and Walker2d, validating the optimization-depth dilemma.
  • Figure 4: Fisher diagnostics for the last PPO update vs. epoch count (HalfCheetah, $(64,64)$ network, 3 seeds). Left: waste $\|\epsilon\|_F^2$ and signal $c^2$ on log scale. Center: total KL and return. Right: Fisher alignment. Waste grows $21\!\times$ from $E\!=\!2$ to $E\!=\!40$; returns peak at $E\!\approx\!6$--$15$ then decline; alignment $\alpha$ peaks at $0.46$ then drops to $0.26$.
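
The signal--waste split referenced in Figures 1 and 4 can be sketched numerically. The code below is a minimal illustration under stated assumptions, not the paper's definition: it assumes signal is the projection of an update $\Delta$ onto $F^{-1}g$ under the Fisher inner product $\langle a, b\rangle_F = a^\top F b$, and waste is the residual. The matrix `F`, the gradient `g`, the two expert updates, and the helper names are all made up for the example.

```python
import numpy as np

def fisher_decompose(delta, g, F):
    """Split delta into a component along F^{-1} g and a Fisher-orthogonal residual."""
    d = np.linalg.solve(F, g)       # natural gradient direction F^{-1} g
    beta = (delta @ g) / (g @ d)    # <delta, d>_F / <d, d>_F, using F d = g
    signal = beta * d
    waste = delta - signal          # satisfies waste @ g == 0 up to rounding
    return signal, waste

def fisher_sq_norm(v, F):
    """Squared Fisher norm v^T F v (twice the second-order KL approximation)."""
    return float(v @ F @ v)

# Hypothetical 2-D Fisher matrix and gradient.
F = np.array([[2.0, 0.3], [0.3, 1.0]])
g = np.array([1.0, 0.5])
d = np.linalg.solve(F, g)
w = np.array([-g[1], g[0]])         # w @ g == 0, so w is Fisher-orthogonal to d

# Two experts with the same signal and opposing waste, as in the Figure 1(a) diagram.
delta1 = 0.10 * d + 0.08 * w
delta2 = 0.10 * d - 0.06 * w
delta_avg = 0.5 * (delta1 + delta2)

signal1, waste1 = fisher_decompose(delta1, g, F)
signal2, waste2 = fisher_decompose(delta2, g, F)
signal_avg, waste_avg = fisher_decompose(delta_avg, g, F)

for name, wv in [("expert 1", waste1), ("expert 2", waste2), ("average", waste_avg)]:
    print(name, "waste (squared Fisher norm):", fisher_sq_norm(wv, F))
```

Under this definition the waste has zero inner product with $g$, so it contributes no first-order surrogate improvement, while the squared Fisher norm of the full update is the sum of the signal and waste terms. In this constructed example the averaged update keeps the signal and has less waste than either expert.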

Theorems & Definitions (7)

  • Definition 1: Fisher signal--waste decomposition
  • Proposition 1: Signal--waste separation
  • proof
  • Theorem 2: Consensus improvement in natural parameter space
  • proof
  • Remark 1: Connection to the Fisher decomposition
  • Corollary 3: Consensus is a better policy improvement step
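
For intuition about aggregation in natural parameter space, the logarithmic opinion pool has a closed form for Gaussians: the pooled density is proportional to $\prod_k \pi_k^{w_k}$, so the pooled precision is the weighted average of expert precisions and the pooled mean is the precision-weighted mean. The sketch below assumes one-dimensional Gaussian policies at a single state, equal weights, and a trust region measured as $\mathrm{KL}(\pi_{\text{old}} \,\|\, \cdot)$; the function names and numbers are hypothetical and are not taken from the paper.

```python
import math

def logop_gaussian(means, stds, weights=None):
    """Logarithmic opinion pool of 1-D Gaussians; weights must sum to 1."""
    k = len(means)
    if weights is None:
        weights = [1.0 / k] * k
    precisions = [1.0 / s ** 2 for s in stds]
    precision = sum(w * p for w, p in zip(weights, precisions))
    mean = sum(w * p * m for w, p, m in zip(weights, precisions, means)) / precision
    return mean, 1.0 / math.sqrt(precision)

def kl_gaussian(m0, s0, m1, s1):
    """KL( N(m0, s0^2) || N(m1, s1^2) ) in closed form."""
    return math.log(s1 / s0) + (s0 ** 2 + (m0 - m1) ** 2) / (2.0 * s1 ** 2) - 0.5

# Incumbent policy and three hypothetical experts at one state.
old_mean, old_std = 0.0, 1.0
expert_means = [0.3, -0.1, 0.2]
expert_stds = [0.9, 1.1, 0.8]

pool_mean, pool_std = logop_gaussian(expert_means, expert_stds)
kl_consensus = kl_gaussian(old_mean, old_std, pool_mean, pool_std)
kl_mean_expert = sum(
    kl_gaussian(old_mean, old_std, m, s) for m, s in zip(expert_means, expert_stds)
) / len(expert_means)

print(f"KL(old || consensus)     = {kl_consensus:.4f}")
print(f"mean_k KL(old || expert) = {kl_mean_expert:.4f}")
```

With the KL taken in this direction, the pooled policy is never farther from the incumbent than the average expert, because the normalizer of the geometric mixture is at most one. Whether this matches the exact statement and KL direction of Theorem 2 is an assumption of this sketch.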