Trust Regions Sell, But Who's Buying? Overlap Geometry as an Alternative Trust Region for Policy Optimization

Gaurish Trivedi; Alakh Sharma; Kartikey Singh Bhandari; Yash Sinha; Pratik Narang; Dhruv Kumar; Jagat Sesh Challa

TL;DR

This work addresses instability in policy-gradient RL arising from rare, large likelihood-ratio excursions under KL-based trust regions. It introduces overlap geometry by parameterizing policies with square-root densities $\psi_\theta(a|s)=\sqrt{\pi_\theta(a|s)}$ and using the Bhattacharyya coefficient $\rho_s(\theta,\theta')=\langle\psi_\theta,\psi_{\theta'}\rangle$, which induces a Fisher-like local geometry but remains bounded. The authors derive a first-order surrogate $L_{Hell}(\theta)=\mathbb{E}_{old}[2(q_\theta-1)A_{old}]$ with $q_\theta=\sqrt{r_\theta}$ and instantiate BPPO (clipped $q$) and BTRPO (Hellinger regularization) as practical on-policy algorithms, offering principled tail control without KL clipping. Across MuJoCo, DM Control, and Procgen benchmarks with matched budgets, overlap-based updates improve robustness and aggregate performance, with BPPO providing the strongest gains and smooth update behavior. This overlap geometry thereby offers a principled, scalable alternative to KL for stable policy optimization and motivates adaptive tuning and broader evaluations.
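A short worked expansion may help show why the square-root surrogate agrees with the usual ratio objective to first order. It is a sketch under assumed conventions that may differ from the paper's by constant factors: $r_\theta=\pi_\theta/\pi_{old}$, expectations are over actions drawn from the old policy at a fixed state $s$, the two policies share support (so $\mathbb{E}_{old}[r_\theta]=1$), and the squared Hellinger distance is normalized as $H^2=1-\rho$.

$$
\rho_s(\theta_{old},\theta)=\int \sqrt{\pi_{old}(a|s)\,\pi_\theta(a|s)}\,da=\mathbb{E}_{old}[q_\theta],
\qquad
H_s^2 = 1-\rho_s = \tfrac{1}{2}\,\mathbb{E}_{old}\big[(q_\theta-1)^2\big]
$$

$$
2(q_\theta-1)=2\big(\sqrt{r_\theta}-1\big)=(r_\theta-1)-\tfrac{1}{4}(r_\theta-1)^2+O\big((r_\theta-1)^3\big),
\qquad
\nabla_\theta\, 2(q_\theta-1)\Big|_{\theta_{old}}=\nabla_\theta\, r_\theta\Big|_{\theta_{old}}
$$

Under these assumptions the surrogate has the same gradient as the standard ratio objective at $\theta_{old}$, while for large $r_\theta$ the per-sample weight grows like $2\sqrt{r_\theta}$ rather than linearly in $r_\theta$.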

Abstract

Standard trust-region methods constrain policy updates via Kullback-Leibler (KL) divergence. However, KL controls only an average divergence and does not directly prevent rare, large likelihood-ratio excursions that destabilize training, which is precisely the failure mode that motivates heuristics such as PPO's clipping. We propose overlap geometry as an alternative trust region, constraining distributional overlap via the Bhattacharyya coefficient (closely related to the Hellinger/Rényi-1/2 geometry). This objective penalizes separation in the ratio tails, yielding tighter control over likelihood-ratio excursions without relying on total variation bounds that can be loose in tail regimes. We derive Bhattacharyya-TRPO (BTRPO) and Bhattacharyya-PPO (BPPO), enforcing overlap constraints via square-root ratio updates: BPPO clips the square-root ratio $q=\sqrt{r}$, and BTRPO applies a quadratic Hellinger/Bhattacharyya penalty. Empirically, overlap-based updates improve robustness and aggregate performance as measured by RLiable under matched training budgets, suggesting overlap constraints as a practical, principled alternative to KL for stable policy optimization.
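The two updates described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the PPO-style pessimistic `min` over clipped and unclipped terms, the clip width `eps`, and the penalty weight `beta` are all assumptions made for the example. Only the square-root ratio $q=\sqrt{r}$, the surrogate $2(q-1)A$, clipping of $q$, and a quadratic overlap penalty come from the text.

```python
import numpy as np


def sqrt_ratio(logp_new, logp_old):
    """q = sqrt(pi_new / pi_old), computed in log space for stability."""
    diff = np.asarray(logp_new, dtype=float) - np.asarray(logp_old, dtype=float)
    return np.exp(0.5 * diff)


def bppo_surrogate(logp_new, logp_old, adv, eps=0.1):
    """Clipped square-root surrogate (to be maximized).

    Assumption: clipping is applied PPO-style, taking the minimum of the
    unclipped and clipped terms; eps is a hypothetical clip width on q.
    """
    q = sqrt_ratio(logp_new, logp_old)
    adv = np.asarray(adv, dtype=float)
    unclipped = 2.0 * (q - 1.0) * adv
    clipped = 2.0 * (np.clip(q, 1.0 - eps, 1.0 + eps) - 1.0) * adv
    return float(np.mean(np.minimum(unclipped, clipped)))


def btrpo_surrogate(logp_new, logp_old, adv, beta=1.0):
    """Square-root surrogate minus a quadratic overlap penalty (to be maximized).

    Assumption: the penalty is mean((q - 1)^2), a sample estimate of
    2 * (1 - Bhattacharyya coefficient); beta is a hypothetical weight.
    """
    q = sqrt_ratio(logp_new, logp_old)
    adv = np.asarray(adv, dtype=float)
    surrogate = np.mean(2.0 * (q - 1.0) * adv)
    penalty = np.mean((q - 1.0) ** 2)
    return float(surrogate - beta * penalty)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logp_old = rng.normal(-1.0, 0.3, size=8)
    logp_new = logp_old + rng.normal(0.0, 0.2, size=8)
    adv = rng.normal(size=8)
    print("BPPO surrogate:", bppo_surrogate(logp_new, logp_old, adv))
    print("BTRPO surrogate:", btrpo_surrogate(logp_new, logp_old, adv))
```

In this sketch a sample with positive advantage stops contributing gradient once $q$ exceeds $1+\epsilon$, which corresponds to $r$ exceeding $(1+\epsilon)^2$, so a given `eps` on $q$ is not directly comparable to the same numeric value used as a PPO clip on $r$.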

Paper Structure (77 sections, 23 equations, 37 figures, 10 tables, 1 algorithm)

Introduction
Research questions.
Empirical motivation. What actually goes wrong in practice?
Derivation
Square-Root Policy Geometry
Hellinger–Bhattacharyya Geometry and Local Fisher Structure
Bhattacharyya coefficient and Hellinger distance.
Local equivalence to the Fisher metric.
State-averaged overlap.
A First-Order Surrogate with Square-Root Ratios
Implemented form.
BTRPO: BC/Hellinger regularization.
BPPO: clipped square-root surrogate.
Why square-root ratios reduce tail sensitivity?
Hellinger-Regularized Trust Region
...and 62 more sections

Figures (37)

Figure 1: A motivating signature of update shrinkage on Humanoid (9 seeds, 0–8; 10M steps). (a) BPPO continues to improve throughout training while PPO plateaus early. (b–c) This gap coincides with markedly different likelihood-ratio dynamics: PPO’s likelihood-ratio statistics contract over training (both the mean and the upper tail move toward $1$ or below), consistent with increasingly constrained effective updates, whereas BPPO maintains near-unity mean ratio and a stable, non-trivial upper tail. Shaded regions indicate variability across seeds (as plotted).
Figure 2: MuJoCo learning curves (9 seeds, 0–8). Mean episodic return over training on six MuJoCo tasks; shaded regions indicate $\pm 1$ standard deviation across seeds. BPPO generally achieves higher returns and more sustained late-stage improvement on harder locomotion tasks.
Figure 3: DM Control learning curves (9 seeds, 0–8). Average episodic return vs. environment steps on five tasks; shaded regions indicate $\pm 1$ standard deviation across seeds. BPPO shows stronger learning progress on cheetah_run, hopper_hop, and walker_walk, while PPO remains competitive on cartpole_swingup.
Figure 4: BipedalWalker-v3 baseline learning curves (4 seeds).
Figure 5: BipedalWalker-v3 baseline with entropy bonus learning curves (4 seeds).
...and 32 more figures
