Trust Regions Sell, But Who's Buying? Overlap Geometry as an Alternative Trust Region for Policy Optimization
Gaurish Trivedi, Alakh Sharma, Kartikey Singh Bhandari, Yash Sinha, Pratik Narang, Dhruv Kumar, Jagat Sesh Challa
TL;DR
This work addresses instability in policy-gradient RL arising from rare, large likelihood-ratio excursions under KL-based trust regions. It introduces overlap geometry by parameterizing policies with square-root densities $\psi_\theta(a \mid s)=\sqrt{\pi_\theta(a \mid s)}$ and using the Bhattacharyya coefficient $\rho_s(\theta,\theta')=\langle\psi_\theta,\psi_{\theta'}\rangle$, which induces a Fisher-like local geometry but remains bounded. The authors derive a first-order surrogate $L_{\mathrm{Hell}}(\theta)=\mathbb{E}_{\mathrm{old}}[2(q_\theta-1)A_{\mathrm{old}}]$ with $q_\theta=\sqrt{r_\theta}$ and instantiate BPPO (clipped $q$) and BTRPO (Hellinger regularization) as practical on-policy algorithms, offering principled tail control without KL clipping. Across MuJoCo, DM Control, and Procgen benchmarks with matched budgets, overlap-based updates improve robustness and aggregate performance, with BPPO providing the strongest gains and smooth update behavior. This overlap geometry thereby offers a principled, scalable alternative to KL for stable policy optimization and motivates adaptive tuning and broader evaluations.
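One way to read the factor of 2 in the surrogate, offered as a sketch under the assumption that $L_{\mathrm{Hell}}$ linearizes the usual importance-weighted objective $\mathbb{E}_{\mathrm{old}}[r_\theta A_{\mathrm{old}}]$ around $r_\theta=1$ (the paper's own derivation may differ): since $r_\theta=q_\theta^2$,

$$r_\theta-1 \;=\; q_\theta^2-1 \;=\; 2(q_\theta-1)+(q_\theta-1)^2,$$

so dropping the second-order term gives $\mathbb{E}_{\mathrm{old}}[(r_\theta-1)A_{\mathrm{old}}]\approx\mathbb{E}_{\mathrm{old}}[2(q_\theta-1)A_{\mathrm{old}}]$. For discrete actions on the support of the old policy, the overlap is the mean square-root ratio,

$$\rho_s(\theta_{\mathrm{old}},\theta)\;=\;\sum_a\sqrt{\pi_{\theta_{\mathrm{old}}}(a \mid s)\,\pi_\theta(a \mid s)}\;=\;\mathbb{E}_{a\sim\pi_{\theta_{\mathrm{old}}}}[q_\theta]\;\le\;1,$$

where the bound follows from Cauchy-Schwarz because both square-root densities have unit norm.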
Abstract
Standard trust-region methods constrain policy updates via Kullback-Leibler (KL) divergence. However, KL controls only an average divergence and does not directly prevent rare, large likelihood-ratio excursions that destabilize training, which is precisely the failure mode that motivates heuristics such as PPO's clipping. We propose overlap geometry as an alternative trust region, constraining distributional overlap via the Bhattacharyya coefficient (closely related to the Hellinger/Rényi-1/2 geometry). This objective penalizes separation in the ratio tails, yielding tighter control over likelihood-ratio excursions without relying on total variation bounds that can be loose in tail regimes. We derive Bhattacharyya-TRPO (BTRPO) and Bhattacharyya-PPO (BPPO), which enforce overlap constraints via square-root ratio updates: BPPO clips the square-root ratio $q=\sqrt{r}$, and BTRPO applies a quadratic Hellinger/Bhattacharyya penalty. Empirically, overlap-based updates improve robustness and aggregate performance as measured by RLiable under matched training budgets, suggesting overlap constraints as a practical, principled alternative to KL for stable policy optimization.
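To make the quantities in the abstract concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It computes the Bhattacharyya coefficient and squared Hellinger distance for discrete distributions, and a per-sample clipped square-root-ratio term. The function names, the default `eps=0.1`, and the PPO-style pessimistic `min` form of the clipped objective are all assumptions made for illustration; the abstract states only that BPPO clips $q=\sqrt{r}$.

```python
import math


def bhattacharyya_coefficient(p_old, p_new):
    """Overlap sum_a sqrt(p_old(a) * p_new(a)) of two discrete distributions.

    Equals 1 for identical distributions and 0 for disjoint supports.
    """
    return sum(math.sqrt(a * b) for a, b in zip(p_old, p_new))


def squared_hellinger(p_old, p_new):
    """Squared Hellinger distance, H^2 = 1 - Bhattacharyya coefficient."""
    return 1.0 - bhattacharyya_coefficient(p_old, p_new)


def bppo_clipped_term(logp_new, logp_old, advantage, eps=0.1):
    """Per-sample clipped surrogate on q = sqrt(r) (hypothetical form).

    Assumption: mirrors PPO's pessimistic min, applied to 2 * (q - 1) * A
    instead of r * A. The source does not specify the exact objective.
    """
    # q = sqrt(pi_new / pi_old), computed in log space for stability.
    q = math.exp(0.5 * (logp_new - logp_old))
    q_clipped = min(max(q, 1.0 - eps), 1.0 + eps)
    unclipped = 2.0 * (q - 1.0) * advantage
    clipped = 2.0 * (q_clipped - 1.0) * advantage
    return min(unclipped, clipped)


if __name__ == "__main__":
    old = [0.5, 0.3, 0.2]
    new = [0.4, 0.4, 0.2]
    print("overlap:", bhattacharyya_coefficient(old, new))
    print("H^2:", squared_hellinger(old, new))
    # A sample whose probability quadrupled (r = 4, q = 2) with A = +1:
    print("clipped term:", bppo_clipped_term(math.log(0.8), math.log(0.2), 1.0))
```

Under this assumed form, a clip range of $\epsilon$ on $q$ corresponds to $r\in[(1-\epsilon)^2,(1+\epsilon)^2]$, so the bounds on the raw ratio are asymmetric around 1. An average of such per-sample terms over a batch would be maximized by gradient ascent, as in PPO.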
