Mollification Effects of Policy Gradient Methods

Tao Wang; Sylvia Herbert; Sicun Gao

TL;DR

The paper tackles why policy gradient methods can succeed or fail on non-smooth, chaotic reinforcement learning problems by introducing mollification through stochastic policy noise. It shows that policy gradient updates correspond to gradient ascent on the heat equation solution, with the Gaussian noise acting as a mollifier that suppresses high-frequency, fractal components but can diverge from the true objective if overused. A key theoretical result is that the backward heat problem is ill-posed, implying a fundamental trade-off: too little smoothing preserves problematic landscape features, while too much smoothing can erase the optimal policy, a tension formalized via the uncertainty principle. Experiments across Hopper, double pendulum, and planar quadrotor tasks illustrate both the stabilizing and destabilizing effects of mollification, providing practical insight into choosing the stochasticity level in policy search for nonlinear and chaotic dynamics.
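The link between Gaussian smoothing and the heat equation can be made concrete with a standard identity. The symbols $J$ (the deterministic objective), $u$, $t$, and $d$ (the parameter dimension) are notation assumed here for illustration, and the paper's own normalization may differ. Smoothing $J$ with Gaussian noise of variance $\sigma^2 = 2t$ gives

$$
u(\theta, t) = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\left[ J\left(\theta + \sqrt{2t}\,\epsilon\right) \right] = (G_t * J)(\theta), \qquad G_t(x) = (4\pi t)^{-d/2} \exp\left(-\frac{\|x\|^2}{4t}\right),
$$

which solves the heat equation with $J$ as initial data:

$$
\partial_t u = \Delta_\theta u, \qquad u(\theta, 0) = J(\theta).
$$

In the Fourier domain the forward solution damps each frequency $\xi$, while recovering $J$ from $u$ amplifies it by the same factor:

$$
\hat{u}(\xi, t) = e^{-t\|\xi\|^2}\, \hat{J}(\xi), \qquad \hat{J}(\xi) = e^{t\|\xi\|^2}\, \hat{u}(\xi, t).
$$

The first relation is the mollification effect: high-frequency components of the landscape are suppressed exponentially fast. The second shows why the backward problem is ill-posed: any small high-frequency error in $u$ is amplified exponentially when stepping back toward $t = 0$, which corresponds to the deterministic policy.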

Abstract

Policy gradient methods have enabled deep reinforcement learning (RL) to approach challenging continuous control problems, even when the underlying systems involve highly nonlinear dynamics that generate complex non-smooth optimization landscapes. We develop a rigorous framework for understanding how policy gradient methods mollify non-smooth optimization landscapes to enable effective policy search, as well as the downside of it: while making the objective function smoother and easier to optimize, the stochastic objective deviates further from the original problem. We demonstrate the equivalence between policy gradient methods and solving backward heat equations. Following the ill-posedness of backward heat equations from PDE theory, we present a fundamental challenge to the use of policy gradient under stochasticity. Moreover, we make the connection between this limitation and the uncertainty principle in harmonic analysis to understand the effects of exploration with stochastic policies in RL. We also provide experimental results to illustrate both the positive and negative aspects of mollification effects in practice.
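The trade-off described above can be illustrated with a minimal sketch on a one-dimensional toy landscape. Nothing here comes from the paper's code: `rough_objective`, `mollified_value`, and `mollified_gradient` are hypothetical names, the Weierstrass-type sum is only a stand-in for a fractal return landscape, and a scalar parameter replaces policy rollouts. The gradient estimator is the usual score-function (likelihood-ratio) form for a Gaussian perturbation with a baseline subtracted.

```python
import numpy as np


def rough_objective(theta, n_terms=12, a=0.6, b=3.0):
    """Toy non-smooth landscape (an assumption, not the paper's objective):
    a concave bowl plus a truncated Weierstrass-type sum."""
    theta = np.asarray(theta, dtype=float)
    k = np.arange(n_terms)
    # Broadcast theta against the frequencies b**k, then sum over the terms.
    fractal = np.sum(a**k * np.cos(b**k * np.pi * theta[..., None]), axis=-1)
    return -theta**2 + 0.3 * fractal


def mollified_value(objective, theta, sigma, n_samples=20000, rng=None):
    """Monte Carlo estimate of E[objective(theta + sigma * eps)], eps ~ N(0, 1)."""
    rng = np.random.default_rng(0) if rng is None else rng
    eps = rng.standard_normal(n_samples)
    return float(np.mean(objective(theta + sigma * eps)))


def mollified_gradient(objective, theta, sigma, n_samples=20000, rng=None):
    """Score-function estimate of the gradient of the mollified objective.

    Uses d/dtheta E[f(theta + sigma * eps)] = E[f(theta + sigma * eps) * eps] / sigma.
    Subtracting the baseline objective(theta) keeps the estimate unbiased
    (because E[eps] = 0) while reducing its variance.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    eps = rng.standard_normal(n_samples)
    baseline = objective(theta)
    return float(np.mean((objective(theta + sigma * eps) - baseline) * eps) / sigma)


if __name__ == "__main__":
    grid = np.linspace(-1.0, 1.0, 81)
    for sigma in (0.01, 0.3, 3.0):
        values = np.array([mollified_value(rough_objective, th, sigma) for th in grid])
        # Total variation on the grid is a crude roughness measure.
        roughness = np.sum(np.abs(np.diff(values)))
        print(f"sigma={sigma}: roughness={roughness:.3f}, argmax={grid[np.argmax(values)]:.3f}")
```

When `rng` is left as `None`, every call reuses the same seed, so the same noise samples are shared across all values of `theta`. That choice keeps the estimated curve from picking up extra sampling jitter between neighbouring grid points; passing an explicit generator gives independent draws instead.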

Paper Structure (35 sections, 12 theorems, 41 equations, 13 figures, 2 tables)

Introduction
Related Work
Optimization Landscapes in RL.
Policy Gradient over Non-Smooth Landscapes.
Mollification in Stochastic Optimization.
Preliminaries
Policy gradient methods.
Fractal landscapes in RL.
Cauchy problem for heat equations.
The Dynamics of Policy Improvement
Smoothing by mollification.
Mollified optimization landscape.
Policy parameterization.
Anisotropic Gaussian distributions.
The Limitations of Mollification
...and 20 more sections

Key Result

Proposition 3.1

Assume that the dynamics, reward function, and policy are all Lipschitz continuous with respect to their input variables. Let $\pi_\theta$ be a deterministic policy and $\lambda(\theta)$ denote the maximal Lyapunov exponent (MLE) of the system. Suppose that $\lambda(\theta) > -\log \gamma$, then

Figures (13)

Figure 1: The Gaussian kernel in the policy gradient mollifies the optimization landscape. However, when the variance $\sigma^2$ is too small, the landscape remains highly non-smooth. Conversely, if the variance is too large, the Gaussian kernel over-smooths the landscape, eliminating the optimal solution. Both of these lead to failures in the hopper stand task. Details are available in the experiments section.
Figure 2: Fractal landscapes occur in chaotic MDPs. For instance, the objective landscape of the double pendulum system as shown in (a) has a fractal structure, in contrast to the non-chaotic single pendulum system in (b). Both systems are controlled by deterministic neural network policies.
Figure 3: (a) The heat equation smooths the initial temperature distribution as $t$ increases; (b) The gradient flow of $u(\mu, \sigma^2)$ in the solution space.
Figure 4: Hopper stand: the hopper failed to learn standing when $\sigma = 0.005$.
Figure 5: Hopper stand: the hopper successfully learned to stand when $\sigma = 0.05$.
...and 8 more figures

Theorems & Definitions (21)

Proposition 3.1
Proposition 4.1
Proposition 4.2: Strong Maximum Principle (Evans)
Theorem 4.3
proof
Theorem 4.4
Theorem 4.5
Proposition 5.1
Theorem 5.2: Ill-posedness around deterministic policies
Remark 5.3
...and 11 more
