Fractal Landscapes in Policy Optimization

Tao Wang; Sylvia Herbert; Sicun Gao

TL;DR

The paper addresses why policy-gradient methods can fail in continuous-control RL by revealing that the policy-space loss landscape can be fractal due to chaotic dynamics. It introduces a dynamical-systems framework using the maximal Lyapunov exponent $\lambda_{\max}$ and Hölder continuity to characterize local smoothness, proving that $V^{\pi_\theta}$ and $J(\theta)$ are $\frac{-\log\gamma}{\lambda(\theta)}$-Hölder when $\lambda(\theta)>-\log\gamma$, which can preclude differentiability if the exponent is below 1. A practical sampling-based estimator is proposed to detect fractal regions from finite samples, and experiments on inverted pendulum, acrobot, and hopper illustrate how fractal landscapes explain observed training failures and fluctuations in policy-gradient methods. The findings imply fundamental limits of first-order RL optimization in certain MDPs and offer a diagnostic tool for assessing local smoothness during training, with potential impacts on algorithm design and hyperparameter selection. The work thus connects chaos theory with RL optimization to explain and quantify non-smoothness beyond numerical noise.
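The condition $\lambda(\theta) > -\log\gamma$ and the resulting exponent $\frac{-\log\gamma}{\lambda(\theta)}$ can be checked numerically on a toy system. The sketch below is only an illustration: it assumes the logistic map $x \mapsto \theta x(1-x)$ as a stand-in one-dimensional system (the text above does not name a specific system), and the function names `logistic_mle` and `holder_exponent_bound` are hypothetical helpers, not taken from the paper.

```python
import math


def logistic_mle(theta, x0=0.3, burn_in=1000, steps=20000):
    """Estimate the maximal Lyapunov exponent of x -> theta * x * (1 - x).

    For a 1-D map the exponent is the long-run average of log|f'(x_t)|.
    The logistic map is an assumed toy system, used only for illustration.
    """
    x = x0
    for _ in range(burn_in):  # discard the transient
        x = theta * x * (1.0 - x)
    total = 0.0
    for _ in range(steps):
        # f'(x) = theta * (1 - 2x); the floor avoids log(0)
        total += math.log(max(abs(theta * (1.0 - 2.0 * x)), 1e-300))
        x = theta * x * (1.0 - x)
    return total / steps


def holder_exponent_bound(mle, gamma):
    """Return -log(gamma) / mle when mle > -log(gamma), otherwise None.

    None means the condition stated in the TL;DR does not hold, so this
    particular bound says nothing about non-smoothness.
    """
    threshold = -math.log(gamma)
    if mle > threshold:
        return threshold / mle
    return None


if __name__ == "__main__":
    for theta in (3.3, 3.9):
        lam = logistic_mle(theta)
        print(theta, lam, holder_exponent_bound(lam, gamma=0.99))
```

A returned value below 1 corresponds to the regime in which the TL;DR says differentiability can fail. A larger discount factor $\gamma$ lowers the threshold $-\log\gamma$, so more values of $\theta$ fall into that regime.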

Abstract

Policy gradient lies at the core of deep reinforcement learning (RL) in continuous domains. Despite much success, it is often observed in practice that RL training with policy gradient can fail for many reasons, even on standard control problems with known solutions. We propose a framework for understanding one inherent limitation of the policy gradient approach: the optimization landscape in the policy space can be extremely non-smooth or fractal for certain classes of MDPs, so that no gradient exists to be estimated in the first place. We draw on techniques from chaos theory and non-smooth analysis, and analyze the maximal Lyapunov exponents and Hölder exponents of the policy optimization objectives. Moreover, we develop a practical method that estimates the local smoothness of the objective function from samples, in order to identify when the training process has encountered fractal landscapes. We show experiments to illustrate how some failure cases of policy optimization can be explained by such fractal landscapes.
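One way to picture "estimating local smoothness from samples" is a log-log regression: perturb the parameters at several scales $\delta$, measure how much the objective spreads, and read the scaling exponent off the slope. The sketch below is an assumption-laden illustration of that idea, not the paper's estimator. The function name `estimate_holder_exponent`, the Gaussian perturbations, and the use of the standard deviation as the spread measure are all choices made here for illustration.

```python
import math
import random
import statistics


def estimate_holder_exponent(J, theta0, deltas, n_samples=200, seed=0):
    """Estimate a local scaling exponent of J around theta0.

    Illustrative sketch (hypothetical helper): for each scale delta,
    evaluate J at theta0 + delta * z for shared Gaussian directions z,
    take the standard deviation of those values, and fit a line to
    log(std) against log(delta). The slope is the estimate.
    """
    rng = random.Random(seed)
    dim = len(theta0)
    # The same directions are reused at every scale, so only delta varies.
    directions = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                  for _ in range(n_samples)]
    xs, ys = [], []
    for delta in deltas:
        values = [J([t + delta * z for t, z in zip(theta0, zs)])
                  for zs in directions]
        spread = statistics.pstdev(values)
        if spread > 0.0:  # skip scales with no measurable variation
            xs.append(math.log(delta))
            ys.append(math.log(spread))
    if len(xs) < 2:
        raise ValueError("need at least two scales with nonzero spread")
    # Ordinary least-squares slope of ys against xs.
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den


if __name__ == "__main__":
    scales = [10.0 ** -k for k in range(2, 7)]
    rough = lambda th: math.sqrt(math.sqrt(sum(t * t for t in th)))
    smooth = lambda th: sum(th)
    print(estimate_holder_exponent(rough, [0.0, 0.0, 0.0], scales))
    print(estimate_holder_exponent(smooth, [1.0, 2.0, 3.0], scales))
```

On the two toy objectives in the usage section, the slope is close to 0.5 for $\|\theta\|^{1/2}$ at the origin and close to 1 for a linear function. A slope clearly below 1 is the kind of signal that would suggest a non-smooth region. In an actual RL setting, `J` would be a noisy rollout-based return estimate, so the fit would be correspondingly noisier.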

Paper Structure (28 sections, 6 theorems, 41 equations, 8 figures)

Introduction
Related work
Preliminaries
Dynamical Systems as Markov Decision Processes
Policy gradient methods
Maximal Lyapunov Exponents
Fractal Landscapes
Fractal Landscapes in the Policy Space
Hölder Exponent of ${V^{\pi_\theta}(\cdot)}$
Proof sketch of Theorem \ref{valuecon}:
Hölder Exponent of $J(\cdot)$
Stochastic Policies
Estimating Hölder Exponents from Samples
Experiments
Inverted Pendulum.
...and 13 more sections

Key Result

Proposition 3.1

(Falconer) Let $F \subset \mathbb{R}^k$ be a subset and suppose that $\eta: F \rightarrow \mathbb{R}^p$ is $\alpha$-Hölder continuous with $\alpha > 0$. Then $\dim_H \eta(F) \leq \frac{1}{\alpha} \dim_H F$.
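As a quick sanity check of the bound in Proposition 3.1, consider the illustrative choice $F = [0,1]$, so that $\dim_H F = 1$ (this choice of $F$ and the values of $\alpha$ are assumptions made here for the example):

```latex
\begin{align*}
  \alpha = 1 \ (\text{Lipschitz}) &: \quad \dim_H \eta(F) \le \tfrac{1}{1}\cdot 1 = 1, \\
  \alpha = \tfrac{1}{2} &: \quad \dim_H \eta(F) \le \tfrac{1}{1/2}\cdot 1 = 2 .
\end{align*}
```

A Lipschitz map therefore cannot raise the Hausdorff dimension of its domain, while a map with a smaller Hölder exponent is allowed to. Read in the contrapositive, if $\dim_H \eta(F) > \dim_H F$, then $\eta$ cannot be Lipschitz on $F$, which is how a dimension statement can be turned into a non-smoothness statement.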

Figures (8)

Figure 1: An illustration of the two series \ref{t2} and \ref{t3} that need to cover the entire $\mathbb{R}$ when $\delta \rightarrow 0$.
Figure 2: The value of the MLE $\lambda(\theta)$ for $\theta \in [3.3, 3.9]$ is shown in \ref{fig:mle}. The graphs of the objective function $J(\theta)$ for different values of $\gamma$ are shown in \ref{fig:0.5}-\ref{fig:0.99zoomin}, where $J(\theta)$ is estimated by the sum of the first 1000 terms of the infinite series.
Figure 3: Experimental results for the inverted pendulum. In \ref{fig:ip4}, the linear regression result is obtained for $\gamma = 0.9$. The loss curves $J(\theta)$ are presented in \ref{fig:ip1}-\ref{fig:ip5}, where $\theta = \theta_0 + \delta \eta(\theta_0)$ with step size $10^{-7}$.
Figure 4: Experimental results for the acrobot. In \ref{fig:acrobot4}, the linear regression result is obtained for $\gamma = 0.9$. The loss curves $J(\theta)$ are presented in \ref{fig:acrobot1}-\ref{fig:acrobot5}, where $\theta = \theta_0 + \delta \eta(\theta_0)$ with step size $10^{-7}$.
Figure 5: Experimental results for the hopper. In \ref{fig:hopperlinear}, the linear regression result is obtained for $\gamma = 0.9$. The loss curves $J(\theta)$ are presented in \ref{fig:hopper25}-\ref{fig:hopper99}, where $\theta = \theta_0 + \delta \eta(\theta_0)$ with step size $10^{-3}$.
...and 3 more figures

Theorems & Definitions (22)

Definition 3.1
Definition 3.2
Definition 3.3
Definition 3.4
Definition 3.5
Proposition 3.1
Theorem 4.1
Example 4.1
Remark 4.1
Remark 4.2
...and 12 more
