Evidence on the Regularisation Properties of Maximum-Entropy Reinforcement Learning

Rémy Hosseinkhan Boucher, Onofrio Semeraro, Lionel Mathelin

TL;DR

This work analyses the robustness and generalisation of policies learnt via Maximum-Entropy Reinforcement Learning in chaotic PO-MDPs with Gaussian observation noise. It formalises robustness to observation noise through the excess risk under noise and shows that entropy regularisation correlates with improved robustness and a flatter, more regular loss landscape. The study ties robustness to complexity measures from learning theory: norm-based capacity metrics and the trace of the Fisher Information decrease with entropy, indicating a link between regularity and robustness. Through experiments on Lorenz and Kuramoto--Sivashinsky dynamics using PPO with varying entropy levels, it provides evidence that entropy regularisation acts as a regulariser and reduces the average Fisher Information, with practical implications for designing robust entropy-regularised RL algorithms.

Abstract

The generalisation and robustness properties of policies learnt through Maximum-Entropy Reinforcement Learning are investigated on chaotic dynamical systems with Gaussian noise on the observable. First, the robustness of entropy-regularised policies to noise contamination of the agent's observations is observed. Second, notions from statistical learning theory, such as complexity measures on the learnt model, are borrowed to explain and predict the phenomenon. Results show the existence of a relationship between entropy-regularised policy optimisation and robustness to noise, which can be described by the chosen complexity measures.

Paper Structure

This paper contains 23 sections, 8 equations, 4 figures.

Figures (4)

  • Figure 1: Distributional representation of the rate of excess risk under noise $\mathring{\mathcal{R}}^{\pi}$ conditioned on the $\alpha^i$ used during optimisation for different initial state distributions $X_0 \sim \mathcal{N}(x_e^{*},\, \sigma_{e}^2 I_d)$. Each row corresponds to one of the dynamical systems of interest. Each column corresponds to one of the initial state distributions of interest. There are two non-zero fixed points (equilibria) $x_e^*$ for Lorenz and three for KS. From top to bottom: KS; Lorenz. For each box plot, three intensities $\sigma_Y$ of the observation noise $\epsilon$ are evaluated. As expected, when the uncertainty regarding the observable $Y$ increases through the variance $\sigma_Y$ of the observation noise $\epsilon$, the policy performance decreases globally ($\mathring{\mathcal{R}}^{\pi}$ increases). Moreover, the rate of excess risk under noise tends to decrease when $\alpha^i$ increases in the Lorenz case, whereas for KS it decreases up to a certain entropy threshold before increasing again.
  • Figure 2: Measures of complexity $\mathcal{M}(\pi_\theta, \mathcal{D}) = \prod_{i=1}^{l} \| \theta_\mu^i \|_p$ with $p = 1,\, 2,\, \infty,\, F$ conditioned on the $\alpha^i$ used during optimisation. Each row corresponds to one of the dynamical systems of interest, while each column represents a different norm order $p$. From top to bottom: Lorenz and KS. For the Lorenz case, the barycenters of the measures tend to decrease when $\alpha^i$ increases. For KS, beyond a threshold, the complexity increases again with the entropy. In addition, the measures are much more concentrated when $\alpha^i > 0$. For $p = 2,\, F$, the separation of the measures w.r.t. the different $\alpha^i$ is more pronounced.
  • Figure 3: Distribution of the trace of the (conditional) Fisher information of the numerical optimal solution $\theta^*_{\mu, \alpha^i}$ for the policy w.r.t. the $\alpha^i$ used during optimisation. From left to right: Lorenz and KS environments. Colours: control experiment $\alpha^i = 0$ (black); intermediate entropy level $\alpha^i$ (blue); largest $\alpha^i$ (red). A distribution skewed towards (relatively) larger values is observed for all controlled dynamical systems. Moreover, those right tails exhibit high kurtosis, especially for the control experiment (black) and the model with the largest entropy coefficient (red) for the KS system. Finally, solutions with intermediate entropy levels (blue) are much more concentrated, i.e. they have lower variance than the others. For Lorenz, the barycenter of the more robust model (red) is shifted towards lower values than the others.
  • Figure 4: Evolution of $\overline{D}_{KL}\left( \theta^\alpha_m,\, \theta^\alpha_{m + 1} \right)$ during training for the Lorenz and KS controlled differential equations. For Lorenz, the maximal divergence is reached for the optimisation performed with $\alpha^i = 0$ and the second lowest $\alpha^i$. Regarding KS, the highest divergence values are observed for $\alpha^i = 0$ and the maximal entropy coefficient.
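
The norm-product complexity measure in the Figure 2 caption can be illustrated with a minimal sketch. It assumes that each $\theta_\mu^i$ is the weight matrix of layer $i$ of the policy mean network and that the norms for $p = 1,\, \infty$ are the induced matrix norms; neither assumption is confirmed by the captions above. The function names are hypothetical. The spectral case $p = 2$ is omitted because it needs a singular value decomposition (for instance `numpy.linalg.norm(w, 2)`).

```python
import math


def matrix_norm(w, p):
    """Norm of a matrix given as a list of rows, for p in {1, "inf", "fro"}."""
    if p == 1:
        # Induced 1-norm: largest absolute column sum.
        return max(sum(abs(row[j]) for row in w) for j in range(len(w[0])))
    if p == "inf":
        # Induced infinity-norm: largest absolute row sum.
        return max(sum(abs(v) for v in row) for row in w)
    if p == "fro":
        # Frobenius norm: square root of the sum of squared entries.
        return math.sqrt(sum(v * v for row in w for v in row))
    raise ValueError(f"unsupported norm order: {p!r}")


def norm_product_complexity(layers, p):
    """Product over layers of the chosen norm of each weight matrix."""
    return math.prod(matrix_norm(w, p) for w in layers)


# Hypothetical two-layer network, for illustration only.
layers = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[2.0, 0.0], [0.0, 2.0]],
]
for p in (1, "inf", "fro"):
    print(p, norm_product_complexity(layers, p))
```

Because the measure is a product over layers, shrinking the norm of any single layer shrinks the whole quantity proportionally, which is why it is sensitive to regularisation effects on the weights.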

Theorems & Definitions (4)

  • Definition: Excess Risk Under Noise
  • Definition: Complexity measure
  • Definition: (Rate of) Excess Risk Under Noise Bound
  • Definition: Conditional Fisher Information Matrix
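
The paper's exact definition of the conditional Fisher information matrix is not reproduced on this page. As a rough illustration only, the sketch below uses the textbook definition, the expected outer product of the score $\nabla_\theta \log \pi_\theta(a \mid x)$ over actions at a fixed observation $x$, for a hypothetical scalar-action linear Gaussian policy $a \sim \mathcal{N}(\theta^\top x,\, \sigma^2)$. For that policy the matrix is $x x^\top / \sigma^2$, so its trace is $\|x\|_2^2 / \sigma^2$. All function names and the policy form are assumptions made for this example.

```python
import random


def fisher_trace_closed_form(x, sigma):
    """Trace of x x^T / sigma^2 for the linear Gaussian policy."""
    return sum(v * v for v in x) / sigma**2


def fisher_trace_monte_carlo(theta, x, sigma, n_samples=20000, seed=0):
    """Estimate the trace as the mean squared norm of the score."""
    rng = random.Random(seed)
    mean = sum(t * v for t, v in zip(theta, x))
    sq_norm_x = sum(v * v for v in x)
    total = 0.0
    for _ in range(n_samples):
        a = rng.gauss(mean, sigma)
        # Score w.r.t. theta is ((a - mean) / sigma^2) * x.
        coeff = (a - mean) / sigma**2
        total += coeff * coeff * sq_norm_x
    return total / n_samples


# Hypothetical parameters and observation, for illustration only.
theta = [0.3, -0.1]
x = [1.0, 2.0]
sigma = 0.5
exact = fisher_trace_closed_form(x, sigma)
estimate = fisher_trace_monte_carlo(theta, x, sigma)
print(exact, estimate)
```

In this toy setting the trace does not depend on $\theta$; for a neural-network policy it does, which is what makes its distribution across trained solutions informative.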