Mind Your Entropy: From Maximum Entropy to Trajectory Entropy-Constrained RL

Guojian Zhan, Likun Wang, Pengcheng Wang, Feihong Zhang, Jingliang Duan, Masayoshi Tomizuka, Shengbo Eben Li

TL;DR

This work identifies two core bottlenecks in maximum entropy RL: non-stationary Q-value targets caused by jointly injecting entropy and updating the temperature, and the short-sightedness of local, single-step entropy tuning. It introduces Trajectory Entropy-Constrained RL (TECRL), which decouples reward and entropy learning via two Q-functions and enforces a trajectory-level entropy budget through an entropy critic, enabling control of long-horizon stochasticity. The practical instantiation, DSAC-E, extends DSAC-T with four components (PEV, PIS, PIM, TUP) and a trajectory entropy budget, yielding higher final returns and improved stability on eight MuJoCo tasks. The results indicate that trajectory-level entropy management can improve both exploration efficiency and exploitation quality, with potential applicability to robotics and advanced AI systems.

Abstract

Maximum entropy has become a mainstream off-policy reinforcement learning (RL) framework for balancing exploitation and exploration. However, two bottlenecks still limit further performance improvement: (1) non-stationary Q-value estimation caused by jointly injecting entropy and updating its weighting parameter, i.e., the temperature; and (2) short-sighted local entropy tuning that adjusts the temperature only according to the current single-step entropy, without considering the effect of cumulative entropy over time. In this paper, we extend the maximum entropy framework by proposing a trajectory entropy-constrained reinforcement learning (TECRL) framework to address these two challenges. Within this framework, we first learn two separate Q-functions, one associated with reward and the other with entropy, ensuring clean and stable value targets unaffected by temperature updates. Then, the dedicated entropy Q-function, which explicitly quantifies the expected cumulative entropy, enables us to enforce a trajectory entropy constraint and consequently control the policy's long-term stochasticity. Building on this TECRL framework, we develop a practical off-policy algorithm, DSAC-E, by extending the state-of-the-art distributional soft actor-critic with three refinements (DSAC-T). Empirical results on the OpenAI Gym benchmark demonstrate that DSAC-E can achieve higher returns and better stability.
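The decoupling described in the abstract can be illustrated with a small sketch. It is not the DSAC-E implementation: the function names, the discounted form of the entropy return, and the dual-ascent temperature rule are all assumptions made for illustration. The point it shows is that the reward return never touches the temperature, while the temperature reacts to the cumulative entropy rather than to a single-step value.

```python
import math

def policy_entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def separate_returns(rewards, entropies, gamma):
    """Discounted reward return and discounted entropy return of one trajectory.

    The two quantities are accumulated independently, so the reward return
    does not depend on any temperature parameter (assumed form, not the paper's).
    """
    g_reward, g_entropy = 0.0, 0.0
    for r, h in zip(reversed(rewards), reversed(entropies)):
        g_reward = r + gamma * g_reward
        g_entropy = h + gamma * g_entropy
    return g_reward, g_entropy

def update_temperature(alpha, trajectory_entropy, budget, lr):
    """Hypothetical dual-ascent step on the temperature.

    Raises alpha when the cumulative entropy falls short of the budget and
    lowers it otherwise; alpha is clipped at zero.
    """
    return max(0.0, alpha + lr * (budget - trajectory_entropy))

# Three steps of a uniform two-action policy with unit reward.
GAMMA = 0.5
rewards = [1.0, 1.0, 1.0]
entropies = [policy_entropy([0.5, 0.5])] * 3
g_reward, g_entropy = separate_returns(rewards, entropies, GAMMA)

alpha_low_entropy = update_temperature(0.2, g_entropy, budget=2.0, lr=0.1)
alpha_high_entropy = update_temperature(0.2, g_entropy, budget=0.5, lr=0.1)
```

In this toy setting the cumulative entropy is below a budget of 2.0, so the temperature rises, and above a budget of 0.5, so it falls. In both cases `g_reward` is unchanged, which is the property the abstract refers to as value targets unaffected by temperature updates.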

Paper Structure

This paper contains 39 sections, 1 theorem, 25 equations, 6 figures, 4 tables, 1 algorithm.

Key Result

Lemma 1

Let $(X, d)$ be a complete metric space, and let $\mathcal{B}: X \to X$ be a $\gamma$-contraction mapping with $0 < \gamma < 1$. This means that $d(\mathcal{B}(x), \mathcal{B}(y)) \le \gamma \, d(x, y)$ for all $x, y \in X$, where $d$ is the metric on $X$. According to Banach’s fixed-point theorem, $\mathcal{B}$ has a unique fixed point $x^* \in X$ such that $\mathcal{B}(x^*) = x^*$. Furthermore, for any initial point $x_0 \in X$, the iterative sequenc
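A numerical illustration of the fixed-point property may help. The sketch below is not taken from the paper: it assumes a scalar affine map $\mathcal{B}(x) = \gamma x + 1$ on the real line with $d(x, y) = |x - y|$, which is a $\gamma$-contraction with fixed point $x^* = 1 / (1 - \gamma)$. It iterates the map from an arbitrary start and records the distance to $x^*$, which by the standard consequence of Banach's theorem shrinks at least as fast as $\gamma^n$.

```python
def iterate_contraction(operator, x0, n_steps):
    """Return the sequence x_0, B(x_0), B(B(x_0)), ... of length n_steps + 1."""
    xs = [x0]
    for _ in range(n_steps):
        xs.append(operator(xs[-1]))
    return xs

GAMMA = 0.5  # contraction factor, 0 < GAMMA < 1

def B(x):
    """Illustrative affine contraction: |B(x) - B(y)| = GAMMA * |x - y|."""
    return GAMMA * x + 1.0

X_STAR = 1.0 / (1.0 - GAMMA)  # unique fixed point of B, here 2.0

trajectory = iterate_contraction(B, x0=10.0, n_steps=40)
errors = [abs(x - X_STAR) for x in trajectory]
```

Starting from any other `x0` gives the same limit; only the initial error `errors[0]` changes.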

Figures (6)

  • Figure 1: Comparison between standard maximum entropy RL (left) and our trajectory entropy-constrained (TEC) RL (right). Our TECRL framework comprises four key components: a reward-centric policy evaluation (PEV), an entropy-centric policy introspection (PIS), a policy improvement (PIM) that retains the exact soft policy objective, and a temperature update (TUP) tuning the temperature guided by the trajectory entropy constraint.
  • Figure 2: Training curves on benchmarks. The solid lines correspond to the mean and the shaded regions correspond to the 95% confidence interval over five runs.
  • Figure 3: Ablation on the TEC and RES.
  • Figure 4: Ablation on the sensitivity to the trajectory entropy budget.
  • Figure 5: Benchmarks. (a) Humanoid-v3: $(s \times a) \in \mathbb{R}^{376} \times \mathbb{R}^{17}$. (b) Ant-v3: $(s \times a) \in \mathbb{R}^{111} \times \mathbb{R}^{8}$. (c) HalfCheetah-v3: $(s \times a) \in \mathbb{R}^{17} \times \mathbb{R}^{6}$. (d) Walker2d-v3: $(s \times a) \in \mathbb{R}^{17} \times \mathbb{R}^{6}$. (e) Hopper-v3: $(s \times a) \in \mathbb{R}^{11} \times \mathbb{R}^{3}$. (f) InvertedDoublePendulum-v2: $(s \times a) \in \mathbb{R}^{6} \times \mathbb{R}^{1}$. (g) Reacher-v2: $(s \times a) \in \mathbb{R}^{11} \times \mathbb{R}^{2}$. (h) Swimmer-v3: $(s \times a) \in \mathbb{R}^{8} \times \mathbb{R}^{2}$.
  • ...and 1 more figure
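The Figure 2 caption reports a mean and a 95% confidence interval over five runs but does not say how the interval is computed. A minimal sketch, assuming a two-sided Student-t interval with four degrees of freedom (critical value about 2.776), is shown below. The input numbers are placeholders chosen for easy arithmetic, not results from the paper.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 4 degrees of freedom.
T_CRIT_DF4 = 2.776

def mean_and_ci95_five_runs(values):
    """Mean and 95% confidence half-width for exactly five independent runs.

    Assumes a Student-t interval: mean +/- t * s / sqrt(n), where s is the
    sample standard deviation (n - 1 in the denominator).
    """
    if len(values) != 5:
        raise ValueError("T_CRIT_DF4 is only valid for five runs")
    mean = statistics.mean(values)
    standard_error = statistics.stdev(values) / math.sqrt(len(values))
    return mean, T_CRIT_DF4 * standard_error

# Placeholder values standing in for one evaluation point across five seeds.
placeholder_returns = [1.0, 2.0, 3.0, 4.0, 5.0]
mean, half_width = mean_and_ci95_five_runs(placeholder_returns)
lower, upper = mean - half_width, mean + half_width
```

Applying this function at every evaluation step would give the solid line (`mean`) and the band edges (`lower`, `upper`) of a training curve. A normal-approximation interval would use 1.96 in place of 2.776 and produce a narrower band.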

Theorems & Definitions (1)

  • Lemma 1: Convergence of $\gamma$-Contraction Mappings