Decoupling regularization from the action space

Sobhan Mohammadpour, Emma Frejinger, Pierre-Luc Bacon

TL;DR

This work shows that entropy-based regularization in RL is not invariant to action-space size, leading to over-regularization as the number of actions grows. It introduces decoupled regularizers with constant range and two temperature schemes (static and dynamic) to remove this dependence, enabling scale-invariant regularized MDPs. The approach is instantiated as Decoupled SQL and extended with automatic temperature rules that tie target entropy to the regularizer range, improving stability and performance on the DeepMind Control Suite and a drug-design MDP with GFlowNets. Overall, the method enhances robustness to state-dependent action spaces and has practical impact for molecular design tasks and other domains requiring flexible action sets.

Abstract

Regularized reinforcement learning (RL), particularly the entropy-regularized kind, has gained traction in optimal control and inverse RL. While standard unregularized RL methods remain unaffected by changes in the number of actions, we show that such changes can severely impact their regularized counterparts. This paper demonstrates the importance of decoupling the regularizer from the action space: that is, maintaining a consistent level of regularization regardless of how many actions are involved, to avoid over-regularization. Although the problem can be avoided by introducing a task-specific temperature parameter, doing so is often undesirable and cannot solve the problem when action spaces are state-dependent. In the state-dependent action context, different states with varying action spaces are regularized inconsistently. We introduce two solutions, a static temperature selection approach and a dynamic counterpart, that are universally applicable where this problem arises. Implementing these changes improves performance on the DeepMind control suite in both static and dynamic temperature regimes, and on a biological sequence design task.
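The dependence on action-space size can be illustrated with a small sketch. It assumes Shannon entropy as the regularizer, whose maximum over $n$ actions is $\log n$, and it uses a hypothetical static rule that divides the temperature by that range. The function names and the rule itself are illustrative assumptions, not necessarily the paper's exact scheme; the value $\tau = 0.25$ is borrowed from the Figure 3 caption only as a convenient example.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_range(num_actions):
    """Gap between the largest (uniform) and smallest (deterministic) entropy."""
    return math.log(num_actions)


def decoupled_temperature(tau, num_actions):
    """Hypothetical static rule: rescale tau so the largest bonus equals tau."""
    if num_actions < 2:
        # With a single action the entropy is always zero, so tau is irrelevant.
        return tau
    return tau / entropy_range(num_actions)


tau = 0.25
for n in (2, 10, 100, 1000):
    uniform = [1.0 / n] * n
    standard_bonus = tau * entropy(uniform)  # grows like tau * log(n)
    decoupled_bonus = decoupled_temperature(tau, n) * entropy(uniform)  # stays at tau
    print(f"n={n:5d}  standard={standard_bonus:.4f}  decoupled={decoupled_bonus:.4f}")
```

With a fixed temperature, the largest attainable entropy bonus grows with the number of actions, whereas under the rescaled temperature it stays at $\tau$ for every $n$. Because the rescaling depends only on the number of actions available, it can be applied per state when action spaces are state-dependent.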

Paper Structure

This paper contains 13 sections, 5 theorems, 5 equations, 33 figures, 1 table, and 2 algorithms.

Key Result

Lemma 1

Under Assumption 2, the supremum of the regularizer is equal to the limit of the regularizer at a deterministic distribution (i.e., exactly one action has non-zero probability and all others have zero probability).
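A quick numerical check of this statement is sketched below for one concrete case. It takes the regularizer to be negative Shannon entropy, $\Omega(\pi) = \sum_a \pi(a) \log \pi(a)$; this choice and its sign convention are assumptions made for illustration, since the assumption the lemma relies on is not reproduced here. All helper names are hypothetical.

```python
import math
import random


def neg_entropy(probs):
    """Negative Shannon entropy: sum of p * log(p), with 0 * log(0) taken as 0."""
    return sum(p * math.log(p) for p in probs if p > 0.0)


def near_deterministic(num_actions, eps):
    """Put mass 1 - eps on one action and spread eps over the rest."""
    rest = eps / (num_actions - 1)
    return [1.0 - eps] + [rest] * (num_actions - 1)


def random_distribution(num_actions, rng):
    """Sample a distribution by normalizing positive random weights."""
    weights = [rng.expovariate(1.0) for _ in range(num_actions)]
    total = sum(weights)
    return [w / total for w in weights]


rng = random.Random(0)
num_actions = 5

# Largest regularizer value seen over many random distributions.
best_random = max(
    neg_entropy(random_distribution(num_actions, rng)) for _ in range(10_000)
)

# Values along a sequence approaching a deterministic distribution.
limit_values = [
    neg_entropy(near_deterministic(num_actions, eps)) for eps in (1e-1, 1e-3, 1e-6)
]

print("largest value over random distributions:", best_random)
print("values approaching a deterministic distribution:", limit_values)
print("value at the uniform distribution:", neg_entropy([1.0 / num_actions] * num_actions))
```

In this example no sampled distribution exceeds the value approached at the deterministic limit (zero), and the uniform distribution attains $-\log n$, so the regularizer's range is $\log n$ and grows with the number of actions.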

Figures (33)

  • Figure 1: Toy MDPs
  • Figure 2: Episode length at different dimensions of the hypergrid problem.
  • Figure 3: Test reward on the DMC benchmark with $\tau=0.25$. The x-axis is the number of iterations divided by $10^6$.
  • Figure 4: Test reward on the DMC benchmark with automatic temperature. The x-axis is the number of iterations divided by $10^6$.
  • Figure 5: The left plot shows the median reward of each batch; the right plot shows the number of modes found. The shaded area shows the interquartile range, and the heavy line shows the interquartile mean.
  • ...and 28 more figures

Theorems & Definitions (19)

  • Definition 1
  • Example 1
  • proof
  • Example 2
  • proof
  • Definition 2
  • Definition 3
  • Lemma 1
  • proof
  • Lemma 2
  • ...and 9 more