An Alternative Softmax Operator for Reinforcement Learning

Kavosh Asadi, Michael L. Littman

TL;DR

The paper tackles instability in Boltzmann softmax control within reinforcement learning by introducing mellowmax, a differentiable, non-expansion softmax operator that interpolates between max and mean and has a principled, information-theoretic motivation. It proves key properties of $\mathrm{mm}_\omega$, including non-expansion, differentiability, and limit behaviors that recover max and mean, and derives a maximum-entropy mellowmax policy with a state-dependent temperature. A convergent SARSA variant using this policy is developed, and empirical results across random MDPs, a taxi domain, and Lunar Lander demonstrate improved stability and competitive performance relative to Boltzmann. The work suggests mellowmax as a robust alternative for both planning and on-policy learning, with potential extensions to function approximation and inverse RL.

Abstract

A softmax operator applied to a set of values acts somewhat like the maximization function and somewhat like an average. In sequential decision making, softmax is often used in settings where it is necessary to maximize utility but also to hedge against problems that arise from putting all of one's weight behind a single maximum utility decision. The Boltzmann softmax operator is the most commonly used softmax operator in this setting, but we show that this operator is prone to misbehavior. In this work, we study a differentiable softmax operator that, among other properties, is a non-expansion, ensuring convergent behavior in learning and planning. We introduce a variant of the SARSA algorithm that, by utilizing the new operator, computes a Boltzmann policy with a state-dependent temperature parameter. We show that the algorithm is convergent and that it performs favorably in practice.
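
To make the contrast between the two operators concrete, the following is a minimal Python sketch, not the authors' code. It assumes the usual definitions: the Boltzmann operator $\sum_i x_i e^{\beta x_i} / \sum_i e^{\beta x_i}$ and mellowmax $\mathrm{mm}_\omega(x) = \log\big(\tfrac{1}{n}\sum_i e^{\omega x_i}\big)/\omega$. The function names `boltzmann` and `mellowmax` and the test vectors are illustrative choices, and the exponent shift is only a numerical-stability device.

```python
import math


def boltzmann(values, beta):
    """Boltzmann softmax: exp(beta * x)-weighted average of the values."""
    # Subtract the largest exponent so math.exp never overflows.
    shift = max(beta * v for v in values)
    weights = [math.exp(beta * v - shift) for v in values]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def mellowmax(values, omega):
    """Mellowmax: log(mean(exp(omega * x))) / omega, for omega != 0."""
    shift = max(omega * v for v in values)
    mean_exp = sum(math.exp(omega * v - shift) for v in values) / len(values)
    return (shift + math.log(mean_exp)) / omega


# Limit behavior: large omega approaches max, small omega approaches mean.
vals = [1.0, 2.0, 3.0]
print("mellowmax, omega=1000:", mellowmax(vals, 1000.0))
print("mellowmax, omega=1e-3:", mellowmax(vals, 1e-3))

# Non-expansion check on one illustrative pair of inputs:
# an operator f is a non-expansion if |f(x) - f(y)| <= max_i |x_i - y_i|.
x, y = [0.0, 2.0], [0.0, 2.1]
input_gap = max(abs(a - b) for a, b in zip(x, y))
boltz_gap = abs(boltzmann(x, 1.0) - boltzmann(y, 1.0))
mm_gap = abs(mellowmax(x, 1.0) - mellowmax(y, 1.0))
print("input gap:", input_gap)
print("Boltzmann output gap:", boltz_gap)
print("mellowmax output gap:", mm_gap)
```

On this particular pair the Boltzmann output gap exceeds the input gap while the mellowmax gap does not. A single pair only illustrates the property; it is not a substitute for the paper's proof that mellowmax is a non-expansion.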

Paper Structure

This paper contains 17 sections, 25 equations, 11 figures, 2 algorithms.

Figures (11)

  • Figure 1:
  • Figure 2: Values estimated by SARSA with Boltzmann softmax. The algorithm never achieves stable values.
  • Figure 3: $\max$ is a non-expansion under the infinity norm.
  • Figure 4: Fixed points of GVI under $\hbox{\rm boltz}_\beta$ for varying $\beta$. Two distinct fixed points (red and blue) co-exist for a range of $\beta$.
  • Figure 5: A vector field showing GVI updates under $\hbox{\rm boltz}_{\beta=16.55}$. Fixed points are marked in black. For some points, such as the large blue point, updates can move the current estimates farther from the fixed points. Also, for points that lie between the two fixed points, progress is extremely slow.
  • ...and 6 more figures