An Alternative Softmax Operator for Reinforcement Learning
Kavosh Asadi, Michael L. Littman
TL;DR
The paper tackles the instability of Boltzmann softmax control in reinforcement learning by introducing mellowmax (mm_\omega), a differentiable, non-expansion softmax operator that interpolates between max and mean and has a principled, information-theoretic motivation. It proves key properties of mm_\omega, including non-expansion, differentiability, and limit behaviors that recover max and mean, and derives a maximum-entropy mellowmax policy, which takes the form of a Boltzmann policy with a state-dependent temperature. A convergent SARSA variant using this policy is developed, and experiments on random MDPs, a taxi domain, and Lunar Lander show improved stability and competitive performance relative to Boltzmann softmax. The work suggests mellowmax as a robust alternative for both planning and on-policy learning, with potential extensions to function approximation and inverse RL.
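As a minimal sketch of the operator summarized above, the snippet below assumes the usual log-mean-exp form of mellowmax, mm_\omega(x) = log((1/n) sum_i exp(\omega x_i)) / \omega, which this summary does not spell out. The function name, the max-shift used for numerical stability, and the sample values are illustrative choices, not taken from the source.

```python
import math


def mellowmax(values, omega):
    """Assumed form: log(mean(exp(omega * x_i))) / omega, for omega != 0."""
    if omega == 0:
        raise ValueError("omega must be non-zero; the omega -> 0 limit is the mean")
    scaled = [omega * v for v in values]
    shift = max(scaled)  # subtract the largest exponent to avoid overflow
    mean_exp = sum(math.exp(s - shift) for s in scaled) / len(scaled)
    return (shift + math.log(mean_exp)) / omega


values = [1.0, 2.0, 3.0]

# Large omega pushes the result toward max(values);
# omega close to zero pushes it toward the arithmetic mean.
for omega in (1e-6, 1.0, 10.0, 1000.0):
    print(omega, mellowmax(values, omega))
```

The loop is only meant to make the limit behavior visible: the printed value should move from roughly the mean of the inputs toward their maximum as `omega` grows.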
Abstract
A softmax operator applied to a set of values acts somewhat like the maximization function and somewhat like an average. In sequential decision making, softmax is often used in settings where it is necessary to maximize utility but also to hedge against problems that arise from putting all of one's weight behind a single maximum-utility decision. The Boltzmann softmax operator is the most commonly used softmax operator in this setting, but we show that this operator is prone to misbehavior. In this work, we study a differentiable softmax operator that, among other properties, is a non-expansion, ensuring convergent behavior in learning and planning. We introduce a variant of the SARSA algorithm that, by utilizing the new operator, computes a Boltzmann policy with a state-dependent temperature parameter. We show that the algorithm is convergent and that it performs favorably in practice.
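The "Boltzmann policy with a state-dependent temperature" can be illustrated with a rough sketch. It assumes the policy for a state is a Boltzmann distribution exp(\beta Q(a)) / Z whose inverse temperature \beta is chosen so that the policy's expected action value equals the mellowmax of that state's action values, i.e. \beta solves sum_a exp(\beta (Q(a) - mm)) (Q(a) - mm) = 0. Plain bisection is used here as a simple stand-in root finder; the function names, tolerances, and the restriction to \omega > 0 are assumptions of this sketch.

```python
import math


def mellowmax(q_values, omega):
    """Assumed form: log(mean(exp(omega * q))) / omega."""
    scaled = [omega * q for q in q_values]
    shift = max(scaled)
    mean_exp = sum(math.exp(s - shift) for s in scaled) / len(scaled)
    return (shift + math.log(mean_exp)) / omega


def mellowmax_policy(q_values, omega, tol=1e-12, max_iter=200):
    """Return (policy, beta) for one state's action values (hypothetical helper)."""
    if omega <= 0:
        raise ValueError("this sketch only handles omega > 0")
    n = len(q_values)
    if max(q_values) == min(q_values):
        return [1.0 / n] * n, 0.0  # all actions tie: uniform policy

    target = mellowmax(q_values, omega)
    diffs = [q - target for q in q_values]

    def f(beta):
        # Increasing in beta; f(0) <= 0 because mellowmax >= mean for omega > 0.
        return sum(math.exp(beta * d) * d for d in diffs)

    lo, hi = 0.0, 1.0
    for _ in range(60):  # grow the bracket until the sign changes
        if f(hi) >= 0:
            break
        hi *= 2.0
    else:
        raise RuntimeError("could not bracket the root")

    for _ in range(max_iter):  # bisection on [lo, hi]
        mid = 0.5 * (lo + hi)
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    beta = 0.5 * (lo + hi)

    # Boltzmann policy at the solved inverse temperature (shifted for stability).
    top = max(q_values)
    weights = [math.exp(beta * (q - top)) for q in q_values]
    z = sum(weights)
    return [w / z for w in weights], beta


q = [1.0, 2.0, 3.0]
policy, beta = mellowmax_policy(q, omega=2.0)
expected_q = sum(p * v for p, v in zip(policy, q))
print(policy, beta, expected_q, mellowmax(q, 2.0))
```

Because `beta` is recomputed from each state's action values, the temperature varies across states even though `omega` is fixed. In this example the printed expected value should agree with the mellowmax value up to the bisection tolerance.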
