Hadamard Representations: Augmenting Hyperbolic Tangents in RL

Jacob E. Kooi, Mark Hoogendoorn, Vincent François-Lavet

TL;DR

The paper investigates dying neurons in RL and shows that continuously differentiable activations like tanh suffer from saturation similar to ReLU, limiting performance. It introduces Hadamard representations, where a hidden layer is the Hadamard product of two parallel activations, to preserve gradient flow and reduce dead neurons for tanh. Empirical results across DQN, PPO, and PQN on Atari demonstrate faster learning, higher effective rank, and substantial performance gains with tanh HR, while HR does not help ReLU. The work highlights a path to leverage smooth activations in RL and suggests broader exploration of Hadamard-style architectures and activation combos, albeit with increased parameter counts and activation-dependent effects.
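The Hadamard representation described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes NumPy, and the function name `hadamard_layer`, the layer sizes, and the random initialisation are all hypothetical choices made for the example.

```python
import numpy as np

def hadamard_layer(x, w1, b1, w2, b2):
    """Hidden layer formed as the elementwise product of two parallel tanh branches."""
    branch_a = np.tanh(x @ w1 + b1)  # first independently parameterized branch
    branch_b = np.tanh(x @ w2 + b2)  # second independently parameterized branch
    return branch_a * branch_b       # Hadamard (elementwise) product

rng = np.random.default_rng(0)
batch, d_in, d_hidden = 4, 8, 16     # hypothetical sizes, chosen for illustration
x = rng.normal(size=(batch, d_in))
w1, w2 = rng.normal(size=(2, d_in, d_hidden))  # two separate weight matrices
b1 = np.zeros(d_hidden)
b2 = np.zeros(d_hidden)

z = hadamard_layer(x, w1, b1, w2, b2)  # shape (batch, d_hidden), values in [-1, 1]
```

By the product rule, the derivative of $f \cdot g$ is $f'g + fg'$, so if one branch saturates ($f' \approx 0$, $|f| \approx 1$) the term $fg'$ can still carry gradient through the other branch. A unit of the product is pinned at $\pm 1$ only when both branches saturate.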

Abstract

Activation functions are one of the key components of a deep neural network. The most commonly used activation functions can be classed into the category of continuously differentiable (e.g. tanh) and piece-wise linear functions (e.g. ReLU), both having their own strengths and drawbacks with respect to downstream performance and representation capacity through learning. In reinforcement learning, the performance of continuously differentiable activations often falls short as compared to piece-wise linear functions. We show that the dying neuron problem in RL is not exclusive to ReLUs and actually leads to additional problems in the case of continuously differentiable activations such as tanh. To alleviate the dying neuron problem with these activations, we propose a Hadamard representation that unlocks the advantages of continuously differentiable activations. Using DQN, PPO and PQN in the Atari domain, we show faster learning, a reduction in dead neurons and increased effective rank.

Paper Structure

This paper contains 27 sections, 1 theorem, 15 equations, 25 figures, 6 tables.

Key Result

Theorem 4.1

When any set of neurons $\alpha^{j}$ in a hidden layer $z^{j}$ collapses into nonzero values, the output to the next layer effectively changes from ($A^{j}z^{j} +B^{j}$) to ($A_{-*}^{j}z_{i-*}^{j} +B^{j+1}$ + $A_{*}^{j}z_{i*}^{j}$), where $A_{-*}^{j}z_{-*}^{j}$ represent the active neurons multiplie

Figures (25)

  • Figure 1: Median Human-Normalized performance training PQN in the Atari domain where the activation function of the hidden layers is changed. In PQN, all activations are already layer-normalized. A massive performance discrepancy can be observed when selecting different activation functions. Notably, in Atari, the application of a Hadamard representation with hyperbolic tangent leads to over 100% performance gains. The Hadamard representation is not suitable for the ReLU activation, as it amplifies dying neurons by taking the product of sparse activations.
  • Figure 2: A regression of three shallow neural network architectures on a random complex sinusoidal function ($y = 10\sin(7x) + 15\sin(10x) + 5\cos(5x)$). The Tanh (HR) network emerges as the strongest function approximator, even while having fewer trainable parameters (501 vs. 601 for Tanh and ReLU). To make a fair comparison, the Tanh and ReLU networks have a single hidden layer of size 200, while the Tanh (HR) network has a hidden layer of size 100. For the Tanh (HR) network, however, we use two parallel linear layers preceding the hidden layer so that the single hidden layer can be formed as the Hadamard product of two activations (see Section \ref{sec:algo}). For experiments comparing deeper networks, we refer the reader to Appendix \ref{app:function_approximation}.
  • Figure 3: Kernel Density Estimations (KDE) over a subset of 16 neurons in the compressed representation $z_{t}$ after training DQN (Mnih et al., 2015) in the Breakout environment using a hyperbolic tangent activation for $z_{t}$. Each neuron represents one dimension of the representation $z_{t} \in \mathbb{R}^{512}$. Red outlines represent dying neurons, where a near-infinite density spike occurs at either 1 or -1.
  • Figure 4: Training DQN for 10M iterations (40M frames) with a hyperbolic tangent activation in the Seaquest environment. The average contribution to the Q-values of the live and dead neurons in the final hidden layer is shown. If a neuron dies, it retains the same value for any input observation, but the product of its nonzero saturation value and its outgoing weights acts as a substantial 'hidden' bias on the Q-values.
  • Figure 5: A visualisation of the Hadamard representation. Horizontal bars represent weight vectors and $\boldsymbol{z}_{t}$ represents a hidden layer. Between each hidden layer, two parallel independently parameterized activation layers are formed, where the Hadamard product of the two activation layers represents the actual propagated hidden layer.
  • ...and 20 more figures
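The parameter counts quoted in the Figure 2 caption (501 vs. 601) can be reproduced with simple arithmetic. The sketch below assumes a scalar input, a scalar output, and a bias term on every linear layer; these assumptions are not stated in the caption, but they are consistent with the numbers given. The helper names are hypothetical.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

def mlp_params(hidden):
    """Standard network: input -> hidden -> output (scalar in, scalar out assumed)."""
    return linear_params(1, hidden) + linear_params(hidden, 1)

def hadamard_mlp_params(hidden):
    """HR network: two parallel input layers feed one hidden layer, then the output layer."""
    return 2 * linear_params(1, hidden) + linear_params(hidden, 1)

baseline_count = mlp_params(200)         # Tanh or ReLU network, hidden size 200
hadamard_count = hadamard_mlp_params(100)  # Tanh (HR) network, hidden size 100
print(baseline_count, hadamard_count)
```

Under these assumptions the baseline has 200 + 200 + 200 + 1 = 601 parameters, and the HR network has 2 × (100 + 100) + 100 + 1 = 501.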

Theorems & Definitions (2)

  • Theorem 4.1
  • proof
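The 'hidden bias' effect described in the Figure 4 caption and in Theorem 4.1 can be checked numerically. The following is a rough sketch rather than the paper's code: it assumes NumPy, and the layer sizes, the choice of which neurons are dead, and all variable names are hypothetical. Dead tanh neurons are modelled as units stuck at a fixed saturation value regardless of input.

```python
import numpy as np

rng = np.random.default_rng(0)
n_hidden, n_out = 6, 3                      # hypothetical layer sizes
A = rng.normal(size=(n_out, n_hidden))      # outgoing weights of the hidden layer
B = rng.normal(size=n_out)                  # explicit bias of the next layer

# Assumed for illustration: neurons 1 and 4 are dead, stuck at +1 and -1.
dead = np.array([False, True, False, False, True, False])
saturation = np.array([1.0, -1.0])

def hidden_layer(pre_activation):
    """tanh layer in which dead neurons ignore their input."""
    z = np.tanh(pre_activation)
    z[dead] = saturation
    return z

# The dead neurons contribute the same vector for every input,
# so they fold into the bias of the next layer.
live_weights = A[:, ~dead]
effective_bias = B + A[:, dead] @ saturation

z1 = hidden_layer(rng.normal(size=n_hidden))
z2 = hidden_layer(rng.normal(size=n_hidden))
out1 = A @ z1 + B
out2 = A @ z2 + B
```

For any input, `A @ z + B` equals `live_weights @ z[~dead] + effective_bias`: the layer behaves like a smaller layer of live neurons with a shifted bias.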