The Resurrection of the ReLU

Coşku Can Horuz, Geoffrey Kasenbacher, Saya Higuchi, Sebastian Kairat, Jendrik Stoltz, Moritz Pesl, Bernhard A. Moser, Christoph Linse, Thomas Martinetz, Sebastian Otte

TL;DR

This work tackles the dying ReLU problem by introducing SUGAR, a plug-and-play regularizer that keeps ReLU in the forward pass while replacing its backward gradient with a smooth surrogate via Forward Gradient Injection. By designing two surrogates, B-SiLU and NeLU, and evaluating across architectures from VGG-16/ResNet-18 to Conv2NeXt and Swin Transformer, the approach yields improved generalization and sparser activations, effectively reviving dead neurons. The results show substantial gains on CIFAR benchmarks and competitive performance on modern vision models, while offering insights into activation distributions and loss landscapes. The work suggests that well-chosen surrogate gradients can enhance ReLU-based networks without abandoning the simplicity and sparsity advantages of ReLU, with potential implications for efficiency and regularization, though it acknowledges limitations and domain-specific effects.

Abstract

Modeling sophisticated activation functions within deep learning architectures has evolved into a distinct research direction. Functions such as GELU, SELU, and SiLU offer smooth gradients and improved convergence properties, making them popular choices in state-of-the-art models. Despite this trend, the classical ReLU remains appealing due to its simplicity, inherent sparsity, and other advantageous topological characteristics. However, ReLU units are prone to becoming irreversibly inactive - a phenomenon known as the dying ReLU problem - which limits their overall effectiveness. In this work, we introduce surrogate gradient learning for ReLU (SUGAR) as a novel, plug-and-play regularizer for deep architectures. SUGAR preserves the standard ReLU function during the forward pass but replaces its derivative in the backward pass with a smooth surrogate that avoids zeroing out gradients. We demonstrate that SUGAR, when paired with a well-chosen surrogate function, substantially enhances generalization performance on convolutional network architectures such as VGG-16 and ResNet-18, providing sparser activations while effectively resurrecting dead ReLUs. Moreover, we show that even in modern architectures like Conv2NeXt and Swin Transformer - which typically employ GELU - replacing GELU with SUGAR yields competitive and even slightly superior performance. These findings challenge the prevailing notion that advanced activation functions are necessary for optimal performance. Instead, they suggest that the conventional ReLU, particularly with appropriate gradient handling, can serve as a strong, versatile revived classic across a broad range of deep learning vision models.
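
The core mechanism described in the abstract (ReLU in the forward pass, a smooth surrogate derivative in the backward pass) can be illustrated with a minimal, framework-free sketch for a single scalar unit. This is not the authors' implementation. The surrogate used here is the derivative of plain SiLU, chosen as a stand-in because the definitions of B-SiLU and NeLU are not given in this excerpt. All function names (`unit_forward`, `weight_grad`, and so on) are hypothetical. In an autodiff framework the same effect would presumably be obtained through Forward Gradient Injection rather than a hand-written backward pass.

```python
import math


def relu(z):
    """Forward activation: standard ReLU."""
    return max(0.0, z)


def relu_grad(z):
    """Exact ReLU derivative: zero for every non-positive pre-activation."""
    return 1.0 if z > 0 else 0.0


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def silu_grad(z):
    """Derivative of SiLU(z) = z * sigmoid(z).

    Assumption: used only as a stand-in surrogate; the paper's
    B-SiLU and NeLU surrogates are not defined in this excerpt.
    """
    s = sigmoid(z)
    return s * (1.0 + z * (1.0 - s))


def unit_forward(w, b, x):
    """One scalar unit: pre-activation z = w*x + b, output relu(z)."""
    z = w * x + b
    return z, relu(z)


def weight_grad(w, b, x, upstream, act_grad):
    """dL/dw by the chain rule, with a pluggable activation derivative.

    Passing relu_grad gives ordinary backpropagation; passing a smooth
    surrogate gives the surrogate-gradient variant. The forward value
    is relu(z) in both cases.
    """
    z, _ = unit_forward(w, b, x)
    return upstream * act_grad(z) * x


# A unit whose pre-activation is negative for this input ("dead" here).
w, b, x = -1.0, -0.5, 2.0
z, y = unit_forward(w, b, x)

g_relu = weight_grad(w, b, x, upstream=1.0, act_grad=relu_grad)
g_surrogate = weight_grad(w, b, x, upstream=1.0, act_grad=silu_grad)

print("pre-activation:", z)
print("output (ReLU forward):", y)
print("dL/dw with ReLU derivative:", g_relu)
print("dL/dw with surrogate derivative:", g_surrogate)
```

With the exact ReLU derivative the weight receives no gradient while the unit is inactive, so it cannot recover. With the surrogate derivative the output is still exactly zero (sparsity is preserved), but the weight receives a nonzero update signal.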

Paper Structure

This paper contains 27 sections, 8 equations, 21 figures, 5 tables.

Figures (21)

  • Figure 1: Comparison of activation functions and their derivatives. From left to right: ReLU (dashed) and its derivative (black), ReLU activation function with B-SiLU derivative (blue) and ReLU activation function with NeLU derivative (red).
  • Figure 2: The plots show the validation loss of ResNet-18 on CIFAR-100 with and without SUGAR. In the legend, the corresponding test accuracies (of the respective functions as surrogates) are given for completeness. See the appendix (training curves) for all the convergence plots from the experiments.
  • Figure 3: Test accuracy of VGG-16 on CIFAR-100, comparing non-SUGAR (red) and SUGAR (blue) for each activation function. The black bar represents the baseline, where the model is simply trained with ReLU (forward and backward). See the appendix (bar accuracies) for all the accuracy plots from the experiments.
  • Figure 5: Loss landscapes visualized around the trained solution using different gradient flows. SUGAR (B-SiLU) smooths the optimization surface while retaining the ReLU forward pass, leading to a more stable geometry.
  • Figure 6: Exemplary comparison of predictions. Left plot shows the prediction of the plain ReLU network whereas SUGAR with B-SiLU is applied on the right.
  • ...and 16 more figures