Polynomial, trigonometric, and tropical activations

Ismail Khalfaoui-Hassani, Stefan Kesselheim

TL;DR

This work introduces variance-preserving initializations for learnable activations built on orthonormal bases (Hermite and Fourier) and on tropical polynomials, enabling stable training of deep networks. By deriving closed-form second-moment expressions and balancing the forward and backward gains ($\alpha=\alpha'$), the authors train large models (ConvNeXt on ImageNet, GPT-2 on OpenWebText) without additional clamping mechanisms. With polynomial activations, the whole network can be interpreted as a multivariate polynomial mapping. Hermite interpolation further lets these activations closely match classical ones, in both value and derivative, which is useful when fine-tuning pre-trained models. Tropical activations offer a FLOP-efficient alternative, broadening the choice of activations beyond ReLU and GELU. The activations are implemented in the torchortho library.

Abstract

Which functions can be used as activations in deep neural networks? This article explores families of functions based on orthonormal bases, including the Hermite polynomial basis and the Fourier trigonometric basis, as well as a basis resulting from the tropicalization of a polynomial basis. Our study shows that, through simple variance-preserving initialization and without additional clamping mechanisms, these activations can successfully be used to train deep models, such as GPT-2 for next-token prediction on OpenWebText and ConvNeXt for image classification on ImageNet. Our work addresses the issue of exploding and vanishing activations and gradients, particularly prevalent with polynomial activations, and opens the door for improving the efficiency of large-scale learning tasks. Furthermore, our approach provides insight into the structure of neural networks, revealing that networks with polynomial activations can be interpreted as multivariate polynomial mappings. Finally, using Hermite interpolation, we show that our activations can closely approximate classical ones in pre-trained models by matching both the function and its derivative, making them especially useful for fine-tuning tasks. These activations are available in the torchortho library, which can be accessed via: https://github.com/K-H-Ismail/torchortho.
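The role of orthonormality in the variance-preserving initialization can be illustrated numerically. The following is a minimal sketch, not the torchortho implementation: it assumes the activation is a linear combination of probabilists' Hermite polynomials normalized by $\sqrt{k!}$ and that the input is standard normal. The function names and the coefficient values are hypothetical. Under these assumptions, the second moment of the activation reduces to the sum of the squared coefficients.

```python
import math
import numpy as np
from numpy.polynomial import hermite_e as He

def hermite_activation(x, coeffs):
    """Evaluate F(x) = sum_k coeffs[k] * He_k(x) / sqrt(k!)."""
    # Dividing He_k by sqrt(k!) makes the basis orthonormal under N(0, 1).
    scaled = [c / math.sqrt(math.factorial(k)) for k, c in enumerate(coeffs)]
    return He.hermeval(x, scaled)

def gaussian_second_moment(coeffs, quad_points=32):
    """E[F(x)^2] for x ~ N(0, 1), computed by Gauss-Hermite quadrature."""
    nodes, weights = He.hermegauss(quad_points)
    # hermegauss weights integrate against exp(-x^2 / 2); rescale to a density.
    weights = weights / math.sqrt(2.0 * math.pi)
    return float(np.sum(weights * hermite_activation(nodes, coeffs) ** 2))

# Hypothetical coefficients, chosen only for illustration.
coeffs = np.array([0.0, 0.5, 0.5, 0.5, 0.5])
print(gaussian_second_moment(coeffs), float(np.sum(coeffs ** 2)))
```

The two printed values should agree up to floating-point error, which is why a closed-form second moment is available for activations expressed in an orthonormal basis.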

Paper Structure

This paper contains 35 sections, 16 theorems, 74 equations, 17 figures, 8 tables, 6 algorithms.

Key Result

Theorem 3.8

Variance-preserving coefficient initialization of Hermite activation. Let […]. Then, using this initialization, the forward and backward gains become the same and are equal to: […]
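What it means for forward and backward gains to coincide can be checked numerically. The sketch below rests on several assumptions that are not taken from the theorem statement: the forward gain is taken to be $\mathbb{E}[F(x)^2]$ and the backward gain $\mathbb{E}[F'(x)^2]$ for $x \sim \mathcal{N}(0,1)$, with $F(x) = \sum_k a_k \mathrm{He}_k(x)/\sqrt{k!}$. Under these assumptions the gains are $\sum_k a_k^2$ and $\sum_k k\,a_k^2$ respectively. The balancing rule used at the end (adjusting $a_0$) is only one illustrative way to equalize them and is not claimed to be the initialization of Theorem 3.8.

```python
import math
import numpy as np
from numpy.polynomial import hermite_e as He

def gains(coeffs, quad_points=32):
    """Return (E[F(x)^2], E[F'(x)^2]) for x ~ N(0, 1)."""
    # Series coefficients of F in the unnormalized He_k basis.
    series = np.array(
        [c / math.sqrt(math.factorial(k)) for k, c in enumerate(coeffs)]
    )
    nodes, weights = He.hermegauss(quad_points)
    weights = weights / math.sqrt(2.0 * math.pi)  # Gaussian density weights
    forward = np.sum(weights * He.hermeval(nodes, series) ** 2)
    # hermeder differentiates the series coefficient-wise.
    backward = np.sum(weights * He.hermeval(nodes, He.hermeder(series)) ** 2)
    return float(forward), float(backward)

# Hypothetical starting coefficients: the gains differ (sum a_k^2 vs sum k a_k^2).
a = np.array([0.0, 1.0, 1.0, 1.0])
unbalanced = gains(a)

# Illustrative balancing (an assumption, not the theorem's formula):
# choose a_0 so that a_0^2 = sum_{k>=2} (k - 1) * a_k^2.
k = np.arange(len(a))
balanced_coeffs = a.copy()
balanced_coeffs[0] = math.sqrt(np.sum((k[2:] - 1) * a[2:] ** 2))
balanced = gains(balanced_coeffs)

print(unbalanced, balanced)
```

Because the constant term contributes to the forward gain but not to the derivative, it acts as a free parameter for matching the two quantities in this toy setting.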

Figures (17)

  • Figure 1: Fitting a GELU with a Hermite Activation of degree 3 (left) and of degree 8 (right).
  • Figure 2: Lagrange interpolation (left) and Hermite interpolation (right) of a GELU with a Fourier Activation of degree 6.
  • Figure 3: Decision boundaries for different activation functions
  • Figure 4: A classical MLP (linear + ReLU) vs Basis-MLP (linear + learnable basis function) blocks.
  • Figure 5: Hermite interpolation of a GELU with a Tropical Rational Activation of degree 6 in both the numerator and the denominator.
  • ...and 12 more figures
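For readers unfamiliar with the tropical activations mentioned in Figure 5, the following minimal sketch uses the standard max-plus definitions: a tropical polynomial replaces addition by max and multiplication by addition, so it evaluates to $\max_k (a_k + kx)$, and a tropical rational function is the ordinary difference of two tropical polynomials. The function names are hypothetical and the parametrization is assumed rather than taken from torchortho.

```python
import numpy as np

def tropical_polynomial(x, coeffs):
    """Max-plus polynomial: max_k (coeffs[k] + k * x)."""
    x = np.asarray(x, dtype=float)
    k = np.arange(len(coeffs))
    # Broadcast to shape (..., degree + 1), then take the max over terms.
    terms = np.asarray(coeffs, dtype=float) + k * x[..., None]
    return np.max(terms, axis=-1)

def tropical_rational(x, num_coeffs, den_coeffs):
    """Tropical quotient: numerator minus denominator in ordinary arithmetic."""
    return tropical_polynomial(x, num_coeffs) - tropical_polynomial(x, den_coeffs)

x = np.linspace(-2.0, 2.0, 5)
# With zero coefficients and degree 1 this is max(0, x), i.e. ReLU.
relu_like = tropical_polynomial(x, [0.0, 0.0])
print(relu_like)
```

Each evaluation needs only additions, multiplications by small integers, and a max, which is consistent with the FLOP-efficiency argument made for this family.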

Theorems & Definitions (46)

  • Definition 3.2
  • Definition 3.3
  • Definition 3.5
  • Definition 3.7
  • Theorem 3.8
  • proof
  • Corollary 3.9
  • Remark 3.10
  • Definition 3.12
  • Theorem 3.13
  • ...and 36 more