Rational Neural Networks have Expressivity Advantages

Maosen Tang, Alex Townsend

TL;DR

The paper shows that trainable low-degree rational activations confer expressivity and parameter-efficiency advantages over common fixed activations. By leveraging classical rational approximation theory, the authors prove that rational networks can approximate non-smooth and sharply varying targets with far fewer parameters than smooth-activation networks, and extend these results from scalars to full architectures, including transformers. They provide constructive proofs and rigorous size bounds, demonstrating both forward (rational approximating smooth) and reverse (smooth approximating rational) directions, with extensions to gated activations and normalization considerations. Empirically, rational activations improve performance and convergence on CIFAR-10, offline MuJoCo RL, and Tiny ImageNet across multiple model families, while showing nuanced interactions with normalization that can motivate normalization-free variants. The work suggests rational activations as a principled, efficient foundation for modern deep learning, particularly in regimes with non-smooth structure or multiscale dynamics.

Abstract

We study neural networks with trainable low-degree rational activation functions and show that they are more expressive and parameter-efficient than modern piecewise-linear and smooth activations such as ELU, LeakyReLU, LogSigmoid, PReLU, ReLU, SELU, CELU, Sigmoid, SiLU, Mish, Softplus, Tanh, Softmin, Softmax, and LogSoftmax. For an error target of $\varepsilon>0$, we establish approximation-theoretic separations: Any network built from standard fixed activations can be uniformly approximated on compact domains by a rational-activation network with only $\mathrm{poly}(\log\log(1/\varepsilon))$ overhead in size, while the converse provably requires $\Omega(\log(1/\varepsilon))$ parameters in the worst case. This exponential gap persists at the level of full networks and extends to gated activations and transformer-style nonlinearities. In practice, rational activations integrate seamlessly into standard architectures and training pipelines, allowing rationals to match or outperform fixed activations under identical architectures and optimizers.
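As a rough illustration of what a trainable low-degree rational activation can look like, the sketch below evaluates a scalar function $P(x)/Q(x)$ whose coefficients would be the trainable parameters. Everything specific here is an assumption made for illustration and is not taken from the paper: the class name `RationalActivation`, the choice of a cubic numerator over a quadratic denominator, and the pole-free denominator form $Q(x) = 1 + |b_1 x + b_2 x^2|$.

```python
from dataclasses import dataclass, field
from typing import List


def horner(coeffs: List[float], x: float) -> float:
    """Evaluate a polynomial with coefficients in ascending order (Horner's rule)."""
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc


@dataclass
class RationalActivation:
    """Hypothetical scalar rational activation P(x) / Q(x).

    Assumed form (not from the paper):
      P(x) = a0 + a1*x + a2*x^2 + a3*x^3
      Q(x) = 1 + |b1*x + b2*x^2|   (always >= 1, so there are no poles)
    The default coefficients give the identity map.
    """
    numerator: List[float] = field(default_factory=lambda: [0.0, 1.0, 0.0, 0.0])
    denominator: List[float] = field(default_factory=lambda: [0.0, 0.0])

    def __call__(self, x: float) -> float:
        p = horner(self.numerator, x)
        # Prepend 0.0 so the denominator polynomial has no constant term.
        q = 1.0 + abs(horner([0.0] + self.denominator, x))
        return p / q
```

With the default coefficients the activation is the identity; setting `numerator=[0.0, 1.0, 0.0, 0.0]` and `denominator=[0.0, 1.0]` gives $x/(1+x^2)$, a bounded non-monotone shape. In a real training setup the coefficient lists would be registered as learnable parameters of the framework in use.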

Paper Structure

This paper contains 23 sections, 11 theorems, 136 equations, 13 figures, and 10 tables.

Key Result

Theorem 3.1

For any $0<\varepsilon<1$, there exists a rational neural network $R:[-1,1]\to[-1,1]$ of size such that Moreover, no rational neural network of size $o(\log\log(1/\varepsilon))$ can achieve this accuracy.
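One classical way to see where a $\log\log(1/\varepsilon)$ scaling can come from is that composing a low-degree rational map with itself may roughly square the error at every step. The sketch below is not the paper's construction; it uses Newton's iteration for the square root, whose update $x \mapsto (x^2 + a)/(2x)$ is a rational function of $x$, purely as an assumed stand-in. The function names are hypothetical.

```python
import math
from typing import List


def newton_sqrt_step(x: float, a: float) -> float:
    """One Newton step for x^2 = a, written as the rational map (x^2 + a) / (2x)."""
    return (x * x + a) / (2.0 * x)


def composition_errors(a: float, steps: int, x0: float = 1.0) -> List[float]:
    """Absolute error |x_k - sqrt(a)| after each of `steps` compositions of the map."""
    target = math.sqrt(a)
    x = x0
    errors = []
    for _ in range(steps):
        x = newton_sqrt_step(x, a)
        errors.append(abs(x - target))
    return errors


if __name__ == "__main__":
    for k, err in enumerate(composition_errors(2.0, 5), start=1):
        print(f"compositions={k}  error={err:.3e}")
```

Because each composition approximately squares the error until machine precision is reached, reaching accuracy $\varepsilon$ in this toy setting takes on the order of $\log\log(1/\varepsilon)$ compositions, which matches the flavor of the size bounds quoted here.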

Figures (13)

  • Figure 1: CIFAR-10 test-accuracy curves for VGG4 in the plain setting (left) and the boosted setting without GroupNorm (right). Curves show the mean across five seeds with $\pm 1$ standard-deviation shading. Rational converges earlier and reaches a higher final accuracy than the fixed activations in both settings.
  • Figure 2: Example learned Rational shape in VGG8 (representative feature-layer snapshot). Curves show the learned scalar nonlinearity at epochs 0, 5, 30, and 60; the light shaded overlay is the empirical density of the corresponding layer's pre-activation inputs estimated from held-out mini-batches and plotted on a secondary axis. The Rational quickly reallocates curvature to the high-density input region and develops a localized non-monotone gating shape.
  • Figure 3: Offline RL learning curves under Gymnasium v5 evaluation. Left: IQL on HalfCheetah-medium. Right: TD3+BC on Walker2d-medium. Curves plot v5-normalized score versus gradient updates in MuJoCo using Minari datasets. Lines show mean across five seeds with one standard deviation shading.
  • Figure 4: Example learned Rational shape in offline RL. Representative Rational layer from the IQL actor on HalfCheetah-medium with Rational initialized to ReLU. Curves show the learned scalar nonlinearity at selected training checkpoints and the light shaded overlay is the empirical density of the corresponding layer pre-activation inputs estimated from offline mini-batches. Supplementary material provides broader layerwise snapshots across IQL networks.
  • Figure 5: EMA validation top-1 accuracy versus epoch on Tiny-ImageNet for CaiT-S24, Swin-Tiny, and ViT-Small under different activations. The inset zooms into the final epochs to highlight late-training differences.
  • ...and 8 more figures

Theorems & Definitions (18)

  • Theorem 3.1: Rational approximation of GELU
  • Theorem 3.2: Lower bound for GELU approximation of rationals
  • Theorem 3.3: GELU approximation of rational functions
  • Theorem 3.4: Network-level approximation equivalence
  • Lemma 1.1: Composite $p$th root on a fixed interval
  • proof
  • Lemma 1.2: AGM-theta approximation of $\log$ away from $1$
  • proof
  • Lemma 1.3: Halley-based approximation of $\tanh$ on a small interval
  • proof
  • ...and 8 more