Rational Neural Networks have Expressivity Advantages
Maosen Tang, Alex Townsend
TL;DR
The paper shows that trainable low-degree rational activations confer expressivity and parameter-efficiency advantages over common fixed activations. By leveraging classical rational approximation theory, the authors prove that rational networks can approximate non-smooth and sharply varying targets with far fewer parameters than smooth-activation networks, and extend these results from scalar activation functions to full architectures, including transformers. They provide constructive proofs and rigorous size bounds covering both the forward (rational approximating smooth) and reverse (smooth approximating rational) directions, with extensions to gated activations and a treatment of normalization. Empirically, rational activations improve performance and convergence on CIFAR-10, offline MuJoCo RL, and Tiny ImageNet across multiple model families, while showing nuanced interactions with normalization that can motivate normalization-free variants. The work suggests rational activations as a principled, efficient foundation for modern deep learning, particularly in regimes with non-smooth structure or multiscale dynamics.
Abstract
We study neural networks with trainable low-degree rational activation functions and show that they are more expressive and parameter-efficient than modern piecewise-linear and smooth activations such as ELU, LeakyReLU, LogSigmoid, PReLU, ReLU, SELU, CELU, Sigmoid, SiLU, Mish, Softplus, Tanh, Softmin, Softmax, and LogSoftmax. For an error target of $\varepsilon>0$, we establish approximation-theoretic separations: Any network built from standard fixed activations can be uniformly approximated on compact domains by a rational-activation network with only $\mathrm{poly}(\log\log(1/\varepsilon))$ overhead in size, while the converse provably requires $\Omega(\log(1/\varepsilon))$ parameters in the worst case. This exponential gap persists at the level of full networks and extends to gated activations and transformer-style nonlinearities. In practice, rational activations integrate seamlessly into standard architectures and training pipelines, allowing rationals to match or outperform fixed activations under identical architectures and optimizers.
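To make the central object concrete, the following is a minimal sketch of a low-degree rational activation $r(x) = P(x)/Q(x)$, written in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names `horner` and `rational_activation`, the type $(3,2)$ degrees, and the example coefficients are all hypothetical choices made here, and the abstract does not specify a parameterization. In a trainable setting the coefficient lists `p` and `q` would be learned alongside the network weights.

```python
def horner(coeffs, x):
    """Evaluate a polynomial at x; coefficients are listed lowest degree first."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result


def rational_activation(x, p, q):
    """Evaluate r(x) = P(x) / Q(x) for a scalar input x.

    p, q: numerator and denominator coefficients, lowest degree first.
    These would be the trainable parameters; the names are illustrative.
    """
    denominator = horner(q, x)
    if denominator == 0.0:
        # A pole on the real line; practical parameterizations usually
        # constrain Q to stay away from zero (assumption, not from the paper).
        raise ZeroDivisionError("denominator polynomial vanishes at x")
    return horner(p, x) / denominator


# Example coefficients (assumed): r(x) = x^3 / (1 + x^2), a type (3, 2) rational.
p = [0.0, 0.0, 0.0, 1.0]  # P(x) = x^3
q = [1.0, 0.0, 1.0]       # Q(x) = 1 + x^2, strictly positive on the reals

values = [rational_activation(x, p, q) for x in (-1.0, 0.0, 1.0, 2.0)]
```

With these example coefficients the activation adds only seven scalars per activation function (four in `p`, three in `q`), which is the sense in which such activations are "low-degree"; the choice $Q(x) = 1 + x^2$ is used here only because it has no real roots.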
