Rational ANOVA Networks

Jusheng Zhang, Ningyuan Liu, Qinhan Lyu, Jing Yang, Keze Wang

TL;DR

Rational-ANOVA Networks (RAN) fuse a functional-ANOVA decomposition with Padé-style rational units to model a target function as a sum of main effects and sparse pairwise interactions. By enforcing strictly positive denominators and employing residual gating, RAN achieves stable deep optimization and improved extrapolation compared with fixed activations and spline-based alternatives. The architecture serves as a drop-in replacement for FFNs in models like Vision Transformers, yielding a better accuracy-efficiency trade-off under matched budgets and enabling explicit control over interaction topology. Across visual benchmarks, large-scale ViT integrations, and real-world denoising, RAN demonstrates consistent gains and robustness, while ablations and theory explain its stability and the benefits of sparse connectivity. The work also highlights RAN’s potential for automated symbolic discovery and interpretable rational dynamics in scientific modeling tasks.

Abstract

Deep neural networks typically treat nonlinearities as fixed primitives (e.g., ReLU), limiting both interpretability and the granularity of control over the induced function class. While recent additive models (like KANs) attempt to address this using splines, they often suffer from computational inefficiency and boundary instability. We propose the Rational-ANOVA Network (RAN), a foundational architecture grounded in functional ANOVA decomposition and Padé-style rational approximation. RAN models f(x) as a composition of main effects and sparse pairwise interactions, where each component is parameterized by a stable, learnable rational unit. Crucially, we enforce a strictly positive denominator, which avoids poles and numerical instability while capturing sharp transitions and near-singular behaviors more efficiently than polynomial bases. This ANOVA structure provides an explicit low-order interaction bias for data efficiency and interpretability, while the rational parameterization significantly improves extrapolation. Across controlled function benchmarks and vision classification tasks (e.g., CIFAR-10) under matched parameter and compute budgets, RAN matches or surpasses parameter-matched MLPs and learnable-activation baselines, with better stability and throughput. Code is available at https://github.com/jushengzhang/Rational-ANOVA-Networks.git.
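
The decomposition described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the polynomial degrees, and the bilinear feature map used for the pairwise units are assumptions made for illustration. Only the overall shape (main effects plus sparse pairwise interactions, each a rational unit with a strictly positive denominator) comes from the text.

```python
import numpy as np
from numpy.polynomial.polynomial import polyval

def softplus(z):
    # Numerically stable softplus: log(1 + exp(z)), always > 0.
    return np.logaddexp(0.0, z)

def rational_1d(x, p, q):
    # Main-effect unit: P(x) / (1 + softplus(q(x))).
    # p, q are polynomial coefficients, lowest degree first (assumed layout).
    return polyval(x, p) / (1.0 + softplus(polyval(x, q)))

def rational_2d(xi, xj, p, q):
    # Pairwise unit on hypothetical bilinear features [1, xi, xj, xi*xj].
    feats = np.stack([np.ones_like(xi), xi, xj, xi * xj], axis=-1)
    return (feats @ p) / (1.0 + softplus(feats @ q))

def ran_forward(X, main_params, pair_params, bias=0.0):
    # f(x) = bias + sum_i R_i(x_i) + sum_{(i,j) in E} R_ij(x_i, x_j),
    # where E is the sparse set of interacting feature pairs.
    out = np.full(X.shape[0], bias, dtype=float)
    for i, (p, q) in enumerate(main_params):
        out += rational_1d(X[:, i], p, q)
    for (i, j), (p, q) in pair_params.items():
        out += rational_2d(X[:, i], X[:, j], p, q)
    return out

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
main_params = [(rng.normal(size=4), rng.normal(size=3)) for _ in range(3)]
pair_params = {(0, 2): (rng.normal(size=4), rng.normal(size=4))}  # one sparse edge
y = ran_forward(X, main_params, pair_params)
```

Because every denominator is at least 1, the output stays finite for any finite parameters, and the keys of `pair_params` make the interaction topology explicit.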

Paper Structure (86 sections, 6 theorems, 43 equations, 8 figures, 12 tables)

Key Result

Lemma 11.1

For any finite weight configuration $\{\mathbf{w}^P, \mathbf{w}^Q\}$, the function $\phi(x)$ is $C^\infty$-smooth on the entire domain $\mathbb{R}$. Specifically, $\phi(x)$ admits no poles.
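
A short sketch of why this holds, assuming the unit has the form suggested by the Figure 2 caption, $\phi(x) = P(x)/Q(x)$ with $Q(x) = 1 + \operatorname{softplus}(q(x))$, where $P$ and $q$ are polynomials with coefficients $\mathbf{w}^P$ and $\mathbf{w}^Q$ (the symbol $q$ and this exact parameterization are assumptions, not taken from the lemma statement):

$$
\operatorname{softplus}(z) = \log\!\left(1 + e^{z}\right) > 0 \quad \forall z \in \mathbb{R}
\qquad\Longrightarrow\qquad
Q(x) = 1 + \operatorname{softplus}(q(x)) > 1 \quad \forall x \in \mathbb{R}.
$$

Under that assumption, $P$, $q$, and softplus are all $C^\infty$, so $Q$ is $C^\infty$ and never vanishes. The quotient $P/Q$ of smooth functions with a nonvanishing denominator is again $C^\infty$, so $\phi$ has no poles on $\mathbb{R}$.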

Figures (8)

  • Figure 1: Comparison of RAN with MLPs and KANs. MLPs use fixed activations; KANs learn edge splines. RAN employs learnable rational units in a Functional ANOVA topology, decomposing $f$ into main effects ($P_i/Q_i$) and sparse interactions ($P_{ij}/Q_{ij}$).
  • Figure 2: Deep Rational-ANOVA Network (RAN) architecture. Left: A deep backbone of stacked residual blocks; each block performs sparse pairwise message passing (interactions) then node-wise updates. Right: Learnable rational units. $R_{1\text{D}}$ and $R_{2\text{D}}$ use residual gating $y=x+\alpha(R(x)-x)$ for identity initialization. Denominators are positive ($1+\text{softplus}(\cdot)$) for stability and pole-free composition.
  • Figure 3: Learning Dynamics Comparison. Similar to how RLHF/DPO dynamics affect probability mass, we visualize how structural choices affect function updates $\Delta f$. (a) MLP Entanglement: An update at $x_u$ (red arrow) causes uncontrolled shifts at distant $x_o$ (dashed arrow) due to dense kernel mixing. (b) RAN Locality (Ours): The ANOVA structure (Eq. \ref{eq:kernel_struct}) disentangles interactions; updates at $x_u$ leave uncorrelated regions $x_o$ stable. (c) Rational Stability: Under strong "squeezing" gradients (steep transitions), polynomials oscillate (Runge's phenomenon), while RAN's rational units (Eq. \ref{eq:rational_jacobian_bound}) fit smoothly due to denominator-controlled derivatives.
  • Figure 4: Real-World Denoising Efficiency on PolyU. PSNR (y-axis) vs. parameter count (x-axis, log scale). Each point corresponds to a budgeted model instance. RAN achieves a strong accuracy-efficiency trade-off and lies on the Pareto frontier.
  • Figure 5: Performance vs. Efficiency on TabArena. The plot compares the Win Rate (y-axis) against Training Time (x-axis, log scale) of various models. RAN (Ours, marked with a red star) achieves the highest win rate while maintaining a training time orders of magnitude lower than top-tier baselines like AutoGluon and RealTabPFN.
  • ...and 3 more figures
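
The residual gating from the Figure 2 caption, $y = x + \alpha(R(x) - x)$, can be sketched as follows. This is an illustrative sketch only: the function name `gated_rational`, the polynomial parameterization, and treating $\alpha$ as a plain scalar are assumptions. The caption's phrase "identity initialization" suggests $\alpha$ starts at 0, which is what the example checks.

```python
import numpy as np
from numpy.polynomial.polynomial import polyval

def softplus(z):
    # Numerically stable softplus: log(1 + exp(z)).
    return np.logaddexp(0.0, z)

def gated_rational(x, p, q, alpha):
    # Rational unit with a denominator of the form 1 + softplus(.), so it is >= 1.
    r = polyval(x, p) / (1.0 + softplus(polyval(x, q)))
    # Residual gate: alpha = 0 gives the identity map, alpha = 1 gives R(x).
    return x + alpha * (r - x)

x = np.linspace(-3.0, 3.0, 7)
p = np.array([0.5, 1.0, -0.2])   # hypothetical numerator coefficients
q = np.array([0.1, 0.0, 0.3])    # hypothetical denominator coefficients

y_init = gated_rational(x, p, q, alpha=0.0)   # equals x: the block starts as identity
y_full = gated_rational(x, p, q, alpha=1.0)   # equals the ungated rational unit
```

Starting from the identity means a deep stack of such blocks initially passes its input through unchanged, which is one plausible reading of why the gate helps deep optimization.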

Theorems & Definitions (12)

  • Lemma 11.1: Global Regularity and Pole-Freeness
  • proof
  • Lemma 11.3: Polynomial Derivative Bounds
  • proof
  • Theorem 11.4: Explicit Lipschitz Constant
  • proof
  • Corollary 11.5: Network-Level Jacobian Bound
  • proof
  • Proposition 12.1: Jacobian Spectrum Control
  • proof
  • ...and 2 more