PowerMLP: An Efficient Version of KAN

Ruichen Qiu, Yibo Miao, Shiwen Wang, Lijia Yu, Yifan Zhu, Xiao-Shan Gao

TL;DR

The paper tackles the slow training and inference associated with Kolmogorov-Arnold Networks (KAN) by introducing PowerMLP, an MLP-like architecture that uses a non-iterative spline representation via $k$-th power ReLU activations and a basis function. The authors prove that PowerMLP defines the same function space as KAN over bounded intervals and a strictly larger space over the real line, while achieving significantly lower FLOPs than KAN. Empirically, PowerMLP attains higher accuracy on many tasks and trains roughly 40 times faster than KAN across AI-for-science, ML, NLP, and CV benchmarks. This yields a practical, efficient alternative to KAN with broad applicability as a drop-in architectural substitute while retaining or enhancing expressiveness. The work also provides a theoretical bridge between spline-based activation schemes and polynomial activations, underpinning the transferability of PowerMLP into existing architectures like CNNs and Transformers.

Abstract

The Kolmogorov-Arnold Network (KAN) is a new network architecture known for its high accuracy in several tasks such as function fitting and PDE solving. The superior expressive capability of KAN arises from the Kolmogorov-Arnold representation theorem and learnable spline functions. However, the computation of spline functions involves multiple iterations, which renders KAN significantly slower than MLP, thereby increasing the cost associated with model training and deployment. The authors of KAN have also noted that "the biggest bottleneck of KANs lies in its slow training. KANs are usually 10x slower than MLPs, given the same number of parameters." To address this issue, we propose a novel MLP-type neural network PowerMLP that employs simpler non-iterative spline function representation, offering approximately the same training time as MLP while theoretically demonstrating stronger expressive power than KAN. Furthermore, we compare the FLOPs of KAN and PowerMLP, quantifying the faster computation speed of PowerMLP. Our comprehensive experiments demonstrate that PowerMLP generally achieves higher accuracy and a training speed about 40 times faster than KAN in various tasks.

Paper Structure

This paper contains 34 sections, 13 theorems, 41 equations, 8 figures, 4 tables.

Key Result

Lemma 1

If $t_u \neq t_v$ for all $u \neq v$, then the $k$-order B-spline on the knot sequence $t=(t_j, \cdots, t_{j+k+1})$ can be represented as a linear combination of $\sigma_{k}$ functions, where $\sigma_k$ denotes the $k$-th power of ReLU.
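As a hedged illustration of the lemma, the sketch below numerically checks one classical way to write such a combination: the truncated-power (divided-difference) form of a B-spline with distinct knots, compared against the iterative Cox-de Boor recursion. It assumes the usual Cox-de Boor normalization, $\sigma_k(x)=\max(x,0)^k$, and $k \geq 1$. All function names are illustrative, and the coefficients may differ from the paper's own statement by a normalization factor.

```python
def relu_k(x, k):
    """sigma_k(x) = max(x, 0)**k, the k-th power of ReLU (used here for k >= 1)."""
    return max(x, 0.0) ** k


def bspline_recursive(x, t, k, j=0):
    """Degree-k B-spline on knots t[j..j+k+1], via the Cox-de Boor recursion."""
    if k == 0:
        return 1.0 if t[j] <= x < t[j + 1] else 0.0
    left = (x - t[j]) / (t[j + k] - t[j]) * bspline_recursive(x, t, k - 1, j)
    right = ((t[j + k + 1] - x) / (t[j + k + 1] - t[j + 1])
             * bspline_recursive(x, t, k - 1, j + 1))
    return left + right


def relu_k_coefficients(t):
    """Coefficients c_i with B(x) = sum_i c_i * sigma_k(x - t[i]).

    Assumed form (classical truncated-power identity, distinct knots):
    c_i = (t[-1] - t[0]) / prod_{m != i} (t[m] - t[i]).
    """
    span = t[-1] - t[0]
    coeffs = []
    for i, ti in enumerate(t):
        denom = 1.0
        for m, tm in enumerate(t):
            if m != i:
                denom *= tm - ti  # nonzero because the knots are distinct
        coeffs.append(span / denom)
    return coeffs


def bspline_relu_k(x, t, k):
    """Non-iterative evaluation: a single linear combination of ReLU-k terms."""
    coeffs = relu_k_coefficients(t)
    return sum(c * relu_k(x - ti, k) for c, ti in zip(coeffs, t))


# Usage: k + 2 distinct knots define one degree-k B-spline.
knots = [0.0, 0.7, 1.5, 2.1, 3.0]
k = len(knots) - 2
max_gap = max(
    abs(bspline_recursive(n / 20, knots, k) - bspline_relu_k(n / 20, knots, k))
    for n in range(-20, 81)
)
print("largest difference on the grid:", max_gap)
```

The recursion needs $k$ nested passes per evaluation, whereas the ReLU-$k$ form is a single weighted sum. This is the kind of non-iterative representation the abstract refers to.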

Figures (8)

  • Figure 1: PowerMLPs define a strictly larger function space than KANs over $\mathbb{R}^n$ (Corollary \ref{cor-k2p1}), and define the same function space over $[-E,E]^n$ for any $E\in\mathbb{R}_+$ (Corollary \ref{cor-p2k}), where $n$ is the input dimension. $\mathcal{P}_{d,w,k,p}$ is the set of all PowerMLP networks with depth $d$, width $w$, $k$-th power ReLU activation function, and $p$ nonzero parameters. ${\mathcal{K}}_{d,w,k,G,p}$ is the set of all KAN networks with depth $d$, width $w$, using $(k,G)$-spline (see Eq. \ref{eq-sfun}), and $p$ nonzero parameters.
  • Figure 2: Structure of a 3-layer PowerMLP. The first two layers are calculated by: (1) affine transformation, (2) $k$-th power of ReLU activation, (3) addition with a basis function. The last layer contains only an affine transformation.
  • Figure 3: Representing a PowerMLP layer with a 2-layer KAN. $\delta_{ij}$ equals $1$ if $i=j$ and $0$ otherwise. The first layer represents the affine transformation $y_q=\sum_{p=1}^n(\omega_{q,p}x_p+\gamma_{q,p})$ for $1\leq q\leq m$ and keeps $y_q=x_{q-m}$ for $m+1\leq q\leq m+n$. The second layer represents the ReLU-$k$ activation and adds the basis function: $z_r=\sigma_k(y_r)+\sum_{q=m+1}^{m+n}\alpha_{r,q-m}b(y_q)$.
  • Figure 4: In the upper figure, PowerMLP correctly finds that $3$ of the $17$ geometric invariants influence the output. Additionally, PowerMLP outperforms KAN in $15$ of $17$ input cases, while KAN fails to converge with Symmetry $D_3$ or $D_8$ as input. In the bottom figure, trained on part or all of the $3$ influential geometric invariants, PowerMLP achieves much higher test accuracy than KAN in $3$ cases.
  • Figure 5: Test accuracy of three networks on multiple classification tasks.
  • ...and 3 more figures
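
The layer structure described in the captions of Figures 2 and 3 can be sketched as a small forward pass. This is a minimal illustration under stated assumptions, not the authors' implementation: the basis function $b$ is taken to be SiLU (the captions only say "a basis function"), the bias is a single vector per layer, and every function and parameter name is hypothetical.

```python
import math


def silu(v):
    """Assumed basis function b(v) = v * sigmoid(v); the captions do not fix b."""
    return v / (1.0 + math.exp(-v))


def affine(W, bias, x):
    """y = W x + bias, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W, bias)]


def powermlp_layer(x, W, bias, alpha, k, basis=silu):
    """One hidden layer: z_r = ReLU(y_r)**k + sum_p alpha[r][p] * b(x_p).

    Follows the three steps in the Figure 2 caption: affine transformation,
    k-th power of ReLU, then addition of the basis-function term.
    """
    y = affine(W, bias, x)
    bx = [basis(xi) for xi in x]  # the basis term acts on the layer input
    return [
        max(yr, 0.0) ** k + sum(a * b for a, b in zip(alpha_row, bx))
        for yr, alpha_row in zip(y, alpha)
    ]


def powermlp_forward(x, hidden_layers, out_W, out_bias, k):
    """Hidden layers as above; the last layer is an affine transformation only."""
    for W, bias, alpha in hidden_layers:
        x = powermlp_layer(x, W, bias, alpha, k)
    return affine(out_W, out_bias, x)


# Usage: a tiny network with 2 inputs, one hidden layer of width 2, 1 output.
hidden = [([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], [[0.1, 0.0], [0.0, 0.1]])]
print(powermlp_forward([1.0, -2.0], hidden, [[1.0, 1.0]], [0.5], k=2))
```

Every step is a matrix-vector product or an elementwise operation, with no spline recursion involved.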

Theorems & Definitions (21)

  • Definition 1: B-spline
  • Definition 2: Spline Function
  • Definition 3: PowerMLP
  • Lemma 1: Represent the B-spline with powers of ReLU
  • Theorem 2: KAN is a Subset of PowerMLP
  • Corollary 3
  • Lemma 4: Affine Transformation
  • Lemma 5: ReLU-$k$ Function
  • Theorem 6: PowerMLP is a subset of KAN over interval
  • Corollary 7
  • ...and 11 more