PolyKAN: Efficient Fused GPU Operators for Polynomial Kolmogorov-Arnold Network Variants

Mingkun Yu, Heming Zhong, Dan Huang, Yutong Lu, Jiazhi Jiang

TL;DR

This work tackles the practical bottleneck of low GPU utilization in polynomial Kolmogorov-Arnold Networks (KANs) by introducing PolyKAN, an open-source GPU operator library that fuses the forward and backward passes of KAN variants. Its optimization pipeline combines LUT-based basis evaluation, 2D tiling, two-stage reduction, and coefficient-layout reordering; the pipeline is variant-agnostic and applies to Chebyshev, Fourier, Legendre, and related bases. In experiments on an A100 and on consumer GPUs, PolyKAN achieves 1.2–10× faster inference and 1.4–12× faster training than a Triton + cuBLAS baseline while preserving accuracy, and end-to-end workloads show substantial epoch-time reductions and improved throughput. Chebyshev KAN serves as the case study, but the approach carries over to other polynomial bases and across GPU classes, offering a practical path to deploying interpretable, polynomial-based networks in AI for Science.

Abstract

Kolmogorov-Arnold Networks (KANs) promise higher expressive capability and stronger interpretability than Multi-Layer Perceptrons, particularly in the domain of AI for Science. However, practical adoption has been hindered by the low GPU utilization of existing parallel implementations. To address this challenge, we present a GPU-accelerated operator library named PolyKAN, which is the first general open-source implementation of KAN and its variants. PolyKAN fuses the forward and backward passes of polynomial KAN layers into a concise set of optimized CUDA kernels. Four orthogonal techniques underpin the design: (i) a lookup table with linear interpolation that replaces expensive runtime math-library functions; (ii) 2D tiling that exposes thread-level parallelism while preserving memory locality; (iii) a two-stage reduction scheme that converts scattered atomic updates into a single controllable merge step; and (iv) coefficient-layout reordering that yields unit-stride reads under the tiled schedule. Using a KAN variant, Chebyshev KAN, as a case study, PolyKAN delivers 1.2–10× faster inference and 1.4–12× faster training than a Triton + cuBLAS baseline, with identical accuracy on speech, audio-enhancement, and tabular-regression workloads on both high-end and consumer-grade GPUs.
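Technique (i) can be illustrated with a small CPU-side sketch. It assumes the table stores the Chebyshev values $T_d(u)$ themselves on a uniform grid over $[-1, 1]$; the abstract does not say what PolyKAN's table holds or how it is laid out, so the function names `build_chebyshev_lut` and `lut_chebyshev`, the grid, and the table size are all hypothetical.

```python
import math


def build_chebyshev_lut(max_degree, table_size):
    """Tabulate T_0..T_max_degree on a uniform grid over [-1, 1].

    Assumed layout: row i holds [T_0(u_i), ..., T_D(u_i)].
    """
    lut = []
    for i in range(table_size):
        u = -1.0 + 2.0 * i / (table_size - 1)
        u = max(-1.0, min(1.0, u))  # guard acos against rounding
        # Trigonometric form T_d(u) = cos(d * arccos(u)), paid once at build time.
        lut.append([math.cos(d * math.acos(u)) for d in range(max_degree + 1)])
    return lut


def lut_chebyshev(u, lut):
    """Approximate [T_0(u), ..., T_D(u)] by linear interpolation in the table."""
    table_size = len(lut)
    u = max(-1.0, min(1.0, u))
    pos = (u + 1.0) * 0.5 * (table_size - 1)  # fractional grid index
    lo = min(int(pos), table_size - 2)        # left neighbour, kept in range
    frac = pos - lo
    return [(1.0 - frac) * a + frac * b for a, b in zip(lut[lo], lut[lo + 1])]


# Usage: the lookup needs no trigonometric calls at evaluation time.
lut = build_chebyshev_lut(max_degree=4, table_size=1025)
basis = lut_chebyshev(math.tanh(0.7), lut)
```

A finer grid reduces the interpolation error at the cost of a larger table, so the table size is a tuning knob in this sketch.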

Paper Structure

This paper contains 34 sections, 1 theorem, 7 equations, 9 figures, 6 tables, 2 algorithms.

Key Result

Theorem 1

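Let $f: [0,1]^n \to \mathbb{R}$ be an arbitrary continuous function. Then there exist continuous single-variable functions $\Phi_q$ and $\phi_{q,p}$ such that, for all $x = (x_1, x_2, \ldots, x_n) \in [0,1]^n$, $f(x) = \sum_{q=0}^{2n} \Phi_q\left(\sum_{p=1}^{n} \phi_{q,p}(x_p)\right)$.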

Figures (9)

  • Figure 1: Architectural and theoretical comparison between traditional multi-layer perceptron (MLP) and Kolmogorov-Arnold Network (KAN).
  • Figure 2: The structure of the Kolmogorov-Arnold network.
  • Figure 3: Replacing a conventional feed-forward layer in deep-learning models with a ChebyKAN layer: the input $X$ is mapped elementwise by the Chebyshev-polynomial basis, producing the basis tensor $T$ with entries $T_{b,j,d}=T_d(\tanh(X_{b,j}))$. The tensor $T$ is then linearly contracted with the learnable coefficient tensor $C$ to yield the output $Y$.
  • Figure 4: The roofline model of the Kolmogorov-Arnold network.
  • Figure 5: The overall design of the KAN variant acceleration.
  • ...and 4 more figures
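
The ChebyKAN layer described in the Figure 3 caption can be written out as an unfused reference computation. This is a minimal sketch and not the fused PolyKAN kernel: the coefficient index order `C[j][o][d]` (input feature, output feature, degree) and the function names are assumptions, since the caption only states that $T$ is contracted with $C$.

```python
import math


def chebyshev_basis(u, degree):
    """Return [T_0(u), ..., T_degree(u)] via T_{d+1}(u) = 2u*T_d(u) - T_{d-1}(u)."""
    values = [1.0, u]
    for _ in range(2, degree + 1):
        values.append(2.0 * u * values[-1] - values[-2])
    return values[: degree + 1]


def chebykan_forward(X, C):
    """Reference forward pass of one ChebyKAN layer.

    X: nested list [batch][in_dim] of floats.
    C: nested list [in_dim][out_dim][degree + 1] (assumed layout).
    Returns Y as [batch][out_dim], with
    Y[b][o] = sum over j, d of T_d(tanh(X[b][j])) * C[j][o][d].
    """
    in_dim = len(C)
    out_dim = len(C[0])
    degree = len(C[0][0]) - 1
    Y = []
    for row in X:
        # Basis tensor slice for this sample: T[j][d] = T_d(tanh(x_j)).
        T = [chebyshev_basis(math.tanh(x), degree) for x in row]
        Y.append([
            sum(T[j][d] * C[j][o][d]
                for j in range(in_dim)
                for d in range(degree + 1))
            for o in range(out_dim)
        ])
    return Y


# Usage: batch of 1, two input features, one output feature, degree 2.
X = [[0.0, 0.5]]
C = [[[1.0, 2.0, 3.0]], [[0.0, 1.0, 0.0]]]
Y = chebykan_forward(X, C)
```

The sketch materializes the basis values `T` per sample before contracting them with `C`; a fused operator would aim to avoid writing that intermediate tensor to global memory.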

Theorems & Definitions (1)

  • Theorem 1: Kolmogorov-Arnold Theorem