PolyKAN: Efficient Fused GPU Operators for Polynomial Kolmogorov-Arnold Network Variants

Mingkun Yu; Heming Zhong; Dan Huang; Yutong Lu; Jiazhi Jiang

TL;DR

This work tackles low GPU utilization, a practical bottleneck for polynomial Kolmogorov-Arnold Networks (KANs), by introducing PolyKAN, an open-source GPU operator library that fuses the forward and backward passes of KAN variants. It presents a general optimization pipeline (LUT-based basis evaluation, 2D tiling, two-stage reduction, and coefficient-layout reordering) that is variant-agnostic and applies to Chebyshev, Fourier, Legendre, and related bases. In experiments on A100 and consumer GPUs, PolyKAN achieves 1.2–10× faster inference and 1.4–12× faster training than a Triton + cuBLAS baseline while preserving accuracy; end-to-end workloads show substantial epoch-time reductions and improved throughput. Because the pipeline is not tied to Chebyshev polynomials and is portable across GPU classes, it offers a practical path to deploying interpretable, polynomial-based networks in AI for Science.
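The two-stage reduction named above can be illustrated with a minimal CPU-side sketch in Python. It assumes a layer that is linear in its coefficients, so the gradient for coefficient k is the batch sum of (upstream gradient × basis value). The function name `coeff_grad_two_stage`, the `tile` parameter, and the choice to tile over the batch dimension are hypothetical and for illustration only; this is not PolyKAN's actual kernel.

```python
def coeff_grad_two_stage(basis, grad_out, tile):
    """Accumulate grad[k] = sum_b grad_out[b] * basis[b][k] in two stages.

    basis[b][k] : value of the k-th basis function for sample b
    grad_out[b] : upstream gradient for sample b
    tile        : number of samples handled by one (hypothetical) thread block
    """
    num_k = len(basis[0])

    # Stage 1: each tile builds a private partial sum. On a GPU this would
    # live in registers or shared memory, so no global atomics are needed.
    partials = []
    for start in range(0, len(basis), tile):
        acc = [0.0] * num_k
        for b in range(start, min(start + tile, len(basis))):
            for k in range(num_k):
                acc[k] += grad_out[b] * basis[b][k]
        partials.append(acc)

    # Stage 2: a single merge over the per-tile partials, in a fixed order.
    return [sum(p[k] for p in partials) for k in range(num_k)]
```

Compared with letting every sample add directly into a shared gradient buffer, this pattern confines contention to one merge step whose order and cost are easy to control.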

Abstract

Kolmogorov-Arnold Networks (KANs) promise higher expressive capability and stronger interpretability than Multi-Layer Perceptrons (MLPs), particularly in the domain of AI for Science. However, practical adoption has been hindered by the low GPU utilization of existing parallel implementations. To address this challenge, we present a GPU-accelerated operator library named PolyKAN, which is the first general open-source implementation of KAN and its variants. PolyKAN fuses the forward and backward passes of polynomial KAN layers into a concise set of optimized CUDA kernels. Four orthogonal techniques underpin the design: (i) a lookup table with linear interpolation that replaces expensive runtime math-library functions; (ii) 2D tiling that exposes thread-level parallelism while preserving memory locality; (iii) a two-stage reduction scheme that converts scattered atomic updates into a single controllable merge step; and (iv) coefficient-layout reordering that yields unit-stride reads under the tiled schedule. Using one KAN variant, Chebyshev KAN, as a case study, PolyKAN delivers 1.2–10× faster inference and 1.4–12× faster training than a Triton + cuBLAS baseline, with identical accuracy on speech, audio-enhancement, and tabular-regression workloads on both high-end and consumer-grade GPUs.
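Technique (i) can be sketched in a few lines of Python. The sketch assumes Chebyshev polynomials of the first kind sampled on a uniform grid over [-1, 1]; the table size, the helper names (`chebyshev_exact`, `build_lut`, `chebyshev_lut`), and the clamping of inputs are illustrative assumptions rather than details taken from PolyKAN.

```python
def chebyshev_exact(x, degree):
    """Return [T_0(x), ..., T_degree(x)] via T_{k+1} = 2x*T_k - T_{k-1}."""
    values = [1.0, x]
    for _ in range(2, degree + 1):
        values.append(2.0 * x * values[-1] - values[-2])
    return values[: degree + 1]


def build_lut(degree, table_size):
    """Precompute the basis values on a uniform grid over [-1, 1]."""
    step = 2.0 / (table_size - 1)
    return [chebyshev_exact(-1.0 + i * step, degree) for i in range(table_size)]


def chebyshev_lut(x, lut):
    """Approximate the basis at x by interpolating between two table rows."""
    table_size = len(lut)
    x = max(-1.0, min(1.0, x))                 # clamp into the table's domain
    pos = (x + 1.0) * 0.5 * (table_size - 1)   # fractional row index
    lo = min(int(pos), table_size - 2)         # keep lo + 1 inside the table
    frac = pos - lo
    return [a + frac * (b - a) for a, b in zip(lut[lo], lut[lo + 1])]
```

With this layout, each evaluation costs two table reads and one multiply-add per basis function. The interpolation error shrinks quadratically with the grid spacing, so the table size trades memory for accuracy.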
