Initialization Schemes for Kolmogorov-Arnold Networks: An Empirical Study

Spyros Rigas, Dhruv Verma, Georgios Alexandridis, Yixuan Wang

TL;DR

This study systematically analyzes initialization schemes for spline-based Kolmogorov-Arnold Networks (KANs), proposing LeCun-inspired variants, a Glorot-inspired initialization, and an empirical power-law family. Through large-scale grid searches on function fitting and forward PDE benchmarks, complemented by Neural Tangent Kernel (NTK) analyses and Feynman dataset experiments, the authors show that power-law initialization yields the strongest and most robust gains across tasks and model sizes, while the Glorot-inspired initialization offers reliable improvements for parameter-rich architectures. LeCun-inspired schemes provide limited benefits, particularly in smaller models. The results establish practical initialization guidelines for KANs and highlight the importance of initialization in enabling fast convergence and accurate function representation in spline-based architectures.

Abstract

Kolmogorov-Arnold Networks (KANs) are a recently introduced neural architecture that replaces fixed nonlinearities with trainable activation functions, offering enhanced flexibility and interpretability. While KANs have been applied successfully across scientific and machine learning tasks, their initialization strategies remain largely unexplored. In this work, we study initialization schemes for spline-based KANs, proposing two theory-driven approaches inspired by LeCun and Glorot, as well as an empirical power-law family with tunable exponents. Our evaluation combines large-scale grid searches on function fitting and forward PDE benchmarks, an analysis of training dynamics through the lens of the Neural Tangent Kernel, and evaluations on a subset of the Feynman dataset. Our findings indicate that the Glorot-inspired initialization significantly outperforms the baseline in parameter-rich models, while power-law initialization achieves the strongest performance overall, both across tasks and for architectures of varying size. All code and data accompanying this manuscript are publicly available at https://github.com/srigas/KAN_Initialization_Schemes.
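
To make the idea of a power-law family with tunable exponents concrete, the following is a minimal sketch and not the authors' implementation. It assumes a spline-based KAN layer with two parameter groups (a residual weight per edge and a vector of spline basis coefficients per edge), and it assumes that each group is drawn from a zero-mean Gaussian whose standard deviation decays as a power of `fan_in * n_basis`. The function names, the choice of base quantity, and the pairing of $\alpha$ with the residual weights and $\beta$ with the basis coefficients are all hypothetical; the exact definitions are given in the paper and its repository.

```python
import random


def power_law_std(fan_in: int, n_basis: int, exponent: float) -> float:
    """Standard deviation that decays as a power of the layer size.

    ASSUMPTION: the size is taken to be fan_in * n_basis. The paper's
    exact base quantity may differ.
    """
    if fan_in <= 0 or n_basis <= 0:
        raise ValueError("fan_in and n_basis must be positive")
    return float(fan_in * n_basis) ** (-exponent)


def init_power_law_layer(fan_in, fan_out, n_basis, alpha, beta, seed=0):
    """Draw residual weights and spline coefficients for one layer.

    ASSUMPTION: alpha scales the residual weights and beta scales the
    spline basis coefficients. Both groups are zero-mean Gaussians.
    """
    rng = random.Random(seed)
    sigma_res = power_law_std(fan_in, n_basis, alpha)
    sigma_basis = power_law_std(fan_in, n_basis, beta)
    # residual[o][i]: one scalar weight per (output, input) edge
    residual = [
        [rng.gauss(0.0, sigma_res) for _ in range(fan_in)]
        for _ in range(fan_out)
    ]
    # basis[o][i][g]: n_basis spline coefficients per (output, input) edge
    basis = [
        [[rng.gauss(0.0, sigma_basis) for _ in range(n_basis)]
         for _ in range(fan_in)]
        for _ in range(fan_out)
    ]
    return residual, basis


# Example with the exponent values quoted in the figure captions.
residual, basis = init_power_law_layer(
    fan_in=2, fan_out=8, n_basis=8, alpha=0.25, beta=1.75
)
```

Under these assumptions, a larger exponent shrinks the corresponding parameter group faster as the layer grows, which is what makes the two exponents independent tuning knobs in a grid search.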

Paper Structure

This paper contains 28 sections, 38 equations, 23 figures, 5 tables.

Figures (23)

  • Figure 1: Training loss curves for function fitting benchmarks under baseline, LeCun-numerical, Glorot, and power-law ($\alpha = 0.25, \beta = 1.75$) initializations. Results are averaged over five seeds, with shaded regions indicating the standard error. Top row: "small" architecture ($G=5$, two hidden layers with 8 neurons each). Bottom row: "large" architecture ($G=20$, three hidden layers with 32 neurons each).
  • Figure 2: Training loss curves for forward PDE benchmarks under baseline, LeCun-numerical, Glorot, and power-law ($\alpha = 0.25, \beta = 1.75$) initializations. Results are averaged over five seeds, with shaded regions indicating the standard error. Top row: "small" architecture ($G=5$, two hidden layers with 8 neurons each). Bottom row: "large" architecture ($G=20$, three hidden layers with 32 neurons each).
  • Figure 3: Eigenvalue spectra of the NTK matrix at initialization (solid blue), intermediate iterations (dashed teal), and final iteration (dashed green) for function fitting benchmark $f_3(x,y)$ under different initialization strategies. Results correspond to the "large" architecture ($G=20$, three hidden layers with 32 neurons each). The power-law initialization uses $\alpha = 0.25, \beta = 1.75$.
  • Figure 4: NTK eigenvalue spectra for the Allen-Cahn PDE benchmark under baseline, LeCun-numerical, Glorot, and power-law ($\alpha = 0.25, \beta = 1.75$) initializations. Top row: spectra corresponding to the PDE residual term. Bottom row: spectra for the boundary/initial condition terms. Solid blue lines show the initialization, dashed teal lines show intermediate iterations, and dashed green lines show the final iteration. Results correspond to the "large" architecture ($G=20$, three hidden layers with 32 neurons each).
  • Figure 5: Reference surfaces for the five two-dimensional target functions $f_1$ through $f_5$ used in the function fitting experiments.
  • ...and 18 more figures
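
The NTK eigenvalue spectra in Figures 3 and 4 can be illustrated with a small, dependency-free sketch. It is not the authors' code: a toy one-hidden-layer tanh network stands in for a KAN, and all function names are hypothetical. The empirical NTK on a set of inputs is the Gram matrix of parameter gradients, $K_{ij} = \nabla_\theta f(x_i) \cdot \nabla_\theta f(x_j)$, and its eigenvalues are obtained here with cyclic Jacobi rotations only to avoid external libraries; in practice a linear algebra routine would be used instead.

```python
import math
import random


def param_gradient(x, w, b, v):
    """Gradient of f(x) = sum_j v[j] * tanh(w[j] * x + b[j]) w.r.t. (w, b, v)."""
    grad_w, grad_b, grad_v = [], [], []
    for wj, bj, vj in zip(w, b, v):
        t = math.tanh(wj * x + bj)
        grad_w.append(vj * (1.0 - t * t) * x)
        grad_b.append(vj * (1.0 - t * t))
        grad_v.append(t)
    return grad_w + grad_b + grad_v


def empirical_ntk(xs, w, b, v):
    """Gram matrix K[i][j] = <grad f(x_i), grad f(x_j)> over all parameters."""
    grads = [param_gradient(x, w, b, v) for x in xs]
    return [[sum(p * q for p, q in zip(g1, g2)) for g2 in grads] for g1 in grads]


def symmetric_eigenvalues(matrix, max_sweeps=100, tol=1e-24):
    """Eigenvalues of a symmetric matrix via cyclic Jacobi rotations."""
    a = [list(row) for row in matrix]
    n = len(a)
    for _ in range(max_sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        total = sum(a[i][j] ** 2 for i in range(n) for j in range(n))
        if off <= tol * total:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                diff = a[q][q] - a[p][p]
                if diff == 0.0:
                    theta = math.copysign(math.pi / 4.0, a[p][q])
                else:
                    theta = 0.5 * math.atan(2.0 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # A <- A J (mix columns p and q)
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # A <- J^T A (mix rows p and q)
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)


# Toy setup: 16 hidden units, 5 scalar inputs (all values are illustrative).
rng = random.Random(0)
width = 16
w = [rng.gauss(0.0, 1.0) for _ in range(width)]
b = [rng.gauss(0.0, 1.0) for _ in range(width)]
v = [rng.gauss(0.0, width ** -0.5) for _ in range(width)]
xs = [-1.0 + 0.5 * i for i in range(5)]
spectrum = symmetric_eigenvalues(empirical_ntk(xs, w, b, v))
print(spectrum)  # eigenvalues sorted from largest to smallest
```

Recomputing `spectrum` from the parameters at initialization, at intermediate iterations, and at the final iteration gives the kind of comparison the captions describe, with the initialization scheme determining the starting spectrum.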