KKANs: Kurkova-Kolmogorov-Arnold Networks and Their Learning Dynamics
Juan Diego Toscano, Li-Lian Wang, George Em Karniadakis
TL;DR
This work introduces KKANs, a two-block architecture that closely mirrors the Kolmogorov-Arnold representation: learnable MLPs serve as the inner functions, and flexible linear combinations of basis functions serve as the outer functions. The authors prove that KKANs are universal approximators, and their benchmarks show KKANs outperforming MLPs and cKANs in function approximation and neural-operator tasks while matching state-of-the-art MLPs in physics-informed machine learning (PIML) when enhanced with self-scaled residual-based attention (ssRBA). They also analyze learning dynamics through information bottleneck theory, identifying fitting, transition, and diffusion stages, reporting a strong correlation between geometric complexity and signal-to-noise ratio (SNR), and finding that the best generalization occurs during the diffusion stage. To sustain a high SNR during training, they propose ssRBA weights and a gradient-magnitude balancing scheme for multi-term losses, which improve convergence and robustness. Overall, KKANs offer a robust, interpretable, and scalable framework for scientific machine learning across function approximation, PIML, and operator learning.
Abstract
Inspired by the Kolmogorov-Arnold representation theorem and Kurkova's principle of using approximate representations, we propose the Kurkova-Kolmogorov-Arnold Network (KKAN), a new two-block architecture that combines robust multi-layer perceptron (MLP) based inner functions with flexible linear combinations of basis functions as outer functions. We first prove that KKAN is a universal approximator, and then we demonstrate its versatility across scientific machine-learning applications, including function regression, physics-informed machine learning (PIML), and operator-learning frameworks. The benchmark results show that KKANs outperform MLPs and the original Kolmogorov-Arnold Networks (KANs) in function approximation and operator learning tasks and achieve performance comparable to fully optimized MLPs for PIML. To better understand the behavior of the new representation models, we analyze their geometric complexity and learning dynamics using information bottleneck theory, identifying three universal learning stages, fitting, transition, and diffusion, across all types of architectures. We find a strong correlation between geometric complexity and signal-to-noise ratio (SNR), with optimal generalization achieved during the diffusion stage. Additionally, we propose self-scaled residual-based attention weights to maintain high SNR dynamically, ensuring uniform convergence and prolonged learning.
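The two-block structure described above can be illustrated with a minimal sketch. It follows the shape of the Kolmogorov-Arnold representation, f(x_1, ..., x_d) = sum_q Phi_q(sum_p psi_{p,q}(x_p)), with a small MLP per input coordinate as the inner block and a linear combination of basis functions as the outer block. The details are assumptions made for illustration and are not taken from the paper: the Chebyshev basis, the tanh squashing before the outer block, the layer sizes, the random initialization, and all function names (`init_kkan`, `kkan_forward`, and so on) are hypothetical.

```python
import math
import random


def init_mlp(rng, hidden, out_dim):
    """Random weights for a scalar-input MLP with one tanh hidden layer."""
    return {
        "w1": [rng.gauss(0.0, 1.0) for _ in range(hidden)],
        "b1": [0.0] * hidden,
        "w2": [[rng.gauss(0.0, 1.0 / math.sqrt(hidden)) for _ in range(hidden)]
               for _ in range(out_dim)],
        "b2": [0.0] * out_dim,
    }


def mlp_forward(params, x):
    """Map one scalar coordinate x_p to the values psi_{p,q}(x_p) for all q."""
    h = [math.tanh(w * x + b) for w, b in zip(params["w1"], params["b1"])]
    return [sum(w * hj for w, hj in zip(row, h)) + b
            for row, b in zip(params["w2"], params["b2"])]


def chebyshev(z, degree):
    """Chebyshev polynomials T_0..T_degree at z, via the three-term recurrence."""
    values = [1.0, z]
    for _ in range(2, degree + 1):
        values.append(2.0 * z * values[-1] - values[-2])
    return values[: degree + 1]


def init_kkan(d, m, hidden=8, degree=5, seed=0):
    """Hypothetical KKAN: d inner MLPs (one per input), m outer functions."""
    rng = random.Random(seed)
    inner = [init_mlp(rng, hidden, m) for _ in range(d)]
    outer = [[rng.gauss(0.0, 1.0 / (degree + 1)) for _ in range(degree + 1)]
             for _ in range(m)]
    return {"inner": inner, "outer": outer}


def kkan_forward(model, x):
    """Evaluate sum_q Phi_q( sum_p psi_{p,q}(x_p) ) for one input vector x."""
    m = len(model["outer"])
    # Inner block: sum the per-coordinate MLP outputs over the input index p.
    xi = [0.0] * m
    for mlp, x_p in zip(model["inner"], x):
        for q, value in enumerate(mlp_forward(mlp, x_p)):
            xi[q] += value
    # Outer block: each Phi_q is a linear combination of basis functions.
    total = 0.0
    for q, coeffs in enumerate(model["outer"]):
        z = math.tanh(xi[q])  # assumption: squash into the Chebyshev domain [-1, 1]
        basis = chebyshev(z, len(coeffs) - 1)
        total += sum(c * t for c, t in zip(coeffs, basis))
    return total


model = init_kkan(d=2, m=5)
y = kkan_forward(model, [0.3, -0.7])
```

In this sketch the number of outer functions `m` is a free hyperparameter, whereas the classical theorem uses 2d + 1 terms. The tanh before the outer block is one simple way to keep the basis arguments inside [-1, 1]. Training (for example by gradient descent on the MLP weights and the basis coefficients) is omitted.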
