KKANs: Kurkova-Kolmogorov-Arnold Networks and Their Learning Dynamics

Juan Diego Toscano, Li-Lian Wang, George Em Karniadakis

TL;DR

This work introduces KKANs, a two-block architecture that closely mirrors the Kolmogorov-Arnold representation while using learnable inner MLPs and flexible outer basis-function combinations. The authors prove that KKANs are universal approximators for continuous functions on $[0,1]^d$, and empirical results show that they outperform MLPs and cKANs in function approximation and neural-operator tasks, while matching state-of-the-art MLPs in physics-informed machine learning (PIML) when enhanced with self-scaled residual-based attention (ssRBA). The authors further analyze learning dynamics via information bottleneck theory, identifying fitting, transition, and diffusion stages and linking the best generalization to the diffusion stage and to geometric complexity. To sustain a high signal-to-noise ratio during training, they propose ssRBA and a gradient-magnitude balancing scheme for multi-term losses, improving convergence and robustness. Overall, KKANs provide a robust, interpretable, and scalable framework for scientific machine learning (SciML) across function approximation, PIML, and operator learning, with practical advantages in stability and performance.

Abstract

Inspired by the Kolmogorov-Arnold representation theorem and Kurkova's principle of using approximate representations, we propose the Kurkova-Kolmogorov-Arnold Network (KKAN), a new two-block architecture that combines robust multi-layer perceptron (MLP) based inner functions with flexible linear combinations of basis functions as outer functions. We first prove that KKAN is a universal approximator, and then we demonstrate its versatility across scientific machine-learning applications, including function regression, physics-informed machine learning (PIML), and operator-learning frameworks. The benchmark results show that KKANs outperform MLPs and the original Kolmogorov-Arnold Networks (KANs) in function approximation and operator learning tasks and achieve performance comparable to fully optimized MLPs for PIML. To better understand the behavior of the new representation models, we analyze their geometric complexity and learning dynamics using information bottleneck theory, identifying three universal learning stages, fitting, transition, and diffusion, across all types of architectures. We find a strong correlation between geometric complexity and signal-to-noise ratio (SNR), with optimal generalization achieved during the diffusion stage. Additionally, we propose self-scaled residual-based attention weights to maintain high SNR dynamically, ensuring uniform convergence and prolonged learning.
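
The two-block structure described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the inner functions here are plain tanh MLPs (not the enhanced-basis MLP of Figure 2), the outer functions are assumed to be Chebyshev expansions applied to $\tanh(\xi_q)$, and the names `init_mlp`, `mlp`, `kkan_forward`, `inner_params`, and `outer_coeffs` are hypothetical.

```python
import numpy as np
from numpy.polynomial.chebyshev import chebvander


def init_mlp(rng, sizes):
    """Random (weights, bias) pairs for a tanh MLP with layer widths `sizes`."""
    return [
        (rng.normal(0.0, 1.0 / np.sqrt(n_in), (n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]


def mlp(params, x):
    """Plain tanh MLP with a linear last layer (a stand-in for the inner block)."""
    for W, b in params[:-1]:
        x = np.tanh(x @ W + b)
    W, b = params[-1]
    return x @ W + b


def kkan_forward(inner_params, outer_coeffs, x):
    """Two-block forward pass: x has shape (n, d), the result has shape (n,)."""
    n, d = x.shape
    # Inner block: one MLP per input dimension, each mapping x_p to m values.
    # First combination layer: sum over input dimensions, xi_q = sum_p Psi_{p,q}(x_p).
    xi = sum(mlp(inner_params[p], x[:, p:p + 1]) for p in range(d))  # (n, m)
    # Outer block (assumed basis): Chebyshev polynomials of tanh(xi_q),
    # combined linearly with one coefficient row per outer function.
    degree = outer_coeffs.shape[1] - 1
    basis = chebvander(np.tanh(xi), degree)  # (n, m, degree + 1)
    # Final combination layer: sum the outer functions over q.
    return np.einsum("nqk,qk->n", basis, outer_coeffs)


rng = np.random.default_rng(0)
d, m, degree = 2, 5, 4  # illustrative sizes, not taken from the paper
inner_params = [init_mlp(rng, [1, 16, m]) for _ in range(d)]
outer_coeffs = rng.normal(size=(m, degree + 1))
x = rng.uniform(0.0, 1.0, size=(8, d))
y = kkan_forward(inner_params, outer_coeffs, x)
```

Each input coordinate is processed by its own univariate inner network, so the only interaction between coordinates happens through the sums, which is the feature the architecture shares with the Kolmogorov-Arnold representation.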

Paper Structure

This paper contains 76 sections, 1 theorem, 82 equations, 21 figures, 13 tables, 1 algorithm.

Key Result

Theorem 1

Let $d\ge 2.$ Assume that ${\mathcal{A}}_{M_z}(I_z)$ are dense in $C(I_z)$ for $z=g,\psi.$ Then the subset ${\mathbb K}_M^{m,d}$ (see Definition 1, Set of Approximators) is dense in $C(E^d)$ with $E^d=[0,1]^d$ in the sense that for any $f\in C(E^d)$ and any $\varepsilon>0,$ there exists $F\in {\mathbb K}_M^{m,d}$ such that $|f(\boldsymbol{x})-F(\boldsymbol{x})|<\varepsilon$ for all $\boldsymbol{x}\in E^d,$ where $\boldsymbol{x}=(x_1,\ldots,x_d).$
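
For context, the representation behind this density result can be written out explicitly. The first line is the classical Kolmogorov-Arnold representation of a continuous $f$ on $E^d$ with continuous univariate functions $g_q$ and $\psi_{p,q}$. The second line is an assumed form of the approximant $F$, pieced together from the Figure 1 caption (inner functions $\Psi_{p,q}$, outer functions $G_q$); the exact index range and the definition of ${\mathbb K}_M^{m,d}$ are given in the paper and are not reproduced here.

```latex
f(\boldsymbol{x}) = \sum_{q=0}^{2d} g_q\!\Big( \sum_{p=1}^{d} \psi_{p,q}(x_p) \Big),
\qquad \boldsymbol{x} \in E^d,
\\[4pt]
F(\boldsymbol{x}) = \sum_{q} G_q(\xi_q),
\qquad \xi_q = \sum_{p=1}^{d} \Psi_{p,q}(x_p).
```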

Figures (21)

  • Figure 1: KKAN-Inspired architecture. The inner block computes the inner functions by expanding each input dimension into an $m$-dimensional space. The first combination layer sums the inner functions across the input dimensions, i.e., $\xi_q = \sum_{p=1}^d \Psi_{p, q}(x_p)$, to obtain an $m$-dimensional vector $\bm{\xi} = [\xi_0, \ldots, \xi_m]$. The outer block computes the outer functions by transforming each $\xi_q$, and the final combination layer sums all the outer functions $G_q$, enabling the approximation of the target function in a way that closely mimics the Kolmogorov-Arnold representation theorem (KART).
  • Figure 2: Enhanced-basis MLP (ebMLP). Each inner block expands its respective input dimension into an $m$-dimensional space using an enhanced Multi-Layer Perceptron (MLP). The ebMLP incorporates two trainable Chebyshev layers that perform orthogonal expansions of the inputs ($x_p$) and outputs ($\beta_i$), improving the quality of the basis functions and enhancing the network's representation capabilities.
  • Figure 3: Performance of KKANs for discontinuous function approximation. Columns show predictions, ground truth references, and absolute errors, respectively. This function is particularly challenging to learn due to two discontinuities at $x_1=0.0$ and $x_2=0.0$, along with smooth regions containing relatively high frequencies. Additionally, the function exhibits a wide range of magnitudes, with outputs spanning from $-5$ to $25$. The KKAN model achieves a relative $L^2$ error of $5.86 \times 10^{-3}$.
  • Figure 4: Results for discontinuous function approximation. (a) Relative $L^2$ error convergence on the testing dataset, evaluated on a uniform $256\times256$ mesh. KKANs converge significantly faster than MLPs, achieving a relative $L^2$ error of $5.86\times10^{-3}$ after 200,000 Adam iterations. cKANs converge slightly faster than MLPs initially but start to overfit after several iterations, as indicated by a sudden increase in the test error. (b) Geometric complexity evolution during training. Geometric complexity, represented by the discrete Dirichlet energy, reflects the gradient of the function with respect to its inputs. For this case, the geometric complexity is significantly higher for all models due to the two discontinuities in the function, which introduce sharp changes and amplify gradient variations. Initially, cKANs exhibit lower complexity than the other methods. However, their final complexity is significantly higher, indicating overfitting. In contrast, KKANs maintain the lowest complexity throughout training, contributing to their superior generalization and performance.
  • Figure 5: Performance of KKANs+ssRBA for smooth function approximation. Columns show predictions, ground truth references, and absolute errors, respectively. This smooth function is challenging due to its rapidly varying gradients. The inclusion of ssRBA enhances convergence and accuracy, enabling the KKAN model to achieve a relative $L^2$ error of $1.75 \times 10^{-4}$.
  • ...and 16 more figures
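
The two quantities reported in the captions above, the relative $L^2$ error and the geometric complexity measured as a discrete Dirichlet energy, can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: the Dirichlet energy is taken here as the mean squared gradient norm, and the gradient is estimated with finite differences on a uniform grid, whereas the paper may differentiate the model directly. The function names `relative_l2_error` and `discrete_dirichlet_energy` are hypothetical.

```python
import numpy as np


def relative_l2_error(pred, ref):
    """Relative L2 error: ||pred - ref||_2 / ||ref||_2 over all grid points."""
    return float(np.linalg.norm(pred - ref) / np.linalg.norm(ref))


def discrete_dirichlet_energy(values, h1, h2):
    """Mean of |grad f|^2 over a uniform 2D grid (finite-difference estimate).

    `values[i, j]` is f at grid point (x1_i, x2_j); h1 and h2 are the grid
    spacings along the first and second axis.
    """
    df_dx1, df_dx2 = np.gradient(values, h1, h2)
    return float(np.mean(df_dx1**2 + df_dx2**2))


# Uniform 256 x 256 mesh on [0, 1]^2, matching the test mesh size in Figure 4.
n = 256
grid = np.linspace(0.0, 1.0, n)
x1, x2 = np.meshgrid(grid, grid, indexing="ij")
h = grid[1] - grid[0]

# Toy reference and a perturbed "prediction" (made-up data for illustration).
ref = 2.0 * x1 + 3.0 * x2
pred = 1.1 * ref

err = relative_l2_error(pred, ref)
energy = discrete_dirichlet_energy(ref, h, h)
```

For a function with sharp jumps, the finite-difference gradients near the discontinuities become very large, which is consistent with the caption's remark that discontinuities raise the geometric complexity of every model.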

Theorems & Definitions (2)

  • Definition 1: Set of Approximators
  • Theorem 1: Universal Approximation Theorem