Function Fitting Based on Kolmogorov-Arnold Theorem and Kernel Functions

Jianpeng Liu, Qizhi Pan

TL;DR

This work establishes a unified kernel-based framework that reframes Kolmogorov-Arnold Networks and self-attention as kernel expansions, enabling a common function-fitting perspective anchored by $f(\mathbf{x})=\sum_{h=1}^{H}\phi_h(\sum_{d=1}^D\psi_{h,d}(x_d))$ with $H=2D+1$. It then develops kernel-based MHSA variants, notably a low-rank Pseudo-MHSA that reduces parameters by about 23% and a Gaussian-MHSA that validates nonlinear kernel benefits, both integrated into a ViT/MAE-style encoder. Empirical results on CIFAR-10 under MAE show Semi-Fusion achieving the highest accuracy among the variants (0.8243 versus a 0.8162 baseline), while Gaussian-MHSA offers a lightweight option with competitive performance. The work further demonstrates a convolutional interpretation of attention within the kernel framework and discusses implications for efficient Transformers and future extensions to larger multimodal tasks.

Abstract

This paper proposes a unified theoretical framework based on the Kolmogorov-Arnold representation theorem and kernel methods. By analyzing the mathematical relationship among kernels, the B-spline basis functions in Kolmogorov-Arnold Networks (KANs), and the inner product operation in self-attention mechanisms, we establish a kernel-based feature fitting framework that unifies the two models as linear combinations of kernel functions. Under this framework, we propose a low-rank Pseudo-Multi-Head Self-Attention module (Pseudo-MHSA), which reduces the parameter count of traditional MHSA by nearly 50%. Furthermore, we design a Gaussian kernel multi-head self-attention variant (Gaussian-MHSA) to validate the effectiveness of nonlinear kernel functions in feature extraction. Experiments on the CIFAR-10 dataset demonstrate that the Pseudo-MHSA model achieves performance comparable to a ViT model of the same dimensionality under the MAE framework, and visualization analysis reveals that the two models have similar multi-head distribution patterns. Our code is publicly available.

Paper Structure

This paper contains 22 sections, 2 theorems, 25 equations, 12 figures, 2 tables, 2 algorithms.

Key Result

Theorem 2.1

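For any continuous multivariate function $f(\mathbf{x}): \mathbb{R}^D \to \mathbb{R}$ defined on a bounded domain, there exist functions $\phi_h$ and $\psi_{h,d}$ such that $f(\mathbf{x})=\sum_{h=1}^{H}\phi_h\left(\sum_{d=1}^{D}\psi_{h,d}(x_d)\right)$ with $H=2D+1$, where all $\phi_h$ and $\psi_{h,d}$ are continuous univariate functions.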
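
The structure of the superposition $f(\mathbf{x})=\sum_{h=1}^{H}\phi_h(\sum_{d=1}^{D}\psi_{h,d}(x_d))$ with $H=2D+1$ can be illustrated with a minimal NumPy sketch. The theorem only asserts that suitable functions exist; the sine and tanh functions below are hypothetical placeholders chosen for illustration, not the functions the theorem guarantees and not the paper's construction. The helper name `ka_superposition` is likewise an assumption.

```python
import numpy as np

def ka_superposition(x, inner_fns, outer_fns):
    """Evaluate f(x) = sum_h phi_h( sum_d psi_{h,d}(x_d) ).

    x         : array of shape (D,)
    inner_fns : H lists, each holding D univariate callables psi_{h,d}
    outer_fns : list of H univariate callables phi_h
    """
    total = 0.0
    for phi_h, psi_row in zip(outer_fns, inner_fns):
        # Inner step: sum of univariate functions, one per input coordinate.
        inner_sum = sum(psi(x_d) for psi, x_d in zip(psi_row, x))
        # Outer step: one univariate function applied to the scalar sum.
        total += phi_h(inner_sum)
    return total

D = 2
H = 2 * D + 1  # number of outer terms required by the theorem

# Hypothetical placeholder functions (assumptions, for illustration only).
inner_fns = [
    [(lambda t, a=(h + 1) * (d + 1): np.sin(a * t)) for d in range(D)]
    for h in range(H)
]
outer_fns = [(lambda s, b=h + 1: np.tanh(b * s)) for h in range(H)]

x = np.array([0.3, -0.7])
value = ka_superposition(x, inner_fns, outer_fns)
```

Every function involved takes a single scalar argument; the only multivariate operation is addition, which is the property that KAN-style models exploit.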

Figures (12)

  • Figure 1: Inner/outer function process. In the left block, each cell is a vector of dimension $D$, and the ref matrix can consist of trainable parameters or be a fixed matrix such as the inputs themselves. In the middle block, each cell is a $D \times D$ matrix. In the kernel tensor, the cell at the $s$-th row and $r$-th column, denoted $\textbf{K}_{sr}$, is computed from $x_{s}$ and $ref_{r}$. With $\textbf{W}_{hr}$ denoting the cell at the $h$-th row and $r$-th column, the output $y_{sh}$ is calculated as in Lemma 2.2.
  • Figure 2: Architectural equivalence between self-attention and convolutional operations. (a) Inner Convolution: Each $D \times D$ block in $\mathbf{K}(\mathbf{X}, \mathbf{X})$ (reshaped accordingly) undergoes depthwise convolution with the kernel $\mathbf{W}_{\mathrm{attn}}$. (b) Outer Convolution: For output channel $e$, the outer kernel tensor $\mathbf{K}(\textbf{Attention Map}, \mathbf{X}^\top)$ (reshaped) is processed through strided convolutions with a set of concatenated identity-mapped kernels $\bigl[w_{e,1} I_S, \dots, w_{e,d} I_S\bigr]$ to generate the final output.
  • Figure 3: Attention mechanism and model encoder. The multi-head kernel self-attention mechanism comprises the In-projection step, the multi-head kernel self-attention as the inner function, softmax as normalization, and the outer kernel function combined with Out-projection as the outer function.
  • Figure 4: MAE Autoencoder. The basic framework of our model follows the original MAE design as closely as possible, but we replace the original ViT encoder and decoder with our model encoder and decoder. In the linear projection layer, a convolutional layer is employed to map image patches into sequence embeddings. The last layer of the decoder utilizes a transpose convolutional layer to directly reconstruct image patches from the decoder's output sequence embeddings.
  • Figure 5: Test accuracy on CIFAR-10.
  • ...and 7 more figures
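
The pipeline in the Figure 3 caption (in-projection, a kernel in place of the inner product, normalization, then an output step) can be sketched for a single head with a Gaussian kernel. This is a minimal sketch under stated assumptions, not the paper's Gaussian-MHSA: the function name, the bandwidth `sigma`, the random projection matrices, and the choice to row-normalize the Gaussian kernel values directly (which equals a softmax over negative scaled squared distances) are all assumptions. The out-projection is omitted.

```python
import numpy as np

def gaussian_kernel_attention(X, Wq, Wk, Wv, sigma=1.0):
    """Single-head attention whose scores come from a Gaussian kernel.

    X          : (S, D) sequence of S token embeddings
    Wq, Wk, Wv : (D, Dh) in-projection matrices (hypothetical parameters)
    sigma      : kernel bandwidth (assumed hyperparameter)
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv

    # Pairwise squared distances ||q_s - k_r||^2, clipped at zero to
    # guard against small negative values from floating-point rounding.
    sq_dists = (Q**2).sum(axis=1)[:, None] + (K**2).sum(axis=1)[None, :] - 2.0 * Q @ K.T
    sq_dists = np.maximum(sq_dists, 0.0)

    # Log of the Gaussian kernel exp(-||q - k||^2 / (2 sigma^2)).
    scores = -sq_dists / (2.0 * sigma**2)

    # Numerically stable softmax over keys; this row-normalizes the kernel.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)

    return weights @ V, weights

rng = np.random.default_rng(0)
S, D, Dh = 4, 8, 8
X = rng.standard_normal((S, D))
Wq, Wk, Wv = (rng.standard_normal((D, Dh)) for _ in range(3))

out, attn = gaussian_kernel_attention(X, Wq, Wk, Wv, sigma=2.0)
```

Replacing the `scores` line with `Q @ K.T / np.sqrt(Dh)` gives standard scaled dot-product attention, which makes the role of the kernel choice explicit.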

Theorems & Definitions (2)

  • Theorem 2.1: Kolmogorov-Arnold
  • Lemma 2.2: Superposition-Kernel Formulation