Kolmogorov-Arnold Attention: Is Learnable Attention Better For Vision Transformers?

Subhajit Maity, Killian Hitsman, Xin Li, Aritra Dutta

TL;DR

The paper investigates whether learnable attention built from Kolmogorov-Arnold Networks (KANs) can improve Vision Transformers (ViTs). It introduces Fourier-based, low-rank Kolmogorov-Arnold Attention (KArAt) to replace the softmax in multi-head self-attention, enabling basis-flexible token interactions with a compact representation. Empirical results show notable gains for small ViTs on CIFAR-10/100 and competitive performance on ImageNet-1K, but larger ViTs exhibit diminishing improvements and higher memory/compute costs, highlighting scalability challenges. Across analyses of loss landscapes, spectral properties, and attention visualizations, KArAt demonstrates enhanced interpretability and richer token interactions for smaller models, while underscoring the need for efficiency-focused research to unlock its potential at scale.

Abstract

Kolmogorov-Arnold networks (KANs) are a remarkable innovation that consists of learnable activation functions, with the potential to capture more complex relationships from data. Presently, KANs are deployed by replacing multilayer perceptrons (MLPs) in deep networks, including advanced architectures such as vision Transformers (ViTs). This work asks whether KAN could learn token interactions. In this paper, we design the first learnable attention called Kolmogorov-Arnold Attention (KArAt) for ViTs that can operate on any basis, ranging from Fourier, Wavelets, Splines, to Rational Functions. However, learnable activations in the attention cause a memory explosion. To remedy this, we propose a modular version of KArAt that uses a low-rank approximation. By adopting the Fourier basis, Fourier-KArAt and its variants, in some cases, outperform their traditional softmax counterparts, or show comparable performance on CIFAR-10, CIFAR-100, and ImageNet-1K. We also deploy Fourier KArAt to ConViT and Swin-Transformer, and use it in detection and segmentation with ViT-Det. We dissect the performance of these architectures by analyzing their loss landscapes, weight distributions, optimizer paths, attention visualizations, and transferability to other datasets. KArAt's learnable activation yields a better attention score across all ViTs, indicating improved token-to-token interactions and contributing to enhanced inference. Still, its generalizability does not scale with larger ViTs. However, many factors, including the present computing interface, affect the relative performance of parameter- and memory-heavy KArAts. We note that the goal of this paper is not to produce efficient attention or challenge the traditional activations; by designing KArAt, we are the first to show that attention can be learned and encourage researchers to explore KArAt in conjunction with more advanced architectures.

Paper Structure

This paper contains 34 sections, 2 theorems, 8 equations, 26 figures, 22 tables, 2 algorithms.

Key Result

Theorem 1

(kolmogorov1956) For any multivariate continuous function $f:[0,1]^{n}\to\mathbb{R}$, there exists a finite composition of continuous single-variable functions $\phi_{q,p}:[0,1]\to\mathbb{R}$ and $\Phi_{q}:\mathbb{R}\to\mathbb{R}$ such that $f(x) = f(x_{1},x_{2},\cdots,x_{n}) = \sum_{q=1}^{2n+1}\Phi_{q}\left(\sum_{p=1}^{n}\phi_{q,p}(x_{p})\right)$.
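
To make the shape of this representation concrete, the sketch below numerically checks one hand-picked decomposition of $f(x_1,x_2)=x_1x_2$ into outer functions $\Phi_q$ applied to sums of inner functions $\phi_{q,p}(x_p)$. The theorem only guarantees that such functions exist; the particular choice here (based on the identity $x_1x_2 = ((x_1+x_2)^2-(x_1-x_2)^2)/4$, with the three unused outer functions set to zero) is an illustrative assumption, not a construction from the paper, and the names `inner`, `outer`, and `ka_product` are hypothetical.

```python
import numpy as np

def inner(q, p, x):
    """Hypothetical inner functions phi_{q,p}: [0,1] -> R."""
    if q == 1:
        return x                      # phi_{1,1}(x) = phi_{1,2}(x) = x
    if q == 2:
        return x if p == 1 else -x    # phi_{2,1}(x) = x, phi_{2,2}(x) = -x
    return 0.0 * x                    # unused terms contribute nothing

def outer(q, u):
    """Hypothetical outer functions Phi_q: R -> R."""
    if q == 1:
        return u ** 2 / 4.0
    if q == 2:
        return -(u ** 2) / 4.0
    return 0.0 * u                    # Phi_3, Phi_4, Phi_5 are identically zero

def ka_product(x1, x2):
    """Evaluate sum_{q=1}^{2n+1} Phi_q(sum_{p=1}^{n} phi_{q,p}(x_p)) for n = 2."""
    xs = (x1, x2)
    n = len(xs)
    total = 0.0
    for q in range(1, 2 * n + 2):     # q = 1, ..., 2n+1
        s = sum(inner(q, p, xs[p - 1]) for p in range(1, n + 1))
        total = total + outer(q, s)
    return total

# Compare against the direct product on a grid over [0, 1]^2.
grid = np.linspace(0.0, 1.0, 11)
x1, x2 = np.meshgrid(grid, grid)
max_err = np.max(np.abs(ka_product(x1, x2) - x1 * x2))
```

Here `max_err` should sit at floating-point round-off level, since every term is an exact algebraic identity. In a KAN the inner and outer functions are learned rather than chosen by hand.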

Figures (26)

  • Figure 1: (a) Model parameters vs. Top-1 accuracy in ImageNet-1K training of vanilla ViTs (dosovitskiy2020vit), Vision KAN (DeiT+KAN) by VisionKAN2024, ViT+KAN, and Kolmogorov-Arnold Transformer (KAT) by yang2024kolmogorov. (b-i) The traditional softmax attention. (b-ii) The Kolmogorov-Arnold Attention (KArAt) replaces the softmax with a learnable operator, $\Phi^{i,j}$. (b-iii) Regular KArAt uses an operator matrix, $\Phi^{i,j}$, with $N^2$ learnable units acting on each row of ${\cal A}^{i,j}$, and is prohibitively expensive. (b-iv) Modular KArAt uses an operator $\widehat{\Phi}^{i,j}\in\mathbb{R}^{N \times r}$ with $r\ll N$, followed by a learnable linear projector $W\in \mathbb{R}^{r \times N}$.
  • Figure 2: Different configurations to update $\widehat{\Phi}$: (a) blockwise configuration, where $\Phi^{i,1}\neq\Phi^{i,2}\neq\cdots\neq\Phi^{i,L}$ for all $i=1,2,...,h$; (b) universal configuration, where $\Phi^{i,1}=\Phi^{i,2}=\cdots=\Phi^{i,L}=\Phi^{i}$ for all $i=1,2,...,h$.
  • Figure 3: 3D visualization of the loss landscape for ViT-Tiny and ViT-Base along the two largest principal component directions of the successive change of model parameters. KArAt's loss landscapes are significantly less smooth than those of traditional attention; spiky loss landscapes are undesirable for optimization stability and the generalizability of the resulting model. See Figure \ref{fig:landscape_2D} for the loss contours and the optimizer trajectory.
  • Figure 4: ViT-Tiny attention map visualization. Original images for inference (left), the attention score (middle), and image regions of the dominant head. Top row: Fourier KArAt; bottom row: traditional MHSA.
  • Figure 5: Attention matrix ${\mathcal{A}^{i,j}}$ before softmax activation.
  • ...and 21 more figures
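
The captions for Figures 1 and 2 describe a low-rank learnable operator followed by a linear projector, and two ways of sharing that operator across layers. The NumPy sketch below illustrates both ideas on a single attention row. It is a minimal sketch under stated assumptions, not the authors' implementation: each learnable unit is taken to be a truncated Fourier series with `G` cosine and `G` sine coefficients and no bias, rows are treated as row vectors so that the shapes match $\widehat{\Phi}^{i,j}\in\mathbb{R}^{N\times r}$ and $W\in\mathbb{R}^{r\times N}$, and all function names, the initialization scale, and the treatment of `W` are hypothetical.

```python
import numpy as np

def fourier_operator(row, cos_coef, sin_coef):
    """Apply a Fourier-basis learnable operator to one attention row.

    row:      (n_in,) pre-activation attention scores for one token
    cos_coef: (n_out, n_in, G) cosine coefficients (assumed parameterization)
    sin_coef: (n_out, n_in, G) sine coefficients (assumed parameterization)
    Output k is sum over inputs q and frequencies m = 1..G of
    cos_coef[k, q, m-1] * cos(m * row[q]) + sin_coef[k, q, m-1] * sin(m * row[q]).
    """
    G = cos_coef.shape[-1]
    freqs = np.arange(1, G + 1)
    angles = row[:, None] * freqs[None, :]            # (n_in, G)
    return (np.einsum("kqm,qm->k", cos_coef, np.cos(angles))
            + np.einsum("kqm,qm->k", sin_coef, np.sin(angles)))

def modular_karat_row(row, cos_coef, sin_coef, W):
    """Low-rank variant: map N scores to r values, then project back to N."""
    hidden = fourier_operator(row, cos_coef, sin_coef)  # (r,)
    return hidden @ W                                   # (N,)

def regular_param_count(N, G):
    """N x N learnable units, each with 2G Fourier coefficients."""
    return 2 * G * N * N

def modular_param_count(N, r, G):
    """N x r learnable units plus the r x N linear projector."""
    return 2 * G * N * r + r * N

def make_operators(h, L, N, r, G, universal, seed=0):
    """Allocate coefficients for h heads and L layers.

    Blockwise: every (head, layer) pair gets its own coefficients.
    Universal: all layers of a head share the same coefficient arrays.
    """
    rng = np.random.default_rng(seed)

    def fresh():
        return (rng.normal(scale=0.1, size=(r, N, G)),
                rng.normal(scale=0.1, size=(r, N, G)))

    ops = {}
    for i in range(h):
        shared = fresh() if universal else None
        for j in range(L):
            ops[(i, j)] = shared if universal else fresh()
    return ops

# Usage with small, arbitrary sizes.
N, r, G = 8, 2, 3
rng = np.random.default_rng(0)
row = rng.normal(size=N)
W = rng.normal(size=(r, N))
ops = make_operators(h=2, L=3, N=N, r=r, G=G, universal=True)
out = modular_karat_row(row, *ops[(0, 0)], W)
print(out.shape)  # (8,)
```

Under these assumptions the regular operator needs `2 * G * N * N` coefficients per head and layer, while the modular one needs `2 * G * N * r + r * N`, which grows linearly in `N` for fixed `r`. The universal configuration divides the operator storage by a further factor of `L`.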

Theorems & Definitions (2)

  • Theorem 1: Kolmogorov-Arnold Representation Theorem
  • Theorem 2