ViKANformer: Embedding Kolmogorov Arnold Networks in Vision Transformers for Pattern-Based Learning

Shreyas S, Akshath M

TL;DR

This work enhances Vision Transformers by replacing their MLP blocks with Kolmogorov-Arnold Network (KAN) expansions to improve nonlinear expressiveness. The ViKANformer framework plugs dimension-wise univariate mappings into ViT layers, drawing on the Kolmogorov-Arnold theorem, which represents multivariate continuous functions through sums and compositions of univariate functions. On MNIST, SineKAN, Fast-KAN, and a carefully tuned Vanilla KAN achieve about 97–98% test accuracy, with FourierKAN and Efficient-KAN showing strong performance but varying training overhead; Flash Attention offers speedups with trade-offs that require tuning. Overall, the study demonstrates the viability of KAN-based feed-forward designs within Transformer pipelines and points to scaling and efficiency improvements for broader, real-world tasks.

Abstract

Vision Transformers (ViTs) have significantly advanced image classification by applying self-attention on patch embeddings. However, the standard MLP blocks in each Transformer layer may not capture complex nonlinear dependencies optimally. In this paper, we propose ViKANformer, a Vision Transformer in which we replace the MLP sub-layers with Kolmogorov-Arnold Network (KAN) expansions, including Vanilla KAN, Efficient-KAN, Fast-KAN, SineKAN, and FourierKAN, while also examining a Flash Attention variant. By leveraging the Kolmogorov-Arnold theorem, which guarantees that multivariate continuous functions can be expressed via finite sums and compositions of univariate continuous functions, we aim to boost representational power. Experimental results on MNIST demonstrate that SineKAN, Fast-KAN, and a well-tuned Vanilla KAN can achieve over 97% accuracy, albeit with increased training overhead. This trade-off suggests that KAN expansions may be beneficial if the computational cost is acceptable. We detail the expansions, present training/test accuracy and F1/ROC metrics, and provide pseudocode and hyperparameters for reproducibility. Finally, we compare ViKANformer to a simple MLP and a small CNN baseline on MNIST, illustrating the efficiency of Transformer-based methods even on a small-scale dataset.
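
For reference, the Kolmogorov-Arnold representation invoked above is usually stated in the following textbook form. The symbols $f$, $\Phi_q$, and $\varphi_{q,p}$ are standard notation chosen here for illustration and are not taken from the paper itself.

```latex
% For every continuous f : [0,1]^n -> R there exist continuous univariate
% functions \Phi_q : R -> R and \varphi_{q,p} : [0,1] -> R such that
\[
  f(x_1, \dots, x_n) \;=\; \sum_{q=0}^{2n} \Phi_q\!\left( \sum_{p=1}^{n} \varphi_{q,p}(x_p) \right).
\]
```

A KAN layer can be read as a learnable, finite-width relaxation of this structure: the inner and outer univariate functions are parameterized (for example by splines, sines, or Fourier terms) and fitted by gradient descent rather than being the exact functions whose existence the theorem guarantees.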

Paper Structure

This paper contains 25 sections, 2 equations, 4 figures, 2 tables, 1 algorithm.

Figures (4)

  • Figure 1: ViKANformer Overview. We show two Transformer blocks with their self-attention sub-layer. The feed-forward sub-layer (normally an MLP) is replaced by a dimension-wise KAN expansion. Various KAN variants (Sine, Fourier, etc.) can be plugged in.
  • Figure 2: Training Accuracy vs. Epochs on MNIST. SineKAN, Fast-KAN, and Vanilla KAN exceed 95–97% by epoch 5–6.
  • Figure 3: Test Accuracy vs. Epochs on MNIST. SineKAN and Fast-KAN reach 97–98% by epoch 10, with Vanilla KAN close behind.
  • Figure 4: All expansions eventually surpass 0.95 F1, with SineKAN and Fast-KAN frequently reaching 0.98+ and ROC AUC near 1.0.
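
The idea in Figure 1, swapping the feed-forward MLP for a dimension-wise expansion, can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `SineKANLayer`, the helper `kan_feed_forward`, the fixed integer frequencies, the random phases, and the residual connection without layer normalization are all hypothetical choices made for this sketch.

```python
import math
import random


class SineKANLayer:
    """Toy dimension-wise sine expansion (hypothetical, for illustration only).

    Each input coordinate x_i passes through K univariate sine features,
    which are then mixed linearly:
        y_j = b_j + sum_i sum_k W[j][i][k] * sin(freq[k] * x_i + phase[i][k])
    """

    def __init__(self, d_in, d_out, num_freqs=4, seed=0):
        rng = random.Random(seed)
        self.d_in, self.d_out, self.num_freqs = d_in, d_out, num_freqs
        # Assumption of this sketch: fixed integer frequencies 1..K.
        self.freq = [float(k + 1) for k in range(num_freqs)]
        # One phase per (input coordinate, frequency) pair.
        self.phase = [[rng.uniform(0.0, math.pi) for _ in range(num_freqs)]
                      for _ in range(d_in)]
        scale = 1.0 / math.sqrt(d_in * num_freqs)
        # weight[j][i][k] mixes feature k of input i into output j.
        self.weight = [[[rng.gauss(0.0, scale) for _ in range(num_freqs)]
                        for _ in range(d_in)]
                       for _ in range(d_out)]
        self.bias = [0.0] * d_out

    def forward(self, x):
        # Univariate features: every sine sees exactly one input coordinate.
        feats = [[math.sin(self.freq[k] * x[i] + self.phase[i][k])
                  for k in range(self.num_freqs)]
                 for i in range(self.d_in)]
        return [
            self.bias[j] + sum(self.weight[j][i][k] * feats[i][k]
                               for i in range(self.d_in)
                               for k in range(self.num_freqs))
            for j in range(self.d_out)
        ]


def kan_feed_forward(tokens, up, down):
    """Apply a two-layer KAN feed-forward to each token, plus a residual add."""
    out = []
    for tok in tokens:
        hidden = up.forward(tok)        # d_model -> d_hidden
        update = down.forward(hidden)   # d_hidden -> d_model
        out.append([t + u for t, u in zip(tok, update)])
    return out


# Usage: four toy "patch tokens" of width 8, standing in for attention output.
d_model, d_hidden = 8, 16
up = SineKANLayer(d_model, d_hidden, seed=1)
down = SineKANLayer(d_hidden, d_model, seed=2)
tokens = [[0.1 * (i + j) for j in range(d_model)] for i in range(4)]
out = kan_feed_forward(tokens, up, down)
print(len(out), len(out[0]))
```

The layer keeps the token shape unchanged, so it could slot into the position a standard MLP occupies after self-attention. The other variants named in the paper (Fourier, spline-based, and so on) would differ mainly in which univariate basis replaces the sine features.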