"KAN you hear me?" Exploring Kolmogorov-Arnold Networks for Spoken Language Understanding

Alkis Koudounas, Moreno La Quatra, Eliana Pastor, Sabato Marco Siniscalchi, Elena Baralis

TL;DR

This work investigates Kolmogorov-Arnold Networks (KANs) for Spoken Language Understanding by integrating KAN layers into CNN-based feature extractors on spectrogram inputs and then transferring the best configuration to transformer-based SLU models. Five architectural configurations are evaluated to determine the optimal placement of KAN layers relative to linear layers, revealing that FKF (a KAN layer between two feed-forward, i.e. linear, layers) often yields the strongest performance without extra cost. Across five SLU datasets (including English, Italian, German, and French), KAN-enhanced transformers generalize well and produce more human-aligned explanations on input regions. The results support using layers with learnable activation functions as a viable alternative to standard linear layers in SLU, with robust cross-architecture and multilingual applicability.

Abstract

Kolmogorov-Arnold Networks (KANs) have recently emerged as a promising alternative to traditional neural architectures, yet their application to speech processing remains underexplored. This work presents the first investigation of KANs for Spoken Language Understanding (SLU) tasks. We experiment with 2D-CNN models on two datasets, integrating KAN layers in five different configurations within the dense block. The best-performing setup, which places a KAN layer between two linear layers, is directly applied to transformer-based models and evaluated on five SLU datasets of increasing complexity. Our results show that KAN layers can effectively replace the linear layers, achieving comparable or superior performance in most cases. Finally, we provide insights into how KAN and linear layers on top of transformers attend differently to input regions of the raw waveforms.
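
To make the "KAN layer between two linear layers" setup more concrete, here is a minimal pure-Python sketch of such a dense block. It is an illustration under stated assumptions, not the authors' implementation: the names `KANLayer`, `FKFBlock`, and `linear` are hypothetical, the per-edge learnable functions use a Gaussian radial-basis expansion rather than the B-splines of the original KAN formulation, and the `tanh` before the KAN layer is a choice of this sketch to keep inputs inside the basis grid.

```python
import math
import random


def linear(x, W, b):
    """Standard affine layer: y_j = sum_i W[j][i] * x_i + b[j]."""
    return [sum(w * xi for w, xi in zip(row, x)) + bj for row, bj in zip(W, b)]


def make_linear(n_in, n_out, rng):
    """Small random weights and zero biases for an affine layer."""
    W = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out


class KANLayer:
    """Simplified KAN-style layer (assumption: Gaussian RBF basis, not B-splines).

    Every edge i -> j has its own learnable univariate function
        phi_ji(x) = sum_k coef[j][i][k] * exp(-((x - grid[k]) / h) ** 2)
    and output j is the sum of phi_ji(x_i) over all inputs i.
    """

    def __init__(self, n_in, n_out, n_basis=5, lo=-1.0, hi=1.0, rng=None):
        rng = rng if rng is not None else random.Random(0)
        self.h = (hi - lo) / (n_basis - 1)                 # basis width = grid spacing
        self.grid = [lo + k * self.h for k in range(n_basis)]
        self.coef = [[[rng.gauss(0.0, 0.1) for _ in range(n_basis)]
                      for _ in range(n_in)] for _ in range(n_out)]

    def __call__(self, x):
        # Evaluate every basis function once per input coordinate.
        basis = [[math.exp(-((xi - g) / self.h) ** 2) for g in self.grid] for xi in x]
        return [sum(sum(c * b for c, b in zip(c_ji, b_i))
                    for c_ji, b_i in zip(c_j, basis))
                for c_j in self.coef]


class FKFBlock:
    """Linear -> KAN -> Linear head, mirroring the FKF layout described above."""

    def __init__(self, n_in, hidden, n_classes, seed=0):
        rng = random.Random(seed)
        self.W1, self.b1 = make_linear(n_in, hidden, rng)
        self.kan = KANLayer(hidden, hidden, rng=rng)
        self.W2, self.b2 = make_linear(hidden, n_classes, rng)

    def __call__(self, x):
        # tanh bounds activations to [-1, 1], the range covered by the grid
        # (an assumption of this sketch, not taken from the paper).
        h = [math.tanh(a) for a in linear(x, self.W1, self.b1)]
        h = self.kan(h)
        return linear(h, self.W2, self.b2)


if __name__ == "__main__":
    block = FKFBlock(n_in=8, hidden=4, n_classes=3)
    features = [0.1 * i for i in range(8)]   # stand-in for pooled encoder features
    print(block(features))                   # three unnormalized class scores
```

The contrast with a plain linear layer is that the learnable parameters sit in the per-edge functions (`coef`) rather than in a single weight per edge followed by a fixed activation. In practice the coefficients would be trained by gradient descent in a framework with automatic differentiation; this sketch only shows the forward pass.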

Paper Structure

This paper contains 10 sections, 3 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Overview of the proposed configurations. FFF is the baseline with a fully-connected MLP. The other configurations show five alternatives that integrate linear and KAN layers in different ways.
  • Figure 2: Ablation on hidden size. Fixed (a) and variable (b) hidden size study on FKF configuration.
  • Figure 3: Example of word-level explanations for FFF and FKF predictions; Timers and Such, SimpleMath intent.