From KAN to GR-KAN: Advancing Speech Enhancement with KAN-Based Methodology

Haoyang Li, Yuchen Hu, Chen Chen, Sabato Marco Siniscalchi, Songting Liu, Eng Siong Chng

TL;DR

This work investigates Kolmogorov-Arnold Networks (KAN) and their group-based rational variant (GR-KAN) for speech enhancement. It shows that standard KAN struggles to scale to complex SE tasks, while GR-KAN achieves consistent improvements when integrated into both time-frequency domain SE (MP-SENet) and time-domain SE (Demucs), with up to fourfold reductions in trainable parameters. On VoiceBank-DEMAND, GR-KAN yields PESQ enhancements up to approximately $0.1$ over baselines and outperforms KAN in the same settings. Overall, GR-KAN emerges as a potent, parameter-efficient alternative to conventional activations in SE, with potential applicability to broader speech-generation models.

Abstract

Deep neural network (DNN)-based speech enhancement (SE) usually relies on conventional activation functions, which lack the expressiveness to capture the complex multiscale structures needed for high-fidelity SE. Group-Rational KAN (GR-KAN), a variant of Kolmogorov-Arnold Networks (KAN), retains KAN's expressiveness while improving scalability on complex tasks. We adapt GR-KAN to existing DNN-based SE models in two ways: by replacing dense layers with GR-KAN layers in the time-frequency (T-F) domain MP-SENet, and by integrating GR-KAN activations into the 1D CNN layers of the time-domain Demucs. Results on VoiceBank-DEMAND show that GR-KAN requires up to 4x fewer parameters while improving PESQ by up to 0.1. In contrast, KAN, which faces scalability issues, outperforms MLP on a small-scale signal modeling task but fails to improve MP-SENet. We demonstrate the first successful use of KAN-based methods for consistent improvement in both time-domain and state-of-the-art (SoTA) T-F domain SE, establishing GR-KAN as a promising alternative for SE.
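
To make the idea of a group-rational activation concrete, here is a minimal pure-Python sketch. The abstract does not give the GR-KAN formula, so this assumes the "safe" rational form $P(x) / (1 + |Q(x)|)$ with one set of learnable coefficients shared by all channels in a group. The function names `rational` and `group_rational`, the coefficient layout, and the example values are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def rational(x: float, a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed safe rational form: P(x) / (1 + |Q(x)|).

    P(x) = a[0] + a[1]*x + a[2]*x^2 + ...
    Q(x) = b[0]*x + b[1]*x^2 + ...   (no constant term, so the
    denominator is always >= 1 and never hits a pole)
    """
    # Evaluate P(x) with Horner's rule.
    p = 0.0
    for coeff in reversed(a):
        p = p * x + coeff
    # Evaluate Q(x) with Horner's rule, keeping the constant term at zero.
    q = 0.0
    for coeff in reversed(b):
        q = (q + coeff) * x
    return p / (1.0 + abs(q))


def group_rational(
    features: Sequence[float],
    num_groups: int,
    a_per_group: Sequence[Sequence[float]],
    b_per_group: Sequence[Sequence[float]],
) -> List[float]:
    """Apply one shared rational function per contiguous channel group."""
    channels = len(features)
    if channels % num_groups != 0:
        raise ValueError("channels must be divisible by num_groups")
    group_size = channels // num_groups
    return [
        rational(v, a_per_group[i // group_size], b_per_group[i // group_size])
        for i, v in enumerate(features)
    ]


# Hypothetical example: 4 channels, 2 groups.
# Group 0 uses P(x) = x, Q(x) = 0   -> identity.
# Group 1 uses P(x) = x, Q(x) = x   -> x / (1 + |x|).
out = group_rational(
    [1.0, -3.0, 1.0, -3.0],
    num_groups=2,
    a_per_group=[[0.0, 1.0], [0.0, 1.0]],
    b_per_group=[[], [1.0]],
)
print(out)  # [1.0, -3.0, 0.5, -0.75]
```

In this sketch the number of activation coefficients scales with the number of groups rather than the number of channels, which illustrates why sharing across groups keeps a learnable activation cheap. How this relates to the parameter reductions reported above depends on architectural details not given here.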

Paper Structure

This paper contains 13 sections, 4 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Architecture of (a) the overall MP-SENet and (b) the GR-KAN-adapted GRU-Transformer block.
  • Figure 2: Architecture of the GR-KAN-adapted causal Demucs, in which all ReLU activations in the encoder and decoder blocks are replaced with GR-KAN activations. Note that the last decoder block has no GR-KAN activation.
  • Figure 3: Comparison of MLP (ReLU), MLP (GELU), and GR-KAN on fitting an artificial signal with speech dynamics.