XLSR-Kanformer: A KAN-Intergrated model for Synthetic Speech Detection

Phuong Tuan Dat, Tran Huy Dat

TL;DR

This work tackles synthetic speech detection for automatic speaker verification (ASV) systems by leveraging speech representations from self-supervised learning (SSL) models and introducing the Kanformer, an architecture that replaces the traditional MLP in XLSR-Conformer with Kolmogorov-Arnold Networks (KAN). By reconfiguring the Conformer into a Kanformer with KAN-based feed-forward and convolution modules (including ChebyKAN variants), the model discriminates better between bonafide and spoofed speech. Experiments on ASVspoof 2021 LA and DF show substantial improvements in EER and min t-DCF over strong baselines and demonstrate robustness across diverse SSL backbones. The results indicate that KAN-based architectures offer a promising path for enhancing synthetic speech detection and could generalize to other SSL-based speech tasks.
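
As a rough illustration of the KAN-for-MLP idea, the sketch below shows one common formulation of a Chebyshev-polynomial KAN layer (often called ChebyKAN) in plain Python, and how two such layers could stand in for the MLP of a feed-forward module. It is not the authors' implementation: the names `ChebyKANLayer` and `kan_feed_forward`, the tanh input squashing, the initialization scale, and all dimensions are assumptions made for illustration.

```python
import math
import random


def chebyshev_basis(x, degree):
    """Return [T_0(x), ..., T_degree(x)] via T_{k+1}(x) = 2x*T_k(x) - T_{k-1}(x)."""
    values = [1.0, x]
    for _ in range(2, degree + 1):
        values.append(2.0 * x * values[-1] - values[-2])
    return values[: degree + 1]


class ChebyKANLayer:
    """Toy layer: y_j = sum_i sum_k coeffs[i][j][k] * T_k(tanh(x_i)).

    Each input-output edge carries its own learnable univariate function,
    expressed here as a Chebyshev expansion (assumed formulation).
    """

    def __init__(self, in_dim, out_dim, degree, seed=0):
        rng = random.Random(seed)
        scale = 1.0 / (in_dim * (degree + 1))  # hypothetical init scale
        self.in_dim, self.out_dim, self.degree = in_dim, out_dim, degree
        self.coeffs = [
            [[rng.gauss(0.0, scale) for _ in range(degree + 1)]
             for _ in range(out_dim)]
            for _ in range(in_dim)
        ]

    def forward(self, x):
        if len(x) != self.in_dim:
            raise ValueError("expected %d inputs, got %d" % (self.in_dim, len(x)))
        # tanh squashes each input into (-1, 1), the domain of the Chebyshev basis.
        basis = [chebyshev_basis(math.tanh(xi), self.degree) for xi in x]
        return [
            sum(self.coeffs[i][j][k] * basis[i][k]
                for i in range(self.in_dim)
                for k in range(self.degree + 1))
            for j in range(self.out_dim)
        ]


def kan_feed_forward(x, expand, project):
    """Two stacked KAN layers in place of a Linear-activation-Linear MLP."""
    return project.forward(expand.forward(x))


# Usage: a 4-dimensional frame expanded to 8 hidden units and projected back.
expand = ChebyKANLayer(in_dim=4, out_dim=8, degree=3, seed=1)
project = ChebyKANLayer(in_dim=8, out_dim=4, degree=3, seed=2)
frame = [0.1, -0.2, 0.3, 0.4]
output = kan_feed_forward(frame, expand, project)
```

Unlike an MLP, this layer has no separate fixed activation between the two projections: the nonlinearity lives in the per-edge polynomial expansions, which is the property the Kanformer design relies on.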

Abstract

Recent advancements in speech synthesis technologies have led to increasingly sophisticated spoofing attacks, posing significant challenges for automatic speaker verification systems. While systems based on self-supervised learning (SSL) models, particularly the XLSR-Conformer architecture, have demonstrated remarkable performance in synthetic speech detection, there remains room for architectural improvements. In this paper, we propose a novel approach that replaces the traditional Multi-Layer Perceptron (MLP) in the XLSR-Conformer model with a Kolmogorov-Arnold Network (KAN), a powerful universal approximator based on the Kolmogorov-Arnold representation theorem. Our experimental results on ASVspoof2021 demonstrate that integrating KAN into the XLSR-Conformer model can improve performance by a relative 60.55% in Equal Error Rate (EER) on the LA and DF sets, and achieves 0.70% EER on the 21LA set. The proposed replacement is also robust across various SSL architectures. These findings suggest that incorporating KAN into SSL-based models is a promising direction for advances in synthetic speech detection.
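
For readers unfamiliar with the metrics quoted above, the following is a minimal sketch of how an Equal Error Rate and a relative improvement could be computed from detection scores. It assumes higher scores mean "more likely bonafide" and uses a simple threshold sweep without interpolation, so it may differ slightly from the official ASVspoof scoring tools. The function names `compute_eer` and `relative_reduction` and all numbers in the usage lines are hypothetical and not taken from the paper.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Approximate the EER by sweeping thresholds over all observed scores.

    A trial is accepted as bonafide when its score is >= threshold.
    Returns (eer, threshold), both as plain floats.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    # One extra threshold above every score, where everything is rejected.
    thresholds.append(thresholds[-1] + 1.0)

    best_gap, eer, eer_threshold = float("inf"), 1.0, thresholds[0]
    for t in thresholds:
        # False acceptance rate: spoofed trials wrongly accepted.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        # False rejection rate: bonafide trials wrongly rejected.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer, eer_threshold = gap, (far + frr) / 2.0, t
    return eer, eer_threshold


def relative_reduction(baseline, improved):
    """Relative improvement in percent, e.g. for comparing two EER values."""
    return (baseline - improved) / baseline * 100.0


# Usage with made-up scores (not results from the paper).
bonafide = [0.9, 0.8, 0.7, 0.3]
spoof = [0.1, 0.2, 0.4, 0.75]
eer, threshold = compute_eer(bonafide, spoof)
gain = relative_reduction(baseline=2.0, improved=1.0)
```

A relative figure such as the 60.55% in the abstract is this kind of ratio: it measures how much of the baseline error was removed, not an absolute drop in percentage points.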

Paper Structure

This paper contains 18 sections, 4 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: The overall architecture of XLSR-Conformer.
  • Figure 2: The proposed architecture: (a) Kanformer block; (b) Kanformer feed-forward module; (c) Kanformer convolution module.