AASIST3: KAN-Enhanced AASIST Speech Deepfake Detection using SSL Features and Additional Regularization for the ASVspoof 2024 Challenge

Kirill Borodin, Vasiliy Kudryavtsev, Dmitrii Korzh, Alexey Efimenko, Grach Mkrtchian, Mikhail Gorodnichev, Oleg Y. Rogov

TL;DR

This work addresses the vulnerability of automatic speaker verification (ASV) to synthetic speech by introducing AASIST3, an extension of AASIST that adds Kolmogorov-Arnold networks (KANs), extra regularization, and pre-emphasis to improve spoof detection. It combines SincConv and Wav2Vec2 frontends with a multi-branch, HS-GAL-based graph attention architecture to capture temporal and spatial features. Key contributions include the KAN-GAL, KAN-GraphPool, and KAN-HS-GAL modules, the exploration of B-spline-based KAN functions, and the integration of SSL representations for the open condition. AASIST3 reaches minDCF values of 0.5357 in the closed condition and 0.1414 in the open condition, which the authors report as a more than twofold improvement in performance. The approach targets ASV security against TTS- and VC-based deepfakes in the ASVspoof 2024 challenge.

Abstract

Automatic Speaker Verification (ASV) systems, which identify speakers based on their voice characteristics, have numerous applications, such as user authentication in financial transactions, exclusive access control in smart devices, and forensic fraud detection. However, the advancement of deep learning algorithms has enabled the generation of synthetic audio through Text-to-Speech (TTS) and Voice Conversion (VC) systems, exposing ASV systems to potential vulnerabilities. To counteract this, we propose a novel architecture named AASIST3. By enhancing the existing AASIST framework with Kolmogorov-Arnold networks, additional layers, encoders, and pre-emphasis techniques, AASIST3 achieves a more than twofold improvement in performance. It demonstrates minDCF results of 0.5357 in the closed condition and 0.1414 in the open condition, significantly enhancing the detection of synthetic voices and improving ASV security.
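The abstract names pre-emphasis as one of the additions but does not give its form here. As a rough illustration only, the conventional first-order pre-emphasis filter is y[n] = x[n] - alpha * x[n-1], which attenuates slowly varying (low-frequency) content. The function name `pre_emphasis` and the coefficient `alpha = 0.97` below are assumptions (0.97 is a common default in speech processing), not values taken from the paper.

```python
from typing import List, Sequence


def pre_emphasis(signal: Sequence[float], alpha: float = 0.97) -> List[float]:
    """Apply a first-order pre-emphasis filter.

    y[0] = x[0]
    y[n] = x[n] - alpha * x[n-1]   for n >= 1

    `alpha` is an assumed, commonly used default; the paper's value
    is not stated in this section.
    """
    if len(signal) == 0:
        return []
    # The first sample has no predecessor, so it is passed through unchanged.
    out = [float(signal[0])]
    # Each later sample is reduced by a scaled copy of the previous sample.
    for prev, cur in zip(signal, signal[1:]):
        out.append(cur - alpha * prev)
    return out


if __name__ == "__main__":
    # A constant (zero-frequency) signal is strongly attenuated after
    # the first sample, which is the intended high-pass behaviour.
    print(pre_emphasis([1.0, 1.0, 1.0, 1.0], alpha=0.5))
```

A constant input is a convenient sanity check: every sample after the first shrinks to `1 - alpha` times its original value, while rapid sample-to-sample changes pass through largely intact.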

Paper Structure

This paper contains 29 sections, 40 equations, 2 figures, and 2 tables.

Figures (2)

  • Figure 1: Architecture of the closed condition model.
  • Figure 2: The KAN-HS-GAL Operation.