Hybrid Audio Detection Using Fine-Tuned Audio Spectrogram Transformers: A Dataset-Driven Evaluation of Mixed AI-Human Speech

Kunyang Huang, Bin Hu

TL;DR

This work addresses the rising threat of hybrid voice spoofing in automatic speaker verification by introducing HSAD, a dataset containing genuine, AI-generated, cloned, and hybrid speech under both clean and degraded conditions. It shows that standard AST pretraining, even when strong, struggles with complex mixtures, while fine-tuning on HSAD yields about 97% accuracy and 99% F1, with substantial reductions in false positives and negatives. The proposed approach leverages two AST variants pretrained on broad and domain-specific spoof data, adapting them via ViT-inspired transfer learning to detect hybrid audio with high reliability. Overall, HSAD provides a realistic benchmark and demonstrates the practical value of dataset-specific adaptation for robust, real-world voice authentication systems.

Abstract

The rapid advancement of artificial intelligence (AI) has enabled sophisticated audio generation and voice cloning technologies, posing significant security risks for applications reliant on voice authentication. While existing datasets and models primarily focus on distinguishing between human and fully synthetic speech, real-world attacks often involve audio that combines both genuine and cloned segments. To address this gap, we construct a novel hybrid audio dataset incorporating human, AI-generated, cloned, and mixed audio samples. We further propose fine-tuned Audio Spectrogram Transformer (AST)-based models tailored for detecting these complex acoustic patterns. Extensive experiments demonstrate that our approach significantly outperforms existing baselines in mixed-audio detection, achieving 97% classification accuracy. Our findings highlight the importance of hybrid datasets and tailored models in advancing the robustness of speech-based authentication systems.
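
The accuracy and F1 figures quoted above are derived from confusion-matrix counts, as are the false-positive and false-negative reductions mentioned in the TL;DR. As a minimal sketch of how these quantities relate, the snippet below computes them for a binary spoof-vs-genuine split. The function name `binary_metrics` and the counts in the usage lines are hypothetical and are not taken from the paper, which evaluates more than two classes.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from binary confusion counts.

    tp: spoofed clips flagged as spoofed
    fp: genuine clips flagged as spoofed (false positives)
    fn: spoofed clips accepted as genuine (false negatives)
    tn: genuine clips accepted as genuine
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts for illustration only (not results from the paper).
metrics = binary_metrics(tp=90, fp=5, fn=10, tn=95)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because F1 ignores true negatives while accuracy counts them, the two numbers can differ noticeably on imbalanced data, which is one reason both are usually reported together.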

Paper Structure

This paper contains 34 sections, 2 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Our evaluation framework in a smart home audio token system. It effectively prevents unauthorized access by detecting non-authentic voices, including AI-synthesized, AI-cloned, and mixed human-AI speech, thereby ensuring secure control over smart home environments.
  • Figure 2: Construction process of the hybrid sentence dataset, encompassing recording, cloning, synthesis, and composite generation.
  • Figure 3: Architectural framework of the baseline Audio Spectrogram Transformer (AST) model.
  • Figure 4: Confusion matrix – MIT fine-tuned (Model A)
  • Figure 5: Confusion matrix – MattyB95 fine-tuned (Model B)
  • ...and 1 more figure