MultiAPI Spoof: A Multi-API Dataset and Local-Attention Network for Speech Anti-spoofing Detection

Xueping Zhang, Zhenshan Zhang, Yechen Wang, Linxi Li, Liwei Jin, Ming Li

TL;DR

To address the mismatch between benchmarks and real-world API-based spoofing, the authors release MultiAPI Spoof, a dataset of ~230 hours of spoofed speech from 30 APIs, plus an API tracing task. They propose Nes2Net-LA, a local-attention enhanced variant of Nes2Net, improving local context modeling and discriminative spoofing features, achieving state-of-the-art results. Experimental results show that training with MultiAPI Spoof improves cross-domain robustness even on unseen APIs, and Nes2Net-LA consistently outperforms Nes2Net-X. The API tracing setup reveals strong performance on seen APIs but remaining challenges for zero-shot attribution, guiding future invariant representations.

Abstract

Existing speech anti-spoofing benchmarks rely on a narrow set of public models, creating a substantial gap from real-world scenarios in which commercial systems employ diverse, often proprietary APIs. To address this issue, we introduce MultiAPI Spoof, a multi-API audio anti-spoofing dataset comprising about 230 hours of synthetic speech generated by 30 distinct APIs, including commercial services, open-source models, and online platforms. Based on this dataset, we define the API tracing task, enabling fine-grained attribution of spoofed audio to its generation source. We further propose Nes2Net-LA, a local-attention enhanced variant of Nes2Net that improves local context modeling and fine-grained spoofing feature extraction. Experiments show that Nes2Net-LA achieves state-of-the-art performance and offers superior robustness, particularly under diverse and unseen spoofing conditions. Code (https://github.com/XuepingZhang/MultiAPI-Spoof) and dataset (https://xuepingzhang.github.io/MultiAPI-Spoof-Dataset/) have been released.

Paper Structure

This paper contains 15 sections, 2 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Overall architecture of the Nes2Net-X and proposed Nes2Net-LA frameworks. The model first extracts high-dimensional representations from the input audio and then processes them using nested multi-scale feature fusion. Nes2Net-LA further enhances cross-block interactions through a sliding-window local attention mechanism.
  • Figure 2: Scoreq distribution comparison across datasets. The dashed vertical line in each curve marks the peak density value.
  • Figure 3: t-SNE visualization of XLSR-extracted embeddings for the MultiAPI Spoof eval set. Unseen APIs are A24-A29.
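
The Figure 1 caption mentions a sliding-window local attention mechanism. The sketch below illustrates the general idea only and is not the authors' Nes2Net-LA implementation: it assumes a single attention head, identity query/key/value projections, and a symmetric window of hypothetical half-width `window` over a sequence of feature vectors. The function name `local_attention` is likewise an illustrative choice.

```python
import math


def local_attention(x, window):
    """Sliding-window self-attention over a sequence of feature vectors.

    x      : list of T vectors, each a list of d floats
    window : half-width; position t attends only to [t - window, t + window]

    Assumption: queries, keys and values are the inputs themselves
    (no learned projections), which keeps the sketch dependency-free.
    """
    T = len(x)
    d = len(x[0])
    out = []
    for t in range(T):
        lo = max(0, t - window)          # clip the window at the sequence start
        hi = min(T, t + window + 1)      # clip the window at the sequence end
        # Scaled dot-product scores against neighbours inside the window only.
        scores = [
            sum(a * b for a, b in zip(x[t], x[s])) / math.sqrt(d)
            for s in range(lo, hi)
        ]
        # Numerically stable softmax over the window.
        m = max(scores)
        exps = [math.exp(v - m) for v in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Weighted sum of the neighbouring vectors.
        out.append([
            sum(w * x[s][k] for w, s in zip(weights, range(lo, hi)))
            for k in range(d)
        ])
    return out


frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
mixed = local_attention(frames, window=1)
```

Because each position only looks at `2 * window + 1` neighbours, the cost grows linearly with sequence length rather than quadratically, and a change far outside the window leaves a given output untouched.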