Nes2Net: A Lightweight Nested Architecture for Foundation Model Driven Speech Anti-spoofing

Tianchi Liu, Duc-Tuan Truong, Rohan Kumar Das, Kong Aik Lee, Haizhou Li

TL;DR

Speech foundation models yield rich but high-dimensional features that challenge downstream anti-spoofing classifiers. Nes2Net introduces a DR-free, nested Res2Net back-end that processes high-dimensional features directly, with Nes2Net-X offering learnable feature fusion to further boost expressiveness. Across CtrSVDD, ASVspoof 2021, ASVspoof 5, In-the-Wild, and PartialSpoof, Nes2Net variants demonstrate superior accuracy, robustness, and efficiency, including significant MACs reductions and parameter savings relative to DR-based baselines. The approach reduces information loss from early projection, enhances cross-scale and cross-channel interactions, and provides a practical, publicly available solution for foundation-model–driven speech anti-spoofing. These results suggest Nes2Net as a versatile back-end option for real-time and resource-constrained deployments in security-sensitive speech applications.

Abstract

Speech foundation models have significantly advanced various speech-related tasks by providing exceptional representation capabilities. However, their high-dimensional output features often create a mismatch with downstream task models, which typically require lower-dimensional inputs. A common solution is to apply a dimensionality reduction (DR) layer, but this approach increases parameter overhead and computational cost, and it risks losing valuable information. To address these issues, we propose Nested Res2Net (Nes2Net), a lightweight back-end architecture designed to process high-dimensional features directly, without DR layers. The nested structure enhances multi-scale feature extraction, improves feature interaction, and preserves high-dimensional information. We first validate Nes2Net on CtrSVDD, a singing voice deepfake detection dataset, and report a 22% performance improvement and an 87% back-end computational cost reduction over the state-of-the-art baseline. Additionally, extensive testing across four diverse datasets (ASVspoof 2021, ASVspoof 5, PartialSpoof, and In-the-Wild), covering fully spoofed speech, adversarial attacks, partial spoofing, and real-world scenarios, consistently highlights Nes2Net's superior robustness and generalization capabilities. The code package and pre-trained models are available at https://github.com/Liu-Tianchi/Nes2Net.
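To make the DR overhead mentioned in the abstract concrete, the following minimal sketch counts the parameters and multiply-accumulate operations (MACs) of a single fully connected DR layer applied to every frame. The function name `linear_dr_cost` and the sizes used (1024-dimensional input, 128-dimensional output, 200 frames) are illustrative assumptions, not values taken from the paper.

```python
def linear_dr_cost(in_dim: int, out_dim: int, num_frames: int) -> dict:
    """Cost of one fully connected DR layer applied independently to each frame.

    All sizes passed in are hypothetical; they only illustrate the arithmetic.
    """
    params = in_dim * out_dim + out_dim   # weight matrix plus bias vector
    macs = in_dim * out_dim * num_frames  # one multiply-accumulate per weight per frame
    return {"params": params, "macs": macs}


# Hypothetical example: 1024-dim foundation-model features projected to 128 dims.
cost = linear_dr_cost(in_dim=1024, out_dim=128, num_frames=200)
print(cost)
```

A DR-free back-end avoids this projection entirely, though whether the total cost drops depends on how the remaining layers handle the wider input.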

Paper Structure

This paper contains 25 sections, 4 equations, 4 figures, 11 tables.

Figures (4)

  • Figure 1: Block diagram of the speech foundation model-based speech anti-spoofing system, showing both the traditional back-end models and the proposed Nes2Net back-end. The traditional back-end models include a DR layer and a classifier, such as ResNet, Res2Net, ECAPA-TDNN, or AASIST. In contrast, the proposed Nes2Net back-end model features a DR layer-free design. Additionally, an enhanced version of its nested layer, named Nes2Net-X, is introduced to further improve performance. Abbreviations used in the figure: 'FC' (fully connected layer), 'Conv' (convolutional layer), 'WS' (weighted sum), 'SE' (squeeze-and-excitation module), and 'Att. Stat. Pool.' (attentive statistics pooling; Okabe et al., 2018).
  • Figure 2: The cyclic learning rate schedule using cosine annealing.
  • Figure 3: Visualization of Tables \ref{tab_SVDD} and \ref{tab_roadmap}, highlighting our exploration of Res2Net and the roadmap of architectural changes leading to Nes2Net.
  • Figure 4: Visualization of the EER (%) across various vocoders and compression conditions on the ASVspoof 2021 DF test set. Each EER value is shown as a colored circle, where the size indicates the EER value and the color represents the performance ranking among the five models, from blue (best) to light red (worst). The five EER values for each sub-item, from left to right, correspond to the proposed Nes2Net-X, Mamba, SLS, TCM, and AASIST.
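
The cyclic cosine-annealing schedule named in the Figure 2 caption can be sketched as follows. This is a generic formulation of cosine annealing with restarts, not necessarily the exact schedule used in the paper; the function name `cyclic_cosine_lr`, the cycle length, and the learning-rate bounds are all hypothetical.

```python
import math


def cyclic_cosine_lr(step: int, cycle_len: int, lr_max: float, lr_min: float) -> float:
    """Learning rate at `step` for a cosine schedule that restarts every `cycle_len` steps.

    Within a cycle the rate decays from lr_max toward lr_min along half a cosine
    wave, then jumps back to lr_max at the start of the next cycle.
    """
    pos = step % cycle_len  # position inside the current cycle
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * pos / cycle_len))


# Hypothetical settings: 10-step cycles between 1e-3 and 1e-5.
schedule = [cyclic_cosine_lr(s, cycle_len=10, lr_max=1e-3, lr_min=1e-5) for s in range(30)]
```

Because the position is taken modulo the cycle length, the rate at the last step of a cycle is close to, but not exactly, `lr_min`.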