
Temporal-Channel Modeling in Multi-head Self-Attention for Synthetic Speech Detection

Duc-Tuan Truong, Ruijie Tao, Tuan Nguyen, Hieu-Thi Luong, Kong Aik Lee, Eng Siong Chng

TL;DR

This work identifies a gap in MHSA-based synthetic speech detectors: MHSA learns temporal relationships between input tokens but neglects temporal-channel dependencies. It introduces Temporal-Channel Modeling (TCM), which replaces the MHSA in each Conformer block of the XLSR-Conformer with a three-part module that generates head tokens, applies MHSA to temporal-channel tokens, and enriches the classification token with temporal and head-token information. With only about 0.03M extra parameters, TCM yields competitive or superior results on ASVspoof 2021 LA/DF, including a 9.25% improvement in EER over the state-of-the-art system and state-of-the-art performance on the DF track, while maintaining robustness across architectures. The ablation study indicates that using both temporal and channel information gives the largest gain, which supports explicitly modeling temporal-channel interactions when detecting synthetic speech artifacts.

Abstract

Recent synthetic speech detectors leveraging the Transformer model have superior performance compared to their convolutional neural network counterparts. This improvement could be due to the powerful modeling ability of the multi-head self-attention (MHSA) in the Transformer model, which learns the temporal relationship of each input token. However, artifacts of synthetic speech can be located in specific regions of both frequency channels and temporal segments, while MHSA neglects this temporal-channel dependency of the input sequence. In this work, we propose a Temporal-Channel Modeling (TCM) module to enhance MHSA's capability for capturing temporal-channel dependencies. Experimental results on ASVspoof 2021 show that, with only 0.03M additional parameters, the TCM module can outperform the state-of-the-art system by 9.25% in EER. A further ablation study reveals that utilizing both temporal and channel information yields the most improvement for detecting synthetic speech.

Paper Structure (17 sections, 1 equation, 1 figure, 4 tables)

This paper contains 17 sections, 1 equation, 1 figure, 4 tables.

Figures (1)

  • Figure 1: The overall architecture of the baseline XLSR-Conformer and our proposed temporal-channel modeling (TCM) module. The TCM module is used to replace the multi-head self-attention (MHSA) of each Conformer block in the baseline XLSR-Conformer. The TCM module architecture includes three main parts: Head Token Generation, Multi-Head Self-Attention, and Classification Token Enrichment. The objective of TCM is to generate the head token for channel information and then integrate the temporal-channel dependency into the original temporal tokens for better synthetic speech detection.
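
The three parts named in the Figure 1 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The summary above does not say how head tokens are generated or how the classification token is enriched, so the sketch assumes that head tokens come from splitting the channels into heads, averaging each head over time, and applying a linear projection, and that enrichment adds the averaged temporal and head outputs to the classification token. The names `tcm_sketch`, `init_params`, `w_head`, and the convention that row 0 holds the classification token are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def init_params(d, num_heads, seed=0):
    # Random weights for illustration only (hypothetical parameter names).
    rng = np.random.default_rng(seed)
    scale = 1.0 / np.sqrt(d)
    params = {name: rng.normal(0.0, scale, size=(d, d))
              for name in ("w_q", "w_k", "w_v", "w_o")}
    # Assumed projection that lifts a per-head channel summary back to D dims.
    params["w_head"] = rng.normal(0.0, scale, size=(d // num_heads, d))
    return params


def tcm_sketch(tokens, num_heads, params):
    """tokens: (1 + T, D); row 0 is assumed to be the classification token."""
    n, d = tokens.shape
    assert d % num_heads == 0, "D must be divisible by the number of heads"
    d_head = d // num_heads
    temporal = tokens[1:]                                    # (T, D)

    # 1) Head Token Generation (assumed form): one token per head, built from
    #    that head's channel slice averaged over time, then projected to D.
    per_head = temporal.reshape(-1, num_heads, d_head).mean(axis=0)  # (H, D/H)
    head_tokens = per_head @ params["w_head"]                # (H, D)

    # 2) Multi-Head Self-Attention over classification, temporal, and head tokens.
    seq = np.concatenate([tokens, head_tokens], axis=0)      # (1 + T + H, D)

    def split(x):
        # (N, D) -> (H, N, D/H)
        return x.reshape(x.shape[0], num_heads, d_head).transpose(1, 0, 2)

    q = split(seq @ params["w_q"])
    k = split(seq @ params["w_k"])
    v = split(seq @ params["w_v"])
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head))  # (H, N, N)
    out = (attn @ v).transpose(1, 0, 2).reshape(seq.shape[0], d)
    out = out @ params["w_o"]                                # (1 + T + H, D)

    # 3) Classification Token Enrichment (assumed form): add the averaged
    #    temporal and head outputs to the classification token.
    cls_out, temporal_out, head_out = out[:1], out[1:n], out[n:]
    cls_enriched = (cls_out
                    + temporal_out.mean(axis=0, keepdims=True)
                    + head_out.mean(axis=0, keepdims=True))

    # Head tokens are dropped so the output matches the input length.
    return np.concatenate([cls_enriched, temporal_out], axis=0)  # (1 + T, D)
```

Under these assumptions the only weights added on top of plain MHSA are those in `w_head`, and the output has the same shape as the input, so a block like this could stand in for MHSA inside a Conformer block without changing the surrounding layers.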