Gradient weighting for speaker verification in extremely low Signal-to-Noise Ratio

Yi Ma, Kong Aik Lee, Ville Hautamäki, Meng Ge, Haizhou Li

TL;DR

This work tackles speaker verification under extremely low SNR by introducing Gradient Weighting (Grad-W), a framework that uses gradient-based artifact-noise detection to guide a denoising enhancement network. A fixed pre-trained speaker model ${\mathcal E}$ guides a U-Net enhancement network ${\mathcal G}$ that produces a mask $M(t,f)$, which is applied to the input spectrogram $X(t,f)$ to form the enhanced spectrogram $E(t,f)=M(t,f)X(t,f)$. A gradient-based measure identifies artifact-prone time-frequency bins and assigns them per-bin weights $P_{t,f}$, which enter a weighted $L_1$ loss between clean and enhanced activation maps. Empirical results on VoxCeleb2/Vox1-O with MUSAN/RIR augmentation show that Grad-W consistently improves EER and minDCF across a wide range of SNRs, especially below 0 dB, outperforming baselines and ablations. The approach offers a practical path to noise-robust speaker verification by explicitly targeting and suppressing artifact noise during enhancement. The work also suggests future directions: using gradients to mitigate conflicts between multiple losses and to extract downstream-relevant signal more effectively.
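As a rough illustration of the two quantities named above, the masking step $E(t,f)=M(t,f)X(t,f)$ and a per-bin weighted $L_1$ loss can be sketched in NumPy. The function names `apply_mask` and `weighted_l1`, the normalization by the sum of the weights, and the toy values are assumptions made for this sketch; the summary does not specify them. In the paper the loss compares activation maps from the speaker network, whereas here the arrays simply stand in for such maps.

```python
import numpy as np

def apply_mask(mask, spec):
    """Element-wise masking: E(t, f) = M(t, f) * X(t, f)."""
    return mask * spec

def weighted_l1(clean_act, enhanced_act, weights):
    """Weighted L1 distance between two maps of the same shape.

    Dividing by the sum of the weights is an assumption of this sketch,
    chosen so that uniform weights reduce to the plain mean absolute error.
    """
    diff = np.abs(clean_act - enhanced_act)
    return float(np.sum(weights * diff) / np.sum(weights))

# Toy example: 2 time frames x 3 frequency bins (values are made up).
X = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])   # input spectrogram
M = np.array([[1.0, 0.5, 0.0], [0.5, 1.0, 0.5]])   # mask from the enhancer
E = apply_mask(M, X)                                # enhanced spectrogram

uniform = np.ones_like(X)          # every bin counts equally
emphasized = uniform.copy()
emphasized[0, 2] = 3.0             # up-weight one bin flagged as artifact-prone

loss_uniform = weighted_l1(X, E, uniform)
loss_emphasized = weighted_l1(X, E, emphasized)
```

With uniform weights the loss is the ordinary mean absolute error; raising the weight of a bin with a large error increases that bin's share of the loss, which is the lever the per-bin weights $P_{t,f}$ provide.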

Abstract

Speaker verification is hampered by background noise, particularly at extremely low Signal-to-Noise Ratio (SNR), below 0 dB. It is difficult to suppress noise without introducing unwanted artifacts, which adversely affect speaker verification. We propose a mechanism called Gradient Weighting (Grad-W), which dynamically identifies and reduces artifact noise during prediction. The mechanism relies on the property that the gradient indicates which parts of the input the model is paying attention to. Specifically, when the speaker network focuses on a region in the denoised utterance but not on the corresponding region of the clean utterance, we treat that region as artifact noise and assign it a higher weight when optimizing the enhancement model. We validate the approach by training an enhancement model and testing the enhanced utterances on speaker verification. The experimental results show that our approach effectively reduces artifact noise, improving speaker verification across various SNR levels.
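The weighting idea in the abstract can be sketched as follows. This is a minimal illustration and not the paper's formula: the function name `artifact_weights`, the use of gradient magnitude as the attention proxy, the rectified difference, the peak normalization, and the `1 +` offset are all assumptions chosen only to match the verbal description, namely more weight where the speaker network attends to the denoised utterance but not to the clean one.

```python
import numpy as np

def artifact_weights(grad_clean, grad_enhanced):
    """Per-bin weights that grow where attention appears only after enhancement.

    Assumed recipe (hypothetical, for illustration):
      1. use |gradient| as a proxy for how much the model attends to a bin,
      2. keep only the excess attention on the enhanced input,
      3. scale the excess to [0, 1] and add 1 so no bin is ignored.
    """
    excess = np.maximum(np.abs(grad_enhanced) - np.abs(grad_clean), 0.0)
    peak = excess.max()
    if peak > 0:
        excess = excess / peak
    return 1.0 + excess

# Toy gradients on a 2 x 2 time-frequency grid (values are made up).
grad_clean = np.array([[0.2, 0.0], [0.5, 0.1]])
grad_enhanced = np.array([[0.2, 0.8], [0.1, 0.1]])

weights = artifact_weights(grad_clean, grad_enhanced)
```

In this toy case only the bin at position (0, 1) receives attention on the enhanced input without matching attention on the clean input, so it is the only bin whose weight rises above 1. Bins where the clean input draws equal or more attention keep the baseline weight.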

Paper Structure (14 sections, 6 equations, 2 figures, 2 tables)

Figures (2)

  • Figure 1: To optimize the enhancement model (a), we generate activation maps and gradients from a pre-trained and fixed speaker network in (c). The activation map and gradients are used in computing the loss in (b).
  • Figure 2: Comparison of enhanced utterances generated by Equal-W and Grad-W. Figure best viewed in color.