Deep Neural Networks for Automatic Speaker Recognition Do Not Learn Supra-Segmental Temporal Features

Daniel Neururer, Volker Dellwo, Thilo Stadelmann

TL;DR

This paper asks whether deep neural networks for speaker recognition exploit supra-segmental temporal features (SST) or rely mainly on frame-based acoustic information (FBA). It introduces a time-scrambling test to quantify SST exploitation and evaluates multiple CNN/RNN/ResNet models on TIMIT. The test shows that SST contributions are minimal: the networks achieve competitive performance without SST, a phenomenon dubbed "deep cheating". To push networks toward SST usage, the authors explore two regularization strategies: increasing task difficulty with VoxCeleb, and reducing FBA discriminability via spectral equalization. Both yield limited or inconsistent SST uptake. The findings point toward improved explainability and better SST exploitation, and suggest future work on inductive biases and pre-training (e.g., transformers) to integrate dynamic features more effectively.

Abstract

While deep neural networks have shown impressive results in automatic speaker recognition and related tasks, it is dissatisfactory how little is understood about what exactly is responsible for these results. Part of the success has been attributed in prior work to their capability to model supra-segmental temporal information (SST), i.e., learn rhythmic-prosodic characteristics of speech in addition to spectral features. In this paper, we (i) present and apply a novel test to quantify to what extent the performance of state-of-the-art neural networks for speaker recognition can be explained by modeling SST; and (ii) present several means to force respective nets to focus more on SST and evaluate their merits. We find that a variety of CNN- and RNN-based neural network architectures for speaker recognition do not model SST to any sufficient degree, even when forced. The results provide a highly relevant basis for impactful future research into better exploitation of the full speech signal and give insights into the inner workings of such networks, enhancing explainability of deep learning for speech technologies.

Paper Structure (10 sections, 2 figures, 3 tables)

Figures (2)

  • Figure 1: Segment creation with and without SST: Starting (a) from a spectrogram per varying-length utterance, we extract segments of fixed length $t$ from a starting point according to $3$ segment-drawing strategies as follows. Original Segment (OS): just cut out (b) the respective part; Shuffled within Segment (SS): additionally, shuffle (c) the columns of the previous output; Shuffled within Utterance (SU): globally shuffle all columns (d) prior to cutting (e).
  • Figure 2: Visualization of FBA equalization: Compressed spectrograms of the same sentence (SA1) of a male (a: MDAB0) and female (b: FCJF0) TIMIT speaker with derived synthesized (a1/b1) and noise-vocoded (a2/b2) variants.
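
The three segment-drawing strategies in the Figure 1 caption can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `draw_segment`, its parameters, and the representation of a spectrogram as a plain list of frames (one list of frequency-bin values per time step) are assumptions made here for readability; an array library would normally be used in practice.

```python
import random
from typing import List, Optional, Sequence

# Assumption: one spectrogram column = all frequency bins at one time step.
Frame = List[float]


def draw_segment(
    frames: Sequence[Frame],
    start: int,
    length: int,
    strategy: str = "OS",
    rng: Optional[random.Random] = None,
) -> List[Frame]:
    """Cut a fixed-length segment from an utterance (hypothetical helper).

    strategy:
      "OS" - Original Segment: cut out the frames unchanged.
      "SS" - Shuffled within Segment: cut first, then shuffle the columns.
      "SU" - Shuffled within Utterance: shuffle all columns, then cut.
    """
    if rng is None:
        rng = random.Random()
    if start < 0 or length < 0 or start + length > len(frames):
        raise ValueError("segment exceeds utterance bounds")

    if strategy == "OS":
        return list(frames[start:start + length])

    if strategy == "SS":
        segment = list(frames[start:start + length])
        rng.shuffle(segment)  # temporal order destroyed inside the segment only
        return segment

    if strategy == "SU":
        shuffled = list(frames)  # copy so the caller's utterance is untouched
        rng.shuffle(shuffled)    # temporal order destroyed across the utterance
        return shuffled[start:start + length]

    raise ValueError(f"unknown strategy: {strategy!r}")


if __name__ == "__main__":
    # Toy "utterance": 20 frames with a single frequency bin each.
    utterance = [[float(i)] for i in range(20)]
    for name in ("OS", "SS", "SU"):
        print(name, draw_segment(utterance, 5, 8, name, random.Random(0)))
```

Under this sketch, OS and SS segments contain the same frames and differ only in their order, while an SU segment may contain frames from anywhere in the utterance. Comparing a model's accuracy across the three variants is the idea behind the time-scrambling test described in the TL;DR: if accuracy barely changes once the frame order is destroyed, the model cannot be relying much on SST.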