Koopman Regularized Deep Speech Disentanglement for Speaker Verification

Nikos Chazaridis; Mohammad Belal; Rafael Mestre; Timothy J. Norman; Christine Evers

TL;DR

This work proposes the Deep Koopman Speech Disentanglement Autoencoder (DKSD-AE), a structured autoencoder that combines a novel multi-step Koopman operator learning module with instance normalization to disentangle speaker and content dynamics. The results suggest that Koopman-based temporal modelling, when combined with instance normalization, provides an efficient and principled solution for speaker-focused representation learning.

Abstract

Human speech contains both linguistic content and speaker-dependent characteristics, making speaker verification a key technology in identity-critical applications. Modern deep learning speaker verification systems aim to learn speaker representations that are invariant to semantic content and nuisance factors such as ambient noise. However, many existing approaches depend on labelled data, textual supervision, or large pretrained models as feature extractors, which limits scalability and practical deployment and raises sustainability concerns. We propose the Deep Koopman Speech Disentanglement Autoencoder (DKSD-AE), a structured autoencoder that combines a novel multi-step Koopman operator learning module with instance normalization to disentangle speaker and content dynamics. Quantitative experiments across multiple datasets demonstrate that DKSD-AE achieves improved or competitive speaker verification performance compared to state-of-the-art baselines while maintaining high content EER, confirming effective disentanglement. These results are obtained with substantially fewer parameters and without textual supervision. Moreover, performance remains stable under increased evaluation scale, highlighting representation robustness and generalization. Our findings suggest that Koopman-based temporal modelling, when combined with instance normalization, provides an efficient and principled solution for speaker-focused representation learning.
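The abstract reports results as equal error rate (EER), the operating point at which the false acceptance rate equals the false rejection rate. The following is a minimal sketch of how an EER can be computed from verification trial scores. It is not the paper's evaluation code: the function name `equal_error_rate` and the simple threshold sweep are illustrative assumptions. Under the reading that "content EER" is the speaker EER obtained when trials are scored with the content representation, a high value would indicate that the content representation carries little speaker information.

```python
import numpy as np


def equal_error_rate(scores, labels):
    """Approximate EER by sweeping every observed score as a threshold.

    scores: similarity score per trial (higher means "same speaker").
    labels: 1/True for target (same-speaker) trials, 0/False for impostors.
    Hypothetical helper for illustration, not the authors' implementation.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    thresholds = np.unique(scores)

    # False acceptance rate: impostor trials scored at or above the threshold.
    far = np.array([(scores[~labels] >= t).mean() for t in thresholds])
    # False rejection rate: target trials scored below the threshold.
    frr = np.array([(scores[labels] < t).mean() for t in thresholds])

    # Take the threshold where the two error rates are closest.
    idx = np.argmin(np.abs(far - frr))
    return float((far[idx] + frr[idx]) / 2.0)


if __name__ == "__main__":
    # Toy trials: two target scores followed by two impostor scores.
    print(equal_error_rate([0.9, 0.8, 0.1, 0.2], [1, 1, 0, 0]))
```

A threshold sweep over observed scores is the simplest choice; evaluation toolkits typically interpolate the ROC curve instead, which gives a smoother estimate on small trial lists.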

Paper Structure (22 sections, 21 equations, 4 figures, 5 tables)

Introduction
Related Work
Disentangled Speaker Representations
Operator Theoretic Representations
Preliminaries
Koopman Operator Theory
Autoencoders for Finite Koopman Approximation
Proposed Method
Koopman Operator Learning for Disentanglement
Dynamics Encoder
Content Encoder
Decoder
Masked Augmentation to Capture Intra-Speaker Variation
Experimental Setup
Training
...and 7 more sections

Figures (4)

Figure 1: The DKSD-AE framework architecture. Speech utterances are processed with voice activity detection (VAD), and mel-spectrograms are then extracted. The input mel-spectrograms $\mathbf{X}$ are fed to the dynamics encoder $f_\text{dyn}$ to learn the Koopman operator $\mathbf{K}$ and the speaker identity representation $\mathbf{Z}_s$. Concurrently, $\mathbf{X}$ is fed to the content encoder $f_c$ to learn a content representation $\mathbf{Z}_c$ via instance normalization. Finally, $\mathbf{Z}_s$ and $\mathbf{Z}_c$ are concatenated and fed to the decoder $q_\text{dec}$ to generate reconstructed mel-spectrograms $\mathbf{\widehat{X}}$.
Figure 2: Processing steps to estimate the multi-step prediction loss $\mathcal{L}_{\text{pred}}$. From top to bottom: first, a prefix $\mathbf{Z}_s^{\mathrm{pre}}$ is extracted from the full dynamics representation $\mathbf{Z}_s$. Then $\mathbf{Z}_s^{\mathrm{pre}}$ is split into two successive time-shifted views, which are used to solve the Koopman operator regression shown in Eq. \ref{regularized_K_eq}.
Figure 3: Visual comparison of speaker representations $\mathbf{Z}_s$ (left) and content representations $\mathbf{Z}_c$ (right) after dimensionality reduction with PCA and t-SNE (van der Maaten and Hinton, 2008). Representations $\mathbf{Z}_s$ of the same speaker form compact, well-separated clusters in 2D, whereas content representations $\mathbf{Z}_c$ are dispersed without speaker-specific grouping. Different colours correspond to different speaker classes.
Figure 4: Our multi-step Koopman operator learning module improves speaker EER when the forecasting horizon $M \in [5,15]$ for VCTK and, less markedly, when $M \in [5,13]$ for TIMIT. This effective range of $M$ corresponds to a forecasting horizon between $5\%$ and $10\%$ of the initial speech utterance duration.
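The procedure outlined in the Figure 2 caption (extract a prefix, split it into two time-shifted views, solve a regularized regression for the Koopman operator, then forecast over a horizon $M$) can be sketched roughly as below. This is a hedged illustration only: the exact regression in the paper's equation is not reproduced here, so the sketch assumes a standard ridge-regularized least-squares estimate, assumes the last $M$ frames of $\mathbf{Z}_s$ serve as forecasting targets, and uses a mean squared error. The names `fit_koopman`, `multi_step_prediction_loss`, and `ridge` are hypothetical.

```python
import numpy as np


def fit_koopman(z_pre, ridge=1e-3):
    """Ridge least-squares estimate of K with z[t + 1] approx. z[t] @ K.

    z_pre: array of shape (T, d), the prefix of the dynamics representation.
    The ridge term is an assumed form of regularization.
    """
    # Two time-shifted views of the same prefix.
    z_past, z_next = z_pre[:-1], z_pre[1:]
    d = z_pre.shape[1]
    gram = z_past.T @ z_past + ridge * np.eye(d)
    # Solve (Z_past^T Z_past + ridge * I) K = Z_past^T Z_next.
    return np.linalg.solve(gram, z_past.T @ z_next)


def multi_step_prediction_loss(z_s, horizon, ridge=1e-3):
    """Fit K on the prefix, roll it forward `horizon` steps, score the forecast.

    Returns (mean squared error against the held-out frames, fitted K).
    """
    z_pre, z_future = z_s[:-horizon], z_s[-horizon:]
    k = fit_koopman(z_pre, ridge)

    state = z_pre[-1]
    preds = []
    for _ in range(horizon):
        state = state @ k  # one application of K per forecast step
        preds.append(state)

    loss = float(np.mean((np.stack(preds) - z_future) ** 2))
    return loss, k


if __name__ == "__main__":
    # Toy latent sequence: 120 frames of 8-dimensional features.
    rng = np.random.default_rng(0)
    z_s = rng.standard_normal((120, 8))
    loss, k = multi_step_prediction_loss(z_s, horizon=10)
    print(k.shape, loss)
```

Solving the linear system with `np.linalg.solve` avoids forming an explicit matrix inverse. In a training loop the same computation would be written in a differentiable framework so that the prediction loss can shape the encoder.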
