M-Vec: Matryoshka Speaker Embeddings with Flexible Dimensions

Shuai Wang; Pengcheng Zhu; Haizhou Li

TL;DR

This work tackles the inefficiency of fixed-dimensional speaker embeddings by applying Matryoshka Representation Learning (MRL) to speaker modeling, producing nested embeddings whose sub-dimensions can be extracted at inference without retraining. By jointly training multiple sub-dimensions under an AAM-Softmax framework, the method maintains discriminability even at very low dimensions (e.g., $8$ or $16$), as demonstrated on VoxCeleb data. The key contributions are an MRL loss that aggregates the losses across dimensions and a practical demonstration that storage and retrieval time can be reduced significantly with minimal performance loss, making large-scale speaker databases more scalable. The approach is compatible with existing encoders and can be extended to other tasks that require flexible embedding dimensionality.
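The joint training described above can be sketched as a sum of AAM-Softmax losses, one per nested prefix of the embedding. This is a minimal pure-Python illustration for a single sample, not the authors' implementation. The function names, the margin and scale values, the per-dimension loss weights, and the use of a separate classifier head for each sub-dimension are all assumptions made for the example.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def aam_softmax_loss(embedding, class_weights, label, margin=0.2, scale=32.0):
    """Additive angular margin softmax loss for one sample.

    margin and scale are illustrative values, not taken from the paper.
    """
    e = l2_normalize(embedding)
    logits = []
    for k, w in enumerate(class_weights):
        cos = sum(a * b for a, b in zip(e, l2_normalize(w)))
        cos = max(-1.0, min(1.0, cos))  # guard acos against rounding error
        if k == label:
            # Add the angular margin to the target class only.
            cos = math.cos(math.acos(cos) + margin)
        logits.append(scale * cos)
    # Numerically stable cross-entropy: log-sum-exp minus the target logit.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

def matryoshka_loss(embedding, heads, label, dims, weights=None):
    """Aggregate the AAM-Softmax loss over nested prefixes of one embedding.

    heads maps each sub-dimension d to its own class-weight matrix with rows
    of length d (assumption: one classifier head per sub-dimension).
    """
    weights = weights or [1.0] * len(dims)
    return sum(
        w * aam_softmax_loss(embedding[:d], heads[d], label)
        for d, w in zip(dims, weights)
    )

# Toy usage: a 4-dimensional embedding trained at sub-dimensions 2 and 4.
embedding = [0.5, -0.2, 0.1, 0.9]
heads = {
    2: [[1.0, 0.0], [0.0, 1.0]],
    4: [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]],
}
total = matryoshka_loss(embedding, heads, label=0, dims=[2, 4])
```

Because every prefix contributes its own loss term, the first few dimensions are pushed to be discriminative on their own, which is what makes truncation at inference viable.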

Abstract

Fixed-dimensional speaker embeddings have become the dominant approach in speaker modeling, typically spanning hundreds to thousands of dimensions. The embedding dimension is a hyperparameter that is rarely chosen with care, and the individual dimensions are not hierarchically ordered in terms of importance. In large-scale speaker representation databases, reducing the dimensionality of embeddings can significantly lower storage and computational costs. However, directly training low-dimensional representations often yields suboptimal performance. In this paper, we introduce the Matryoshka speaker embedding, a method that allows dynamic extraction of sub-dimensions from the embedding while maintaining performance. Our approach is validated on the VoxCeleb dataset, demonstrating that it can achieve extremely low-dimensional embeddings, such as 8 dimensions, while preserving high speaker verification performance.
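To make the idea of dynamic sub-dimension extraction concrete, the sketch below truncates a stored embedding to its first `dim` values, re-normalizes it, and scores a trial with cosine similarity. It also estimates raw storage for a database of embeddings. The helper names, the choice of cosine scoring, the 4-byte float assumption, and the example sizes are illustrative assumptions rather than details taken from the paper.

```python
import math

def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` values and rescale them to unit length."""
    sub = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in sub)) or 1.0
    return [x / norm for x in sub]

def cosine_score(enroll, test, dim):
    """Cosine similarity between two embeddings, using only `dim` dimensions."""
    a = truncate_and_normalize(enroll, dim)
    b = truncate_and_normalize(test, dim)
    return sum(x * y for x, y in zip(a, b))

def storage_bytes(num_embeddings, dim, bytes_per_value=4):
    """Raw storage for a database of embeddings (assumes 32-bit floats)."""
    return num_embeddings * dim * bytes_per_value

# Hypothetical full-size embeddings; the same vectors serve every sub-dimension.
enroll = [0.9, 0.1, -0.3, 0.2, 0.05, -0.1, 0.4, 0.0]
test = [0.8, 0.2, -0.2, 0.1, -0.3, 0.5, 0.1, 0.2]

score_full = cosine_score(enroll, test, dim=8)
score_small = cosine_score(enroll, test, dim=2)

# Hypothetical database of one million speakers at two embedding sizes.
full_size = storage_bytes(1_000_000, 256)
small_size = storage_bytes(1_000_000, 8)
```

Storage scales linearly with the number of retained dimensions, so in this hypothetical setup the 8-dimensional database is 32 times smaller than the 256-dimensional one, and no second model has to be trained to obtain it.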

Paper Structure (17 sections, 3 equations, 2 figures, 2 tables)

Introduction
Speaker Modeling for Human-Computer Interaction
Background on speaker embedding learning
Extremely Low-Dimensional Embeddings
Contributions
Matryoshka Embedding Learning
Matryoshka Representation
MRL for Speaker Embedding Learning
Experiments
Dataset
Experimental Setups
Results and Analysis
Comparison of Embeddings with Different Dimensions
Extremely Low Dimensional Embeddings
Analysis of Storage and Retrieval Time
...and 2 more sections

Figures (2)

Figure 1: Matryoshka speaker embedding learning, illustrated with three sub-dimensional embeddings
Figure 2: Performance comparison of different systems using different dimensions
