Golden Gemini is All You Need: Finding the Sweet Spots for Speaker Verification

Tianchi Liu, Kong Aik Lee, Qiongqiong Wang, Haizhou Li

TL;DR

The paper challenges the standard equal-stride approach in 2D ResNet for speaker verification and introduces a trellis-based stride search to optimize temporal and frequency resolutions. It defines the Golden-Gemini hypothesis, identifies two optimal stride endpoints on a five-stage trellis, and demonstrates consistent performance gains across VoxCeleb, SITW, and CNCeleb, while reducing model size and FLOPs. The work provides guiding principles for designing ASV models with temporal-first stride configurations and introduces the Gemini DF-ResNet as a new SOTA benchmark. Overall, the method offers a simple yet effective pathway to stronger, more efficient speaker verification systems with potential applicability to related speech tasks.

Abstract

Previous studies demonstrate the impressive performance of residual neural networks (ResNet) in speaker verification. The ResNet models treat the time and frequency dimensions equally. They follow the default stride configuration designed for image recognition, where the horizontal and vertical axes exhibit similarities. This approach ignores the fact that time and frequency are asymmetric in speech representation. In this paper, we address this issue and look for optimal stride configurations specifically tailored for speaker verification. We represent the stride space on a trellis diagram, and conduct a systematic study on the impact of temporal and frequency resolutions on the performance and further identify two optimal points, namely Golden Gemini, which serves as a guiding principle for designing 2D ResNet-based speaker verification models. By following the principle, a state-of-the-art ResNet baseline model gains a significant performance improvement on VoxCeleb, SITW, and CNCeleb datasets with 7.70%/11.76% average EER/minDCF reductions, respectively, across different network depths (ResNet18, 34, 50, and 101), while reducing the number of parameters by 16.5% and FLOPs by 4.1%. We refer to it as Gemini ResNet. Further investigation reveals the efficacy of the proposed Golden Gemini operating points across various training conditions and architectures. Furthermore, we present a new benchmark, namely the Gemini DF-ResNet, using a cutting-edge model.

Paper Structure (20 sections, 5 equations, 5 figures, 9 tables)

Figures (5)

  • Figure 1: The illustration of convolution operations in (a) TDNN, (b) 2D CNN with stride = (1,1), and (c) stride = (2,2). The blue and grey cuboids represent time-frequency bins of feature maps and paddings, respectively.
  • Figure 2: An exemplar trellis diagram. Each node on the trellis diagram represents the time and frequency downsampling factors, $\alpha_{n}$ and $\beta_{n}$, at the output of each stage in a ResNet. Each path represents a stride configuration consisting of five sequential stages, for $n = 1, 2, \ldots, 5$. A node with a circular outer ring indicates that the path remains at the same position by using a stride of (1,1). Dashed arrows represent two alternative options controlled by different stride operations.
  • Figure 3: Trellis diagrams of (a) the strategic search for optimal stride configurations and (b) different paths towards Golden Gemini. The Gemini symbol (♊) in (a) marks the proposed Golden-Gemini stride configurations. Each rectangular box lists, from top to bottom: the downsampling factors ($\alpha_5, \beta_5$), performance in EER (%) on the VoxCeleb-E test set, number of parameters, and FLOPs. The size of the endpoint bubble indicates the performance: the larger the bubble, the better the performance. A node with a circular outer ring indicates that the path remains at the same position by using a stride of (1,1). The solid line represents a stride configuration that prioritizes temporal resolution over frequency resolution, while the dashed line represents the opposite.
  • Figure 4: Performance versus FLOPs and the number of parameters for the different stride configurations in Fig. 3. The color is consistent with Fig. 3 (a). The size of the bubble indicates the performance in EER (%) on the VoxCeleb-E test set: the larger the bubble, the better the performance.
  • Figure 5: Performance and complexity comparison of the proposed Gemini ResNet and the modified ResNet [BUT2019system, wespeaker] with different model sizes on the Vox1-E test set.
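
The trellis described in the captions for Figures 2 and 3 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the `(time, frequency)` ordering of each stride tuple, the set of four per-stage stride options, and both example paths are hypothetical. Only the idea comes from the captions, namely that the node after stage $n$ holds the cumulative downsampling factors ($\alpha_n, \beta_n$) and that a path is a sequence of five per-stage strides.

```python
from itertools import product

# Assumption: each stage chooses one of these strides, written as
# (time_stride, frequency_stride). The captions only name (1,1) and (2,2)
# explicitly; the two asymmetric options are added for illustration.
STRIDE_OPTIONS = [(1, 1), (2, 1), (1, 2), (2, 2)]


def cumulative_factors(strides):
    """Return the trellis nodes [(alpha_n, beta_n)] visited by a stride path.

    alpha_n and beta_n are the running products of the time and frequency
    strides up to and including stage n.
    """
    alpha, beta = 1, 1
    nodes = []
    for time_stride, freq_stride in strides:
        alpha *= time_stride
        beta *= freq_stride
        nodes.append((alpha, beta))
    return nodes


def paths_to(endpoint, num_stages=5):
    """Enumerate every stride path whose final node equals `endpoint`."""
    return [
        path
        for path in product(STRIDE_OPTIONS, repeat=num_stages)
        if cumulative_factors(path)[-1] == endpoint
    ]


# Hypothetical equal-stride path: time and frequency are always downsampled together.
equal_path = [(1, 1), (1, 1), (2, 2), (2, 2), (2, 2)]

# Hypothetical asymmetric path: frequency is downsampled more often than time,
# so more temporal resolution is preserved. Not a configuration from the paper.
asymmetric_path = [(1, 1), (1, 2), (2, 2), (2, 2), (1, 2)]

if __name__ == "__main__":
    print("equal path nodes:     ", cumulative_factors(equal_path))
    print("asymmetric path nodes:", cumulative_factors(asymmetric_path))
    print("paths ending at (8, 8):", len(paths_to((8, 8))))
```

A stride of (1,1) leaves both factors unchanged, which corresponds to the ringed node that "remains at the same position" in the captions. Because many different paths reach the same endpoint, a search over the trellis has to consider both where a path ends and in which order it downsamples, which is the distinction Figure 3 (b) draws between the solid and dashed lines.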