
NeuralMultiling: A Novel Neural Architecture Search for Smartphone based Multilingual Speaker Verification

Aravinda Reddy PN, Raghavendra Ramachandra, K. Sreenivasa Rao, Pabitra Mitra

TL;DR

The paper tackles smartphone-based multilingual speaker verification by automating architecture design with differentiable neural architecture search (NAS), using separate normal and reduction cells to capture speaker features more effectively while keeping the model lightweight. The NeuralMultiling framework defines each neural cell as a 7-node directed acyclic graph (DAG) with a rich set of candidate operations, and uses continuous relaxation and bi-level optimization to derive a discrete architecture; giving the normal and reduction cells different parameters enlarges the search space. Evaluations on the MAVS dataset show that the NAS-derived model significantly outperforms the Autospeech baseline, with about 5–6% lower Equal Error Rate (EER) and fewer parameters, under both language-agnostic and cross-language/cross-device interoperability conditions. The work offers a practical path to deploying multilingual speaker verification on mobile devices with a lightweight, high-accuracy model, and demonstrates the benefit of tailoring NAS to security-focused biometric systems. The NAS framework optimizes the architecture parameters against the validation loss $L_{val}$ while training the network weights to minimize the training loss $L_{train}$, which supports adaptation across languages and devices.
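The bi-level optimization mentioned above can be written compactly. The form below is the standard one used in differentiable NAS and is offered as a sketch, not as the paper's exact equation; the symbols $\alpha$ (architecture parameters), $w$ (network weights), and $w^{*}(\alpha)$ (the weights that are optimal for a given architecture) are notation assumed here, while $L_{val}$ and $L_{train}$ come from the summary.

```latex
\begin{aligned}
\min_{\alpha} \quad & L_{val}\bigl(w^{*}(\alpha),\, \alpha\bigr) \\
\text{s.t.} \quad & w^{*}(\alpha) = \arg\min_{w} \; L_{train}(w,\, \alpha)
\end{aligned}
```

The outer problem picks the architecture on validation data, and the inner problem fits the weights on training data for that architecture.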

Abstract

Multilingual speaker verification introduces the challenge of verifying a speaker in multiple languages. Existing systems were built using i-vector/x-vector approaches along with Bi-LSTMs, which were trained to discriminate speakers irrespective of the language. Instead of exploring the design space manually, we propose a neural architecture search for multilingual speaker verification suitable for mobile devices, called NeuralMultiling. First, our algorithm searches for an optimal combination of operations in the neural cells, with different architectures for normal cells and reduction cells, and then derives a CNN model by stacking the neural cells. Using the derived architecture, we performed two studies on the publicly available Multilingual Audio-Visual Smartphone (MAVS) dataset: 1) a language-agnostic condition and 2) interoperability between languages and devices. The experimental results suggest that the derived architecture significantly outperforms the existing Autospeech method, with a 5–6% reduction in the Equal Error Rate (EER) and fewer model parameters.

Paper Structure (14 sections, 4 equations, 7 figures, 3 tables, 1 algorithm)


Figures (7)

  • Figure 1: Illustration of speech signal and corresponding spectrogram of the different languages uttered by the same subject
  • Figure 2: Depiction of a neural cell. The transitional nodes ($x_2$ to $x_5$) are densely connected during the search process. During architecture derivation, only the two operations with the highest softmax probabilities are retained for the transitional nodes.
  • Figure 3: a) Illustration of neural architecture search; b) illustration of the search space between nodes $u$ and $v$; d) the different architectures obtained for the normal and reduction cells.
  • Figure 4: An overview of continuous relaxation: a) Initial architecture with unknown operations. b) Continuous relaxation of the search space by placing candidate operations on each edge. c) Two-way optimization of the network weights and the probabilities at each node. d) and e) Deriving the final architecture from the learned probabilities for the normal cell and the reduction cell.
  • Figure 5: Normal cell: Architecture derived from our proposed search algorithm
  • ...and 2 more figures
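The captions for Figures 2 and 4 describe two steps: relaxing each edge into a softmax-weighted mixture of candidate operations, and then keeping only the two strongest operations per transitional node. The sketch below illustrates both steps with NumPy. It is a minimal illustration under stated assumptions, not the authors' implementation: the operation names in `OPS`, the helpers `mixed_output` and `derive_node`, and the rule that each incoming edge contributes only its single strongest operation (a common convention in differentiable NAS) are all assumed here.

```python
import numpy as np

# Hypothetical candidate operation names; the paper's actual set is not listed here.
OPS = ["skip", "conv3x3", "conv5x5", "avg_pool"]


def softmax(a):
    """Numerically stable softmax over the last axis."""
    a = np.asarray(a, dtype=float)
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def mixed_output(x, alpha_edge, op_fns):
    """Continuous relaxation: softmax-weighted sum of all candidate ops on one edge."""
    weights = softmax(alpha_edge)
    return sum(w * f(x) for w, f in zip(weights, op_fns))


def derive_node(alpha_node, k=2):
    """Discretize one transitional node.

    alpha_node has shape (num_incoming_edges, num_ops). Each incoming edge
    contributes its single most probable operation (an assumed convention),
    and the k edges with the highest such probability are kept.
    """
    probs = softmax(alpha_node)
    best_op = probs.argmax(axis=1)      # strongest op on each incoming edge
    best_p = probs.max(axis=1)          # its softmax probability
    keep = np.argsort(-best_p)[:k]      # indices of the k strongest edges
    return [(int(e), OPS[int(best_op[e])]) for e in sorted(keep)]


# Toy architecture parameters for a node with three incoming edges.
alpha_node = np.array([
    [0.0, 0.0, 0.0, 0.0],   # edge 0: no preference
    [0.0, 5.0, 0.0, 0.0],   # edge 1: strongly prefers conv3x3
    [3.0, 0.0, 0.0, 0.0],   # edge 2: prefers skip
])
selected = derive_node(alpha_node)
```

With these toy values, `selected` keeps edges 1 and 2 and drops the undecided edge 0. In a real search the `alpha` values would be learned by gradient descent on the validation loss rather than set by hand.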