NeuralMultiling: A Novel Neural Architecture Search for Smartphone based Multilingual Speaker Verification
Aravinda Reddy PN, Raghavendra Ramachandra, K. Sreenivasa Rao, Pabitra Mitra
TL;DR
The paper tackles smartphone-based multilingual speaker verification by automating architecture design with differentiable neural architecture search (NAS), searching separate normal and reduction cells to improve speaker feature capture while keeping the model lightweight. The NeuralMultiling framework defines each neural cell as a 7-node directed acyclic graph (DAG) over a rich set of candidate operations, and uses continuous relaxation with bi-level optimization to derive a discrete architecture. Giving normal and reduction cells their own architecture parameters enlarges the search space. During the search, the architecture parameters are optimized against the validation loss $L_{val}$ while the network weights are trained to minimize the training loss $L_{train}$. Evaluations on the MAVS dataset show that the NAS-derived model significantly outperforms the Autospeech baseline, with about 5–6% lower Equal Error Rate (EER) and fewer parameters, under both language-agnostic and cross-language, cross-device (interoperability) conditions. The work offers a practical path to deploying lightweight, accurate multilingual speaker verification on mobile devices and illustrates the benefit of tailoring NAS to security-focused biometric systems.
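As a rough illustration of the continuous relaxation and discretization steps mentioned above, the following minimal sketch mixes candidate operations on a single edge with softmax weights and then keeps the strongest one. It is not the authors' implementation: the scalar operations in `CANDIDATE_OPS`, the helper names `mixed_op` and `discretize`, and the example values in `normal_alphas` and `reduction_alphas` are all hypothetical, and real cells operate on feature maps with convolution and pooling operations.

```python
import math


def softmax(alphas):
    """Numerically stable softmax over a list of architecture parameters."""
    m = max(alphas)
    exps = [math.exp(a - m) for a in alphas]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scalar operations standing in for conv/pool candidates.
CANDIDATE_OPS = {
    "identity": lambda x: x,
    "double": lambda x: 2.0 * x,
    "zero": lambda x: 0.0,
}


def mixed_op(x, alphas):
    """Continuous relaxation: softmax-weighted sum of all candidate ops on one edge."""
    weights = softmax(alphas)
    return sum(w * op(x) for w, op in zip(weights, CANDIDATE_OPS.values()))


def discretize(alphas):
    """Derive the discrete choice: keep the op with the largest architecture weight."""
    names = list(CANDIDATE_OPS)
    best = max(range(len(alphas)), key=lambda i: alphas[i])
    return names[best]


# Assumed example values: normal and reduction cells keep separate parameters.
normal_alphas = [0.1, 1.5, -0.3]
reduction_alphas = [2.0, 0.0, -1.0]

print(discretize(normal_alphas), discretize(reduction_alphas))
```

Because the two cell types hold separate parameter vectors, they can settle on different operations for the same edge, which is the sense in which the search space grows. In an actual bi-level search, the `alphas` would be updated by gradient steps on $L_{val}$, alternating with weight updates on $L_{train}$.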
Abstract
Multilingual speaker verification introduces the challenge of verifying a speaker in multiple languages. Existing systems were built using i-vector/x-vector approaches along with Bi-LSTMs, which were trained to discriminate speakers irrespective of the language. Instead of exploring the design space manually, we propose a neural architecture search for multilingual speaker verification suitable for mobile devices, called NeuralMultiling. First, our algorithm searches for an optimal combination of operations within neural cells, with different architectures for normal cells and reduction cells, and then derives a CNN model by stacking the neural cells. Using the derived architecture, we performed two different studies on the publicly available Multilingual Audio-Visual Smartphone (MAVS) dataset: 1) a language-agnostic condition and 2) interoperability between languages and devices. The experimental results suggest that the derived architecture significantly outperforms the existing Autospeech method, with a 5-6% reduction in the Equal Error Rate (EER) and fewer model parameters.
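Since the results are reported as Equal Error Rate, a small sketch of how EER can be estimated from verification scores may help. This is one common threshold-sweep approximation, not the paper's evaluation code; the function name `equal_error_rate` and the toy scores are assumptions made for illustration only.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Return (eer, threshold) at the point where FAR and FRR are closest.

    A trial is accepted when its score is >= threshold.
    FAR = fraction of impostor trials accepted,
    FRR = fraction of genuine trials rejected.
    """
    thresholds = sorted(set(genuine_scores) | set(impostor_scores))
    best = None
    for t in thresholds:
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            # Report the midpoint of FAR and FRR as the EER estimate.
            best = (gap, (far + frr) / 2.0, t)
    return best[1], best[2]


# Toy similarity scores (made up for illustration, not from MAVS).
genuine = [0.9, 0.8, 0.7, 0.4]
impostor = [0.6, 0.3, 0.2, 0.1]

eer, threshold = equal_error_rate(genuine, impostor)
print(f"EER = {eer:.2%} at threshold {threshold}")
```

With these toy scores, one genuine trial falls below and one impostor trial sits at the chosen threshold, so both error rates equal 25%. A lower EER means the two score distributions are better separated.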
