Version Control of Speaker Recognition Systems
Quan Wang, Ignacio Lopez Moreno
TL;DR
This work defines the version-control problem for production speaker recognition by aligning enrollment profiles with evolving speech engines across device-side, server-side, and hybrid deployments. It introduces a lightweight simulation framework, SpeakerVerSim, to evaluate strategy trade-offs under realistic network conditions and workloads. The study finds that server-side double-version updating offers the best balance of availability, latency, and computational cost, while single-version online updating and its variants suffer from version bouncing and latency spikes. The results provide practical guidance for deployment choices and underscore the value of domain-specific simulation in engineering biometric systems, with the SpeakerVerSim framework openly available for reuse.
Abstract
This paper discusses one of the most challenging practical engineering problems in speaker recognition systems: the version control of models and user profiles. A typical speaker recognition system consists of two stages: the enrollment stage, where a profile is generated from user-provided enrollment audio; and the runtime stage, where the voice identity of the runtime audio is compared against the stored profiles. As technology advances, the speaker recognition system needs to be updated for better performance. However, if the stored user profiles are not updated accordingly, the version mismatch between model and profile will produce meaningless recognition results. In this paper, we describe different version control strategies for speaker recognition systems that have been carefully studied at Google over years of engineering practice. These strategies are categorized into three groups according to how they are deployed in the production environment: device-side deployment, server-side deployment, and hybrid deployment. To compare different strategies with quantitative metrics under various network configurations, we present SpeakerVerSim, an easily extensible Python-based simulation framework for different server-side deployment strategies of speaker recognition systems.
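The version-mismatch problem described above can be illustrated with a minimal sketch. This is not code from SpeakerVerSim; the names `Profile`, `VersionMismatchError`, and `score` are hypothetical, and cosine similarity is assumed as the comparison function purely for illustration. The sketch tags each stored profile with the model version that produced it and refuses to compare embeddings from different versions, since such a score would be meaningless.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """Hypothetical stored profile, tagged with the model version that produced it."""
    user_id: str
    model_version: int
    embedding: tuple


class VersionMismatchError(Exception):
    """Raised when profile and runtime embeddings come from different model versions."""


def cosine_similarity(a, b):
    # Assumed scoring function; the source does not specify one.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def score(profile, runtime_embedding, runtime_version):
    """Compare runtime audio against a stored profile, but only within one model version."""
    if profile.model_version != runtime_version:
        # Embeddings from different model versions live in unrelated spaces,
        # so a similarity score between them carries no information.
        raise VersionMismatchError(
            f"profile is v{profile.model_version}, runtime model is v{runtime_version}"
        )
    return cosine_similarity(profile.embedding, runtime_embedding)
```

Under these assumptions, a version control strategy decides what happens when `score` would raise: for example, re-enrolling the profile with the new model before serving the request, or keeping both profile versions available during a transition.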
