Towards a World-English Language Model for On-Device Virtual Assistants
Rricha Jalota, Lyan Verwimp, Markus Nussbaum-Thom, Amr Mousa, Arturo Argueta, Youssef Oualil
TL;DR
The paper addresses the scalability and maintenance challenges of on-device VA language models by proposing a World-English NNLM that unifies the en_US, en_GB, and en_IN dialects. It evaluates adapter-based enhancements within NNLMs built on Fixed-size Ordinally-Forgetting Encoding (FOFE), showing that targeted adapters can capture dialectal variation more efficiently than duplicating entire sub-networks. The authors introduce a novel architecture, AD+CAA+DA, which combines adapter strategies with Mixture FOFE to achieve a favorable accuracy-latency-memory balance, matching or surpassing single-dialect baselines while maintaining on-device feasibility. This work demonstrates that a single World-English model can replace multiple dialect-specific LMs, reducing deployment complexity and energy usage for on-device ASR systems.
Abstract
Neural Network Language Models (NNLMs) for Virtual Assistants (VAs) are generally language-, region-, and in some cases, device-dependent, which increases the effort to scale and maintain them. Combining NNLMs for one or more of the categories is one way to improve scalability. In this work, we combine regional variants of English to build a "World English" NNLM for on-device VAs. In particular, we investigate the application of adapter bottlenecks to model dialect-specific characteristics in our existing production NNLMs and enhance the multi-dialect baselines. We find that adapter modules are more effective in modeling dialects than specializing entire sub-networks. Based on this insight and leveraging the design of our production models, we introduce a new architecture for World English NNLM that meets the accuracy, latency, and memory constraints of our single-dialect models.
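To make the idea of an adapter bottleneck concrete, the following is a minimal sketch of a generic residual bottleneck adapter with one adapter per dialect on top of a shared hidden representation. It is not the architecture proposed in the paper: the class names, the hidden and bottleneck sizes, the ReLU non-linearity, and the zero initialization of the up-projection are all illustrative assumptions.

```python
import numpy as np


class BottleneckAdapter:
    """Generic residual adapter: h + relu(h @ W_down) @ W_up (hypothetical helper)."""

    def __init__(self, hidden_dim, bottleneck_dim, rng):
        # Down-projection into a small bottleneck (assumed initialization scale).
        self.w_down = rng.normal(0.0, 0.02, size=(hidden_dim, bottleneck_dim))
        # Zero-initialized up-projection, so the adapter starts as an identity map.
        self.w_up = np.zeros((bottleneck_dim, hidden_dim))

    def __call__(self, h):
        z = np.maximum(h @ self.w_down, 0.0)  # ReLU inside the bottleneck
        return h + z @ self.w_up              # residual connection

    def num_params(self):
        return self.w_down.size + self.w_up.size


class DialectAdapters:
    """One adapter per dialect; the rest of the network is assumed to be shared."""

    def __init__(self, dialects, hidden_dim, bottleneck_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.adapters = {
            d: BottleneckAdapter(hidden_dim, bottleneck_dim, rng) for d in dialects
        }

    def __call__(self, h, dialect):
        # Route the shared hidden state through the adapter of the given dialect.
        return self.adapters[dialect](h)


# Illustrative sizes only; they are not taken from the paper.
HIDDEN_DIM, BOTTLENECK_DIM = 512, 64
adapters = DialectAdapters(["en_US", "en_GB", "en_IN"], HIDDEN_DIM, BOTTLENECK_DIM)

hidden = np.random.default_rng(1).normal(size=(4, HIDDEN_DIM))  # batch of 4 states
adapted = adapters(hidden, "en_GB")

adapter_params = adapters.adapters["en_GB"].num_params()
full_layer_params = HIDDEN_DIM * HIDDEN_DIM  # a dense hidden layer duplicated per dialect
```

Under these assumed sizes, each dialect adds two 512 x 64 matrices instead of a full 512 x 512 layer, which illustrates why adapters can be cheaper than specializing an entire sub-network per dialect.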
