Towards a World-English Language Model for On-Device Virtual Assistants

Rricha Jalota; Lyan Verwimp; Markus Nussbaum-Thom; Amr Mousa; Arturo Argueta; Youssef Oualil

TL;DR

The paper addresses the scalability and maintenance challenges of on-device VA language models by proposing a World-English NNLM that unifies en_US, en_GB, and en_IN dialects. It evaluates adapter-based enhancements within FOFE-based NNLMs, showing that targeted adapters can capture dialectal variation more efficiently than duplicating entire sub-networks. The authors introduce a novel architecture, AD+CAA+DA, which combines adapter strategies with Mixture FOFE to achieve a favorable accuracy-latency-memory balance, matching or surpassing single-dialect baselines while maintaining on-device feasibility. This work demonstrates that a single World-English model can replace multiple dialect-specific LMs, reducing deployment complexity and energy usage for on-device ASR systems.

Abstract

Neural Network Language Models (NNLMs) for Virtual Assistants (VAs) are generally language-, region-, and in some cases, device-dependent, which increases the effort to scale and maintain them. Combining NNLMs for one or more of the categories is one way to improve scalability. In this work, we combine regional variants of English to build a "World English" NNLM for on-device VAs. In particular, we investigate the application of adapter bottlenecks to model dialect-specific characteristics in our existing production NNLMs and enhance the multi-dialect baselines. We find that adapter modules are more effective in modeling dialects than specializing entire sub-networks. Based on this insight and leveraging the design of our production models, we introduce a new architecture for World English NNLM that meets the accuracy, latency, and memory constraints of our single-dialect models.
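To make the idea of an adapter bottleneck concrete, the following is a minimal sketch in plain Python of a residual bottleneck adapter with one adapter per dialect. It is an illustration under stated assumptions, not the authors' implementation: the class names `BottleneckAdapter` and `DialectAdapters`, the ReLU activation, the zero-initialised up-projection, and the dimensions are all hypothetical choices that this page does not specify.

```python
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu(v):
    """Element-wise ReLU (activation choice is an assumption)."""
    return [max(0.0, a) for a in v]


class BottleneckAdapter:
    """Residual bottleneck: h + W_up * relu(W_down * h)."""

    def __init__(self, hidden_dim, bottleneck_dim, rng):
        # Down-projection to a small bottleneck dimension.
        self.W_down = [[rng.gauss(0.0, 0.02) for _ in range(hidden_dim)]
                       for _ in range(bottleneck_dim)]
        # Assumption: zero-initialised up-projection, so the adapter
        # starts out as the identity function.
        self.W_up = [[0.0] * bottleneck_dim for _ in range(hidden_dim)]

    def __call__(self, h):
        delta = matvec(self.W_up, relu(matvec(self.W_down, h)))
        return [a + b for a, b in zip(h, delta)]


class DialectAdapters:
    """Holds one adapter per dialect and routes by dialect tag."""

    def __init__(self, dialects, hidden_dim, bottleneck_dim, seed=0):
        rng = random.Random(seed)
        self.adapters = {d: BottleneckAdapter(hidden_dim, bottleneck_dim, rng)
                         for d in dialects}

    def __call__(self, h, dialect):
        return self.adapters[dialect](h)


def adapter_params(hidden_dim, bottleneck_dim):
    """Weights in one adapter (biases omitted in this sketch)."""
    return 2 * hidden_dim * bottleneck_dim


adapters = DialectAdapters(["en_US", "en_GB", "en_IN"],
                           hidden_dim=8, bottleneck_dim=2)
hidden = [0.5] * 8
adapted = adapters(hidden, "en_GB")
```

In this sketch each adapter holds `2 * hidden_dim * bottleneck_dim` weights, whereas duplicating a full `hidden_dim` by `hidden_dim` layer per dialect would hold `hidden_dim ** 2`. That difference is one way to see why small adapters can be a cheaper route to dialect specialization than specializing whole sub-networks.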

Paper Structure (7 sections, 1 figure, 3 tables)

Introduction
Model Architecture
Baseline FOFE-based NNLMs
World-English NNLMs
Experimental Setup
Results
Conclusion

Figures (1)

Figure 1: FOFE-based NNLM architectures. Components in blue denote feedforward layers. US, GB, and IN refer to American, British, and Indian English. In the AD+DA and AD+CAA+DA panels, C denotes the Common Dialect Adapter and CAA denotes the Common Application Adapter. Panels, in order: Mixture FOFE model; Multi-dialect AD FOFE (AD); AD FOFE with Dual Adapters (AD+DA); AD FOFE with CAA and Dual Adapters (AD+CAA+DA).
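For readers unfamiliar with the FOFE (Fixed-size Ordinally-Forgetting Encoding) named in the architectures above, the sketch below shows the generic recurrence z_t = alpha * z_{t-1} + e_t over one-hot token vectors. It is only an illustration: the function name `fofe_encode` and the forgetting factor used here are hypothetical, and the production models in the paper may apply the encoding differently (for example, over embeddings rather than one-hot vectors).

```python
def fofe_encode(token_ids, vocab_size, alpha=0.7):
    """Encode a token-id sequence into one fixed-size vector.

    Applies z_t = alpha * z_{t-1} + e_t, where e_t is the one-hot
    vector of the t-th token. The default alpha is an assumed value.
    """
    z = [0.0] * vocab_size
    for token in token_ids:
        z = [alpha * value for value in z]  # decay the history so far
        z[token] += 1.0                     # add the current one-hot token
    return z


# With 0 < alpha < 1, word order is reflected in the resulting vector.
forward = fofe_encode([0, 1], vocab_size=2, alpha=0.5)
backward = fofe_encode([1, 0], vocab_size=2, alpha=0.5)
```

The output size depends only on `vocab_size`, not on the sequence length, which is what allows a feedforward network to consume a variable-length history.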
