Compact Neural TTS Voices for Accessibility
Kunal Jain, Eoin Murphy, Deepanshu Gupta, Jonathan Dyke, Saumya Shah, Vasilieios Tsiaras, Petko Petkov, Alistair Conkie
TL;DR
The paper addresses the challenge of delivering high-quality TTS for accessibility on resource-constrained devices, where conventional neural TTS models are too large or too slow. It proposes a compact three-component pipeline (text processing frontend, FastSpeech2-based acoustic model, WaveRNN vocoder) and applies a set of optimizations (INT8 quantization, weight sharing, sparsity, KV caching, and subscale generation) to reduce disk footprint and latency. Reported results show disk footprint falling from 73 MB to 18 MB and latency from 27 ms to 13 ms, while MOS stays close to baseline (for example, roughly 4.09 vs. 4.19), which supports on-device accessibility use. The work also discusses scaling to multiple voices and languages through a multilingual frontend and points toward future end-to-end models to further improve naturalness and robustness.
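As a rough illustration of why INT8 quantization shrinks a model on disk, the sketch below stores a small weight list as 32-bit floats and as 8-bit integers plus a single scale factor. It is not the paper's implementation: the symmetric per-tensor scheme, the function names, and the toy weights are all assumptions chosen for brevity, and the other optimizations listed above are not modeled.

```python
from array import array


def quantize_int8(weights):
    """Symmetric per-tensor quantization: floats -> int8 codes plus one scale."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    # Map the largest magnitude to 127; use 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = array("b", (max(-127, min(127, round(w / scale))) for w in weights))
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate float weights from the int8 codes."""
    return [c * scale for c in codes]


# Hypothetical toy weights, not taken from the paper.
weights = [0.5, -1.0, 0.25, 0.0, 0.75, -0.125]

fp32 = array("f", weights)            # 4 bytes per weight
codes, scale = quantize_int8(weights)  # 1 byte per weight, plus one scale
restored = dequantize_int8(codes, scale)

fp32_bytes = len(fp32) * fp32.itemsize
int8_bytes = len(codes) * codes.itemsize
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(f"float32: {fp32_bytes} bytes, int8: {int8_bytes} bytes")
print(f"largest round-trip error: {max_error:.6f} (scale = {scale:.6f})")
```

With this scheme, weight storage drops by a factor of four and the round-trip error of each weight is bounded by half the scale. Production systems often use per-channel scales or quantization-aware training to limit quality loss; whether the paper does so is not stated in this summary.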
Abstract
Contemporary text-to-speech solutions for accessibility applications can typically be classified into two categories: (i) device-based statistical parametric speech synthesis (SPSS) or unit selection (USEL) and (ii) cloud-based neural TTS. SPSS and USEL offer low latency and low disk footprint at the expense of naturalness and audio quality. Cloud-based neural TTS systems provide significantly better audio quality and naturalness but regress in terms of latency and responsiveness, rendering them impractical for real-world applications. More recently, neural TTS models were made deployable to run on handheld devices. Nevertheless, their latency remains higher than that of SPSS and USEL, while their disk footprint prohibits pre-installation of multiple voices at once. In this work, we describe a high-quality compact neural TTS system achieving latency on the order of 15 ms with low disk footprint. The proposed solution is capable of running on low-power devices.
