Compact Neural TTS Voices for Accessibility

Kunal Jain, Eoin Murphy, Deepanshu Gupta, Jonathan Dyke, Saumya Shah, Vasilieios Tsiaras, Petko Petkov, Alistair Conkie

TL;DR

The paper addresses high-quality TTS for accessibility on resource-constrained devices, where conventional neural TTS models are too large or too slow. It proposes a compact three-component pipeline (text processing frontend, FastSpeech2-based acoustic model, WaveRNN vocoder) and applies a set of optimizations (INT8 quantization, weight sharing, sparsity, KV caching, and subscale generation) to reduce disk footprint and latency. Reported results show the footprint falling from 73 MB to 18 MB and latency from 27 ms to 13 ms, with MOS remaining near baseline (e.g., ~4.09 vs 4.19), which supports on-device accessibility use. The work also discusses scaling to multiple voices and languages via a multilingual frontend, and points toward future end-to-end models to further improve naturalness and robustness.
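
To make the INT8 quantization step more concrete, the following is a minimal sketch of symmetric per-tensor quantization in plain Python. It is not the paper's implementation: the function names (`quantize_int8`, `dequantize_int8`), the per-tensor scaling scheme, and the random test weights are all assumptions chosen for illustration. The sketch shows that storing one byte per weight instead of four gives a 4x reduction for the quantized tensor, while the rounding error stays within half a quantization step.

```python
import random

def quantize_int8(weights):
    """Symmetric per-tensor quantization (illustrative assumption, not the
    paper's scheme): map floats to integers in [-127, 127] with one scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from the integer codes."""
    return [v * scale for v in q]

# Hypothetical weight tensor: 1000 values drawn from a narrow Gaussian.
random.seed(0)
weights = [random.gauss(0.0, 0.1) for _ in range(1000)]

q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

# Worst-case reconstruction error is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))

# Storage per tensor: 4 bytes per float32 weight vs 1 byte per INT8 code.
fp32_bytes = 4 * len(weights)
int8_bytes = 1 * len(weights)
```

For comparison, the reported footprint change of 73 MB to 18 MB is a factor of roughly 4.06, and the latency change of 27 ms to 13 ms is a factor of roughly 2.08. The summary attributes these gains to the full set of optimizations, so the 4x figure above should be read only as the per-tensor effect of moving from 32-bit to 8-bit storage.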

Abstract

Contemporary text-to-speech solutions for accessibility applications can typically be classified into two categories: (i) device-based statistical parametric speech synthesis (SPSS) or unit selection (USEL) and (ii) cloud-based neural TTS. SPSS and USEL offer low latency and low disk footprint at the expense of naturalness and audio quality. Cloud-based neural TTS systems provide significantly better audio quality and naturalness but regress in terms of latency and responsiveness, rendering these impractical for real-world applications. More recently, neural TTS models were made deployable to run on handheld devices. Nevertheless, latency remains higher than SPSS and USEL, while disk footprint prohibits pre-installation for multiple voices at once. In this work, we describe a high-quality compact neural TTS system achieving latency on the order of 15 ms with low disk footprint. The proposed solution is capable of running on low-power devices.

Paper Structure

This paper contains 13 sections, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Neural Components in the Baseline Architecture
  • Figure 2: Frontend parameter sharing. Matching colors indicate shared parameters. The left side shows the encoder layers and the right side shows the single decoder layer of the optimized frontend. Attention weights are shared across encoder layers as well as between the encoder and the decoder. Bias terms are kept independent for layer-specific adaptation.
  • Figure 3: Subscale WaveRNN produces two samples per iteration. The hidden-to-hidden matrix and the linear layers of the post-nets are sparse matrices, while the input-to-hidden matrix is dense. The hidden dimension is denoted by $h$, while the output dimension is denoted by $d$. Samples are produced by sampling the categorical distribution of the corresponding softmax.
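
The sampling step named in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's vocoder: the recurrent network, the sparse post-nets, and any conditioning of the second sample on the first are not modeled, and the logits are supplied directly. The names `softmax`, `sample_categorical`, and `generate_pair`, as well as the output dimension `d = 8`, are hypothetical.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_categorical(probs, rng):
    """Draw one class index from a categorical distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_pair(logits_first, logits_second, rng):
    """One iteration: draw two output samples, one from each softmax.
    In the real model the logits would come from the network's post-nets;
    here they are passed in directly (assumption)."""
    p_first = softmax(logits_first)
    p_second = softmax(logits_second)
    return sample_categorical(p_first, rng), sample_categorical(p_second, rng)

d = 8  # hypothetical output dimension
rng = random.Random(0)
logits_a = [rng.uniform(-1.0, 1.0) for _ in range(d)]
logits_b = [rng.uniform(-1.0, 1.0) for _ in range(d)]

# Each call yields two sample indices in [0, d), i.e. two samples per iteration.
pair = generate_pair(logits_a, logits_b, rng)
```

Producing two samples per iteration halves the number of sequential iterations needed for a given number of output samples, which is the latency motivation suggested by the caption.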