SynTTS-Commands: A Public Dataset for On-Device KWS via TTS-Synthesized Multilingual Speech

Lu Gan, Xi Li

TL;DR

This work tackles the data scarcity barrier for on-device keyword spotting in TinyML by introducing SynTTS-Commands, a multilingual KWS dataset synthesized with CosyVoice 2. The authors demonstrate that high-quality synthetic speech can achieve near state-of-the-art performance across English and Chinese command recognition using a range of efficient models, with English accuracies above 99% and Chinese near 98%. The dataset combines VoxCeleb speaker embeddings with the Free ST Chinese Mandarin Corpus, uses rigorous quality filtering and language-specific partitions, and is publicly released. By enabling scalable, private, and low-latency voice interfaces on edge devices, this work lays a foundation for expanding multilingual wake-word systems and domain coverage in the TinyML era.

Abstract

The development of high-performance, on-device keyword spotting (KWS) systems for ultra-low-power hardware is critically constrained by the scarcity of specialized, multi-command training datasets. Traditional data collection through human recording is costly, slow, and hard to scale. This paper introduces SynTTS-Commands, a novel multilingual voice command dataset generated entirely with state-of-the-art Text-to-Speech (TTS) synthesis. By leveraging the CosyVoice 2 model and speaker embeddings from public corpora, we created a scalable collection of English and Chinese commands. Extensive benchmarking across a range of efficient acoustic models demonstrates that our synthetic dataset enables exceptional accuracy, achieving up to 99.5% on English and 98% on Chinese command recognition. These results robustly validate that synthetic speech can effectively replace human-recorded audio for training KWS classifiers. Our work directly addresses the data bottleneck in TinyML, providing a practical, scalable foundation for building private, low-latency, and energy-efficient voice interfaces on resource-constrained edge devices. The dataset and source code are publicly available at https://github.com/lugan113/SynTTS-Commands-Official.

Paper Structure

This paper contains 20 sections, 1 equation, 7 tables.