WEST: LLM based Speech Toolkit for Speech Understanding, Generation, and Interaction
Binbin Zhang, Chengdong Liang, Shuai Wang, Xuelong Geng, Zhao Guo, Haoyu Li, Hao Yin, Xipeng Yang, Pengshen Zhang, Changwei Ma, Lei Xie
TL;DR
WEST introduces a fully LLM-based, open-source, full-stack speech toolkit that unifies speech recognition, synthesis, understanding, dialogue, and multimodal interaction under a single framework. It defines data formats for pre-training and fine-tuning, employs sequence packing to improve training efficiency, and ships a family of built-in models (TouchASU, TouchTTS, TouchChat, TouchChat2, TouchOmni) along with support for open-source models. Experimental results across ASR, QA, TTS, and speech chat demonstrate competitive performance and practical viability, while data-pack experiments highlight substantial training speedups. The work emphasizes reproducibility and accessible deployment, with an explicit roadmap toward a stable 1.0 release and ongoing improvements across data, models, and evaluation.
Abstract
In this paper, we present WEST (WE Speech Toolkit), a speech toolkit based on a large language model (LLM) for speech understanding, generation, and interaction. WEST has three key features: 1) Fully LLM-based: Standing on the shoulders of giants by reusing mature architectures, ecosystems (e.g., Hugging Face), and methods (e.g., sequence packing) from large models. 2) Full-stack: Supports tasks such as recognition, synthesis, understanding, dialogue, and multimodal capabilities, with extensibility to incorporate open-source models. 3) Simple and Stupid: A simple and stupid speech toolkit that everyone can Touch. In addition, WEST provides two types of recipes, models, and experimental results. The first is based entirely on open-source models and open-source data, allowing users to fully reproduce the experiments in this paper and serving as a verification system or minimal system baseline. The second is trained on massive data and offers superior performance, so users can apply it directly out of the box. WEST is publicly available at https://github.com/wenet-e2e/west/
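The abstract names sequence packing as one of the methods reused from large-model training. The sketch below is a generic illustration of the idea, not WEST's actual implementation: it assumes each training sample is described only by its token length and uses a greedy first-fit-decreasing heuristic to group samples into packs that fit a fixed context length. The function name `pack_sequences` and the parameter `max_len` are hypothetical.

```python
from typing import List


def pack_sequences(lengths: List[int], max_len: int) -> List[List[int]]:
    """Group sample indices into packs whose total length is <= max_len.

    Generic first-fit-decreasing heuristic (an assumption for illustration,
    not taken from the WEST code base).
    """
    # Visit samples from longest to shortest.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    packs: List[List[int]] = []   # sample indices per pack
    remaining: List[int] = []     # free token slots per pack
    for i in order:
        n = lengths[i]
        if n > max_len:
            raise ValueError(f"sample {i} has length {n} > max_len {max_len}")
        # Place the sample in the first pack with enough free space.
        for p, free in enumerate(remaining):
            if n <= free:
                packs[p].append(i)
                remaining[p] -= n
                break
        else:
            # No existing pack fits, so open a new one.
            packs.append([i])
            remaining.append(max_len - n)
    return packs


lengths = [5, 3, 2, 7, 1]
packs = pack_sequences(lengths, max_len=8)
print(packs)  # [[3, 4], [0, 1], [2]]
```

In this toy example, five samples padded individually to length 8 would occupy 40 token slots, while the three packs occupy 24. In a real training loop, packed samples typically also need attention masks or position IDs that prevent tokens from attending across sample boundaries; that part is omitted here.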
