ESPnet-EZ: Python-only ESPnet for Easy Fine-tuning and Integration
Masao Someki, Kwanghee Choi, Siddhant Arora, William Chen, Samuele Cornell, Jionghao Han, Yifan Peng, Jiatong Shi, Vaibhav Srivastav, Shinji Watanabe
TL;DR
ESPnet-EZ is a Python-only extension to ESPnet that removes Kaldi-style Bash dependencies so that models can be fine-tuned easily and integrated with Python-centric ML stacks. The design centers on a Trainer and an ESPnetEZDataset class, which provide modular, data-driven tooling and Python-based data loading compatible with PyTorch Lightning, Hugging Face Transformers and Datasets, and Lhotse. For fine-tuning a speech foundation model, ESPnet-EZ reduces newly written code by 2.7x and dependent code by 6.7x relative to ESPnet, while retaining broad task coverage across ASR, ST, SLU, TTS, and low-resource languages; this is shown through multiple fine-tuning scenarios and live integration demos. A Python-native layer can thus simplify practical model fine-tuning and deployment without sacrificing ESPnet’s multi-task reach, enabling faster experimentation and easier adoption in standard ML workflows.
Abstract
We introduce ESPnet-EZ, an extension of the open-source speech processing toolkit ESPnet, aimed at quick and easy development of speech models. ESPnet-EZ focuses on two major aspects: (i) easy fine-tuning and inference of existing ESPnet models on various tasks and (ii) easy integration with popular deep neural network frameworks such as PyTorch Lightning, Hugging Face Transformers and Datasets, and Lhotse. By replacing ESPnet design choices inherited from Kaldi with a Python-only, Bash-free interface, we dramatically reduce the effort required to build, debug, and use a new model. For example, when fine-tuning a speech foundation model, ESPnet-EZ reduces the amount of newly written code by 2.7x and the amount of dependent code by 6.7x compared to ESPnet, while also sharply reducing the dependence on Bash scripts. The codebase of ESPnet-EZ is publicly available.
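To make the idea of Python-based, data-driven loading concrete, the following is a minimal sketch of a dataset wrapper in that spirit. It is not the ESPnet-EZ implementation: the class name `EZStyleDataset`, the `data_info` mapping, and the field names `speech` and `text` are hypothetical and chosen only for illustration. The sketch assumes that any indexable Python object (a list here, but equally a Hugging Face or Lhotse dataset) can serve as the data source, and that each model input is described by a plain function over one record.

```python
from typing import Any, Callable, Dict, Sequence, Tuple


class EZStyleDataset:
    """Hypothetical data-driven wrapper (illustrative only, not the ESPnet-EZ class)."""

    def __init__(
        self,
        dataset: Sequence[Any],
        data_info: Dict[str, Callable[[Any], Any]],
    ) -> None:
        # Any indexable collection of records can act as the data source.
        self.dataset = dataset
        # Maps each model input name to a function that extracts it from a record.
        self.data_info = data_info

    def __len__(self) -> int:
        return len(self.dataset)

    def __getitem__(self, idx: int) -> Tuple[str, Dict[str, Any]]:
        record = self.dataset[idx]
        # Return an utterance ID plus the extracted fields for this record.
        fields = {name: extract(record) for name, extract in self.data_info.items()}
        return str(idx), fields


# Toy records standing in for rows of an arbitrary Python-side corpus.
records = [
    {"audio": [0.0, 0.1, -0.1], "sentence": "hello world"},
    {"audio": [0.2, 0.0], "sentence": "good morning"},
]

# Field extraction is declared in Python rather than in Bash-side data preparation.
data_info = {
    "speech": lambda record: record["audio"],
    "text": lambda record: record["sentence"].upper(),
}

train_dataset = EZStyleDataset(records, data_info)
uid, sample = train_dataset[1]
print(uid, sample)
```

Under these assumptions, switching corpora only requires changing `records` and the functions in `data_info`; no intermediate files or shell scripts are involved.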
