ESPnet-EZ: Python-only ESPnet for Easy Fine-tuning and Integration

Masao Someki, Kwanghee Choi, Siddhant Arora, William Chen, Samuele Cornell, Jionghao Han, Yifan Peng, Jiatong Shi, Vaibhav Srivastav, Shinji Watanabe

TL;DR

ESPnet-EZ is a Python-only extension to ESPnet that removes Kaldi-style Bash dependencies, making it easier to fine-tune models and integrate them with Python-centric ML stacks. The design centers on a Trainer and ESPNetEZDataset, which provide modular tooling and Python-based data loading compatible with PyTorch Lightning, Hugging Face Transformers and Datasets, and Lhotse. When fine-tuning a speech foundation model, ESPnet-EZ reduces newly written code by 2.7x and dependent code by 6.7x relative to ESPnet, while keeping broad task coverage across ASR, ST, SLU, TTS, and low-resource languages, as demonstrated through multiple fine-tuning scenarios and live integration demos. The work shows that a Python-native layer can simplify practical model fine-tuning and deployment without sacrificing ESPnet’s multi-task reach, enabling faster experimentation and easier adoption in standard ML workflows.

Abstract

We introduce ESPnet-EZ, an extension of the open-source speech processing toolkit ESPnet, aimed at quick and easy development of speech models. ESPnet-EZ focuses on two major aspects: (i) easy fine-tuning and inference of existing ESPnet models on various tasks and (ii) easy integration with popular deep neural network frameworks such as PyTorch Lightning, Hugging Face Transformers and Datasets, and Lhotse. By replacing ESPnet design choices inherited from Kaldi with a Python-only, Bash-free interface, we dramatically reduce the effort required to build, debug, and use a new model. For example, to fine-tune a speech foundation model, ESPnet-EZ reduces the amount of newly written code by 2.7x and the amount of dependent code by 6.7x compared to ESPnet, while sharply reducing the dependence on Bash scripts. The codebase of ESPnet-EZ is publicly available.

Paper Structure (28 sections, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Quantitative comparison of ESPnet and ESPnet-EZ. We compare the case of fine-tuning OWSM, a speech foundation model, for automatic speech recognition on a custom dataset. We use three criteria: (a) the number of new source code lines the user has to write, (b) the number of dependent source code files, and (c) the number of dependent source code lines for each programming/scripting language. We observe that ESPnet-EZ significantly reduces engineering effort compared to the original ESPnet. Newly written lines are reduced by 2.7x, and the dependent code lines and number of files are reduced by 6.7x and 6.6x, respectively. Further, ESPnet-EZ dramatically reduces the dependency on Bash and Perl.
  • Figure 2: Comparison of ESPnet and ESPnet-EZ when fine-tuning a model on a custom dataset. ESPnet requires going through dozens of shell scripts and custom modifications, whereas all the code in ESPnet-EZ fits within a single Python script.
  • Figure 3: Comparison of the ESPnet and ESPnet-EZ call stacks for feeding the training data. ESPnet has to format the dataset in Kaldi style and dump it to local directories, which introduces an implicit dependency between the data preparation step and the training step. In contrast, ESPnet-EZ avoids this stateful dependency by preparing the dataset on the fly from Python-based data loaders.
  • Figure 4: Summary of user feedback on the benefits of using ESPnet-EZ or ESPnet.
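
The contrast described in the Figure 3 caption (dumping Kaldi-style files to disk versus preparing samples on the fly) can be illustrated with a minimal, dependency-free sketch. This is not the ESPnet-EZ API: the class `OnTheFlyDataset`, the `data_info` mapping, and the toy records are hypothetical names chosen for illustration. The only idea carried over from the text is that each sample is built from a Python object at access time, so training does not depend on files written by an earlier preparation stage.

```python
from typing import Any, Callable, Dict, List, Tuple

Record = Dict[str, Any]


class OnTheFlyDataset:
    """Hypothetical wrapper (not the ESPnet-EZ class): builds samples at access time."""

    def __init__(
        self,
        records: List[Record],
        data_info: Dict[str, Callable[[Record], Any]],
    ) -> None:
        self.records = records
        # Maps an output field name to a function that extracts it from a record.
        self.data_info = data_info

    def __len__(self) -> int:
        return len(self.records)

    def __getitem__(self, idx: int) -> Tuple[str, Dict[str, Any]]:
        record = self.records[idx]
        # Every field is computed when requested; nothing is dumped to disk,
        # so there is no hidden state shared with a separate preparation step.
        sample = {name: fn(record) for name, fn in self.data_info.items()}
        return str(idx), sample


# Toy in-memory corpus (assumption: real data would be audio arrays or paths).
records = [
    {"audio": [0.0, 0.1, -0.1], "transcript": "Hello World"},
    {"audio": [0.2, 0.0], "transcript": "Good Morning"},
]

# Field extractors (assumption: real ones would also tokenize and featurize).
data_info = {
    "speech": lambda r: list(r["audio"]),
    "text": lambda r: r["transcript"].lower(),
}

dataset = OnTheFlyDataset(records, data_info)
uid, sample = dataset[1]
```

Because the wrapper only needs `__len__` and `__getitem__`, the same pattern could in principle sit on top of any indexable Python data source, which is the property that makes integration with external data-loading libraries straightforward.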