OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer
Yifan Peng, Jinchuan Tian, William Chen, Siddhant Arora, Brian Yan, Yui Sudo, Muhammad Shakeel, Kwanghee Choi, Jiatong Shi, Xuankai Chang, Jee-weon Jung, Shinji Watanabe
TL;DR
This paper introduces OWSM v3.1, a family of open Whisper-style speech models that replaces the standard Transformer encoder of earlier OWSM versions with E-Branchformer, improving both performance and efficiency without expanding the training data. Across English and multilingual ASR, speech translation, LID, SLU, long-form ASR, and zero-shot contextual biasing, the authors show that OWSM v3.1 outperforms OWSM v3 on most benchmarks, runs inference up to about 25% faster, and surpasses Whisper on several benchmarks. They further reveal an emergent zero-shot contextual biasing capability: recognition improves when the model is conditioned on textual prompts, particularly for larger models. The work emphasizes openness by releasing code, pre-trained weights, and training logs, and it discusses data diversity and model compression as future directions for making open speech foundation models more accessible and widely applicable.
Abstract
Recent studies have highlighted the importance of fully open foundation models. The Open Whisper-style Speech Model (OWSM) is an initial step towards reproducing OpenAI Whisper using public data and open-source toolkits. However, previous versions of OWSM (v1 to v3) are still based on the standard Transformer, which might lead to inferior performance compared to state-of-the-art speech encoder architectures. This work aims to improve the performance and efficiency of OWSM without additional data. We present a series of E-Branchformer-based models named OWSM v3.1, ranging from 100M to 1B parameters. OWSM v3.1 outperforms its predecessor, OWSM v3, on most evaluation benchmarks, while achieving up to 25% faster inference. We further reveal the emergent ability of OWSM v3.1 in zero-shot contextual biasing speech recognition. We also provide a model trained on a subset of data with low license restrictions. We will publicly release the code, pre-trained models, and training logs.
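
The abstract does not say how the "up to 25%" speed figure is measured, so it may help to see how such a percentage can be read. The sketch below assumes the figure is a relative increase in throughput computed from wall-clock decoding time on the same test set. The function names and the timings are hypothetical and are not taken from the paper.

```python
def relative_speedup(baseline_seconds: float, new_seconds: float) -> float:
    """Fractional throughput gain of the new model over the baseline.

    A return value of 0.25 means the new model processes the same audio
    25% faster (1.25x throughput). This reading is an assumption.
    """
    if baseline_seconds <= 0 or new_seconds <= 0:
        raise ValueError("durations must be positive")
    return baseline_seconds / new_seconds - 1.0


def time_saved_fraction(baseline_seconds: float, new_seconds: float) -> float:
    """Fraction of wall-clock decoding time saved relative to the baseline."""
    if baseline_seconds <= 0 or new_seconds <= 0:
        raise ValueError("durations must be positive")
    return 1.0 - new_seconds / baseline_seconds


# Hypothetical timings (seconds to decode the same test set).
baseline = 100.0  # stand-in for an older model
faster = 80.0     # stand-in for a newer model

print(f"throughput gain: {relative_speedup(baseline, faster):.0%}")
print(f"time saved:      {time_saved_fraction(baseline, faster):.0%}")
```

Under this reading, a 25% gain in speed corresponds to a 20% reduction in decoding time, so the two numbers should not be used interchangeably when comparing models.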
