On the Effects of Heterogeneous Data Sources on Speech-to-Text Foundation Models

Jinchuan Tian, Yifan Peng, William Chen, Kwanghee Choi, Karen Livescu, Shinji Watanabe

TL;DR

Heterogeneous preprocessing across 25 public speech datasets complicates training of Open Whisper-style S2T foundation models. The authors introduce OWSM v3.2, which combines proxy-task data filtering to improve data quality with LLM-based punctuation and true-casing restoration, while keeping the architecture and reducing training data by about 15%. Results show marked gains in speech translation performance and long-form robustness, with outputs that better align with written language, though some language-specific nuances depend on the reference quality of the LLM. Overall, the work demonstrates that data quality and formatting can drive substantial improvements in open, transparent S2T systems without increasing model size.

Abstract

The Open Whisper-style Speech Model (OWSM) series was introduced to achieve full transparency in building advanced speech-to-text (S2T) foundation models. To this end, OWSM models are trained on 25 public speech datasets, which are heterogeneous in multiple ways. In this study, we advance the OWSM series by introducing OWSM v3.2, which improves on prior models by investigating and addressing the impacts of this data heterogeneity. Our study begins with a detailed analysis of each dataset, from which we derive two key strategies: data filtering with a proxy task to enhance data quality, and the incorporation of punctuation and true-casing using an open large language model (LLM). With all other configurations kept the same, OWSM v3.2 improves performance over the OWSM v3.1 baseline while using 15% less training data.
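
The abstract names proxy-task data filtering but does not specify the proxy task, the score, or the threshold. As a minimal sketch only, assume each training sample already carries a scalar error score produced by some proxy model, and that samples above a chosen cutoff are dropped. The names `Utterance`, `proxy_error`, `filter_by_proxy_score`, and `max_error` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    utt_id: str
    dataset: str
    # Hypothetical field: error of an assumed proxy model on this sample
    # (lower is better). The paper's actual proxy task is not given here.
    proxy_error: float


def filter_by_proxy_score(utterances: List[Utterance], max_error: float) -> List[Utterance]:
    """Keep only utterances whose proxy error does not exceed max_error."""
    return [u for u in utterances if u.proxy_error <= max_error]


# Toy data with made-up scores, purely for illustration.
samples = [
    Utterance("a1", "set_a", 0.05),
    Utterance("a2", "set_a", 0.80),
    Utterance("b1", "set_b", 0.20),
    Utterance("b2", "set_b", 1.50),
]

kept = filter_by_proxy_score(samples, max_error=0.5)
kept_ids = [u.utt_id for u in kept]
```

In this toy run the cutoff of 0.5 is arbitrary; in practice the threshold would have to be chosen per dataset or per task, which the abstract does not describe.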

Paper Structure (14 sections, 1 figure, 6 tables)