Beyond Performance Plateaus: A Comprehensive Study on Scalability in Speech Enhancement

Wangyou Zhang, Kohei Saijo, Jee-weon Jung, Chenda Li, Shinji Watanabe, Yanmin Qian

TL;DR

This work explores the scalability of SE models in terms of architectures, model sizes, compute budgets, and dataset sizes to provide insights into the under-explored SE directions, e.g., larger-scale multi-domain corpora and efficiently scalable architectures.

Abstract

Deep learning-based speech enhancement (SE) models have achieved impressive performance in the past decade. Numerous advanced architectures have been designed to deliver state-of-the-art performance; however, their scalability remains largely unexplored. Meanwhile, most research focuses on small datasets with limited diversity, leading to a plateau in performance improvement. In this paper, we aim to provide new insights into these issues by exploring the scalability of SE models in terms of architectures, model sizes, compute budgets, and dataset sizes. Our investigation involves several popular SE architectures and speech data from different domains. Experiments reveal both similarities and distinctions between the scaling effects in SE and in other tasks such as speech recognition. These findings further provide insights into under-explored SE directions, e.g., larger-scale multi-domain corpora and efficiently scalable architectures.

Paper Structure

This paper contains 10 sections, 1 equation, 3 figures, and 4 tables.

Figures (3)

  • Figure 1: Scaling effect of BSRNN with respect to model complexity (#MACs at 48 kHz). Each data point corresponds to an independent model. Causal model setups: (a)–(e). Non-causal model setups: (f)–(j).
  • Figure 2: Scaling effect of non-causal BSRNN with respect to dataset sizes (#Data). Main results are shown in white regions, while shaded regions illustrate the degradation caused by improper data scaling (detailed in § \ref{ssec:exp_causality}).
  • Figure 3: Scaling effect of non-causal SE models with different architectures on 157 h of training data.