What do neural networks listen to? Exploring the crucial bands in Speech Enhancement using Sinc-convolution

Kuan-Hsun Ho, Jeih-weih Hung, Berlin Chen

TL;DR

The paper asks which frequency components speech enhancement (SE) models prioritize, and introduces a reformulated Sinc-conv encoder that learns its cutoff frequencies through normalized parameters and applies a gain to each filter. Constraining the cutoffs with Nyquist-aware normalization and offering several initialization options improves training efficiency and interpretability, and the encoder boosts SE performance when integrated into Conv-TasNet and MANNER. Empirical results show notable PESQ, STOI, and SI-SNR gains and a substantial parameter reduction for Conv-TasNet, and ablations highlight the importance of initialization and normalization. CFR analysis reveals that the reformed Sinc-conv shifts attention toward mid-frequency bands, which gives practical insight into the operating dynamics of SE models and supports more informed design of learnable filterbanks.

Abstract

This study introduces a reformed Sinc-convolution (Sinc-conv) framework tailored for the encoder component of deep networks for speech enhancement (SE). The reformed Sinc-conv, based on parametrized sinc functions as band-pass filters, offers notable advantages in terms of training efficiency, filter diversity, and interpretability. The reformed Sinc-conv is evaluated in conjunction with various SE models, showcasing its ability to boost SE performance. Furthermore, the reformed Sinc-conv provides valuable insights into the specific frequency components that are prioritized in an SE scenario. This opens up a new direction of SE research and improves our knowledge of the operating dynamics of SE models.
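
To make "parametrized sinc functions as band-pass filters" concrete, the following is a minimal NumPy sketch of one such filter, built as the difference of two windowed ideal low-pass filters and scaled by a gain. It is an illustration under stated assumptions, not the authors' exact formulation: the function name `sinc_bandpass`, the parameters `f_low`, `f_high`, `gain`, and `kernel_size`, and the choice of a Hamming window are all hypothetical. Cutoffs are expressed in cycles per sample, so the Nyquist limit is 0.5.

```python
import numpy as np


def sinc_bandpass(f_low, f_high, gain=1.0, kernel_size=129):
    """Windowed-sinc band-pass filter (illustrative sketch, not the paper's code).

    f_low, f_high: normalized cutoffs in cycles/sample, 0 <= f_low < f_high <= 0.5.
    gain: scalar applied to the whole filter (a stand-in for a per-filter gain).
    kernel_size: odd filter length, so the kernel is symmetric around its centre.
    """
    if not (0.0 <= f_low < f_high <= 0.5):
        raise ValueError("need 0 <= f_low < f_high <= 0.5 (Nyquist)")
    if kernel_size % 2 == 0:
        raise ValueError("kernel_size must be odd")

    # Sample indices centred on zero: -(K-1)/2, ..., 0, ..., (K-1)/2.
    n = np.arange(kernel_size) - (kernel_size - 1) / 2

    # An ideal low-pass with cutoff f has impulse response 2f * sinc(2f * n),
    # where np.sinc(x) = sin(pi x) / (pi x). A band-pass is the difference
    # of the low-pass at f_high and the low-pass at f_low.
    h = 2 * f_high * np.sinc(2 * f_high * n) - 2 * f_low * np.sinc(2 * f_low * n)

    # Taper the truncated response to reduce ripple (window choice is an assumption).
    h = h * np.hamming(kernel_size)
    return gain * h


# Usage: a filter passing roughly 0.1 to 0.2 cycles/sample with gain 2.
h = sinc_bandpass(0.1, 0.2, gain=2.0)
magnitude = np.abs(np.fft.rfft(h, 1024))  # bin k corresponds to k / 1024 cycles/sample
```

In a learnable encoder, only the two cutoffs and the gain would be trained for each filter, which is why such a filterbank needs far fewer parameters than a free convolution kernel of the same length and why the learned bands can be read off directly.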

Paper Structure (11 sections, 6 equations, 2 figures, 3 tables)

Figures (2)

  • Figure 1: CFR.
  • Figure 2: The cutoff frequencies and BGs, sorted by the lower frequency.