RelUNet: Relative Channel Fusion U-Net for Multichannel Speech Enhancement

Ibrahim Aldarmaki, Thamar Solorio, Bhiksha Raj, Hanan Aldarmaki

TL;DR

A novel modification of U-Net-based multi-channel speech enhancement models that incorporates relative information from the outset: each channel is processed in conjunction with a reference channel through stacking, which captures crucial spatial information and improves overall enhancement performance.

Abstract

Neural multi-channel speech enhancement models, in particular those based on the U-Net architecture, demonstrate promising performance and generalization potential. These models typically encode input channels independently and integrate the channels during later stages of the network. In this paper, we propose a novel modification of these models that incorporates relative information from the outset: each channel is processed in conjunction with a reference channel through stacking. This input strategy exploits comparative differences to adaptively fuse information between channels, thereby capturing crucial spatial information and enhancing overall performance. Experiments on the CHiME-3 dataset demonstrate improvements in speech enhancement metrics across various architectures.
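
The stacking step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `stack_with_reference`, the `(C, F, T)` input layout, the choice of microphone 0 as reference, and the array sizes are all assumptions made for illustration, and the abstract does not say how the reference channel itself is handled (here it is simply paired with itself).

```python
import numpy as np


def stack_with_reference(x: np.ndarray, ref: int = 0) -> np.ndarray:
    """Pair every channel with a reference channel (hypothetical helper).

    x   : array of shape (C, F, T), one spectrogram per microphone
          (assumed layout; the paper's actual input format may differ).
    ref : index of the reference microphone (assumed default of 0).

    Returns an array of shape (C, 2, F, T) where out[c, 0] is channel c
    and out[c, 1] is the reference channel.
    """
    if x.ndim != 3:
        raise ValueError("expected an array of shape (C, F, T)")
    if not 0 <= ref < x.shape[0]:
        raise IndexError("reference index out of range")
    # Repeat the reference spectrogram once per channel without copying.
    reference = np.broadcast_to(x[ref], x.shape)
    # Stack each channel with the reference along a new axis of size 2.
    return np.stack([x, reference], axis=1)


# Illustrative sizes only: 6 microphones, 257 frequency bins, 100 frames.
rng = np.random.default_rng(0)
mics = rng.standard_normal((6, 257, 100))
pairs = stack_with_reference(mics, ref=0)
print(pairs.shape)  # (6, 2, 257, 100)
```

Each of the resulting two-plane inputs could then be fed to an encoder, so that the network sees a channel alongside the reference from the first layer rather than only at a later fusion stage.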

Paper Structure (15 sections, 10 equations, 2 figures, 2 tables)

This paper contains 15 sections, 10 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Illustration of the proposed multi-channel speech enhancement method using a RelUNet.
  • Figure 2: Spectrogram illustration of the RelUNet Conv. using different numbers of channels.