ConcateNet: Dialogue Separation Using Local And Global Feature Concatenation

Mhd Modar Halimeh, Matteo Torcoli, Emanuël Habets

TL;DR

ConcateNet tackles dialogue separation under broadcast variability by explicitly incorporating local (narrowband) and global (broadband) feature processing via parallel branches, enabling robust generalization to out-of-domain signals. The method combines a two-stage masking framework with a nonlinear refinement (NLR) to improve multi-frame complex-valued masking, and employs a Gammatone-based spectral representation within an encoder–decoder backbone featuring F-GRU and T-GRU modules. Empirical results show competitive performance on standard noise-reduction datasets and a marked generalization advantage on a broadcast dataset, with NLR providing consistent gains across metrics such as SI-SDR, SI-SIR, 2f-model, and STOI. The work suggests that preserving local features and enabling flexible local/global dependencies can significantly enhance practical dialogue separation in real-world broadcasting scenarios.
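
Of the metrics named above, SI-SDR has a compact closed-form definition that is easy to illustrate. The following is a minimal NumPy sketch of the standard scale-invariant signal-to-distortion ratio, not the authors' evaluation code; the function name `si_sdr`, the `eps` stabilizer, and the synthetic white-noise signals are assumptions made for illustration only.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant SDR in dB between a 1-D estimate and a 1-D reference.

    Hypothetical helper for illustration; `eps` only guards against division by zero.
    """
    estimate = np.asarray(estimate, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)

    # Remove the mean so that a DC offset does not affect the score.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()

    # Project the estimate onto the reference to get the scaled target component.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference

    # Everything left over counts as distortion.
    residual = estimate - target

    ratio = (np.dot(target, target) + eps) / (np.dot(residual, residual) + eps)
    return 10.0 * np.log10(ratio)


# Synthetic stand-ins (assumed): white noise in place of real dialogue and background.
rng = np.random.default_rng(0)
dialogue = rng.standard_normal(16000)
background = rng.standard_normal(16000)
mixture = dialogue + 0.5 * background

score = si_sdr(mixture, dialogue)
rescaled_score = si_sdr(3.0 * mixture, dialogue)
```

Rescaling the estimate leaves the score unchanged, which is why the metric is preferred when the output level of a separation system is not controlled.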

Abstract

Dialogue separation involves isolating a dialogue signal from a mixture, such as a movie or a TV program. This can be a necessary step to enable dialogue enhancement for broadcast-related applications. In this paper, ConcateNet is proposed for dialogue separation. It is based on a novel approach to processing local and global features that is aimed at better generalization to out-of-domain signals. ConcateNet is trained on a publicly available, noise-reduction-focused dataset and evaluated on three datasets: two noise-reduction-focused datasets (in-domain), on which ConcateNet performs competitively, and a broadcast-focused dataset (out-of-domain), on which the proposed architecture generalizes better than the state-of-the-art noise-reduction methods considered.

Paper Structure (14 sections, 5 equations, 1 figure, 1 table)