Improving DF-Conformer Using Hydra For High-Fidelity Generative Speech Enhancement on Discrete Codec Token

Shogo Seki, Shaoxiang Dang, Li Li

TL;DR

The study targets high-fidelity speech enhancement with Genhancer, revealing limitations of FAVOR+-based DF-Conformer in capturing global dependencies under linear attention. It introduces DC-Hydra, replacing FAVOR+ with Hydra within a matrix-mixer state-space framework to enable bidirectional modeling while preserving linear complexity $O(Trd)$; this is applied to Genhancer’s discrete-token generation pipeline. Empirical results show DC-Hydra (Hydra) outperforms FAVOR+ and Bi-Mamba across key metrics, achieving competitive CAcc and superior robustness to longer input sequences, even as Softmax remains strong on some perceptual metrics. The work demonstrates that bidirectional state-space sequence models can effectively replace approximate softmax attention in generative SE on discrete tokens, offering scalability and improved global sequence modeling for practical deployments.

Abstract

The Dilated FAVOR Conformer (DF-Conformer) is an efficient variant of the Conformer architecture designed for speech enhancement (SE). It employs fast attention through positive orthogonal random features (FAVOR+) to mitigate the quadratic complexity associated with self-attention, while utilizing dilated convolution to expand the receptive field. This combination results in impressive performance across various SE models. In this paper, we propose replacing FAVOR+ with bidirectional selective structured state-space sequence models to achieve two main objectives: (1) enhancing global sequential modeling by eliminating the approximations inherent in FAVOR+, and (2) maintaining linear complexity relative to the sequence length. Specifically, we utilize Hydra, a bidirectional extension of Mamba, framed within the structured matrix mixer framework. Experiments conducted using a generative SE model on discrete codec tokens, known as Genhancer, demonstrate that the proposed method surpasses the performance of the DF-Conformer.
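To make the approximation mentioned in the abstract concrete, the following is a minimal NumPy sketch of attention with positive random features, compared against exact softmax attention. It is an illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, the random projections are plain i.i.d. Gaussian rather than orthogonalized as in FAVOR+, and `T`, `d`, and `r` are assumed here to denote the sequence length, feature dimension, and number of random features.

```python
import numpy as np


def positive_random_features(x, w):
    """Map x (T, d) to positive features (T, r) with projections w (r, d).

    Uses phi(x) = exp(w.x - |x|^2 / 2) / sqrt(r), whose inner products
    give an unbiased estimate of exp(q.k) when w ~ N(0, I).
    """
    r = w.shape[0]
    proj = x @ w.T
    half_sq_norm = 0.5 * np.sum(x ** 2, axis=-1, keepdims=True)
    return np.exp(proj - half_sq_norm) / np.sqrt(r)


def linear_attention(q, k, v, w):
    """Approximate softmax attention without forming the T x T matrix."""
    d = q.shape[-1]
    scale = d ** -0.25  # splits the usual 1/sqrt(d) between q and k
    q_f = positive_random_features(q * scale, w)
    k_f = positive_random_features(k * scale, w)
    kv = k_f.T @ v                      # (r, d_v): cost grows linearly in T
    normalizer = q_f @ k_f.sum(axis=0)  # (T,): row sums of implicit weights
    return (q_f @ kv) / normalizer[:, None]


def softmax_attention(q, k, v):
    """Exact attention; builds the full T x T weight matrix."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


rng = np.random.default_rng(0)
T, d, r = 64, 16, 4096  # illustrative sizes, not taken from the paper
q = 0.5 * rng.normal(size=(T, d))
k = 0.5 * rng.normal(size=(T, d))
v = rng.normal(size=(T, d))
w = rng.normal(size=(r, d))  # i.i.d. Gaussian; FAVOR+ orthogonalizes these

approx = linear_attention(q, k, v, w)
exact = softmax_attention(q, k, v)
print("max abs deviation from softmax attention:", np.max(np.abs(approx - exact)))
```

In this sketch the output only approaches the softmax result as `r` grows, which is the kind of approximation the proposed method aims to remove by switching to a bidirectional state-space mixer while keeping cost linear in `T`.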

Paper Structure

This paper contains 13 sections, 3 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: Overview of Genhancer
  • Figure 2: Examples of attention maps averaged over heads, along with corresponding ranks in different layers, obtained using softmax attention (1st row) and FAVOR+ (2nd row). Histogram of the $L_2$-norm difference between attention vectors for different queries (3rd row).
  • Figure 3: Architecture of DC-Hydra.
  • Figure 4: Character accuracies (CAccs) on different sequence lengths, with the bubble size indicating GPU memory usage in the token generator $\mathcal{G}$.