MAGE: A Coarse-to-Fine Speech Enhancer with Masked Generative Model

The Hieu Pham, Tan Dat Nguyen, Phuong Thanh Tran, Joon Son Chung, Duc Dung Nguyen

Abstract

Speech enhancement remains challenging due to the trade-off between efficiency and perceptual quality. In this paper, we introduce MAGE, a Masked Audio Generative Enhancer that advances generative speech enhancement through a compact and robust design. Unlike prior masked generative models with random masking, MAGE employs a scarcity-aware coarse-to-fine masking strategy that prioritizes frequent tokens in early steps and rare tokens in later refinements, improving efficiency and generalization. We also propose a lightweight corrector module that further stabilizes inference by detecting low-confidence predictions and re-masking them for refinement. Built on BigCodec and finetuned from Qwen2.5-0.5B, MAGE is reduced to 200M parameters through selective layer retention. Experiments on DNS Challenge and noisy LibriSpeech show that MAGE achieves state-of-the-art perceptual quality and significantly reduces word error rate for downstream recognition, outperforming larger baselines. Audio examples are available at https://hieugiaosu.github.io/MAGE/.
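
The scarcity-aware masking idea can be illustrated with a small sketch. This is only one possible reading of the abstract, not the authors' implementation: it assumes scarcity is measured as the negative log relative frequency of each token, and that for a given masking ratio the rarest positions are masked, so that frequent tokens are the first to serve as visible context. The names `MASK`, `scarcity_mask`, and `mask_ratio` are hypothetical.

```python
import math
from collections import Counter

MASK = -1  # hypothetical mask token id (assumption, not from the paper)


def scarcity_mask(tokens, counts, mask_ratio):
    """Mask the rarest ceil(mask_ratio * len(tokens)) positions.

    Assumption: scarcity = -log(relative frequency), so rarer tokens
    score higher and are masked first.
    """
    total = sum(counts.values())
    scarcity = [-math.log(counts[t] / total) for t in tokens]
    n_mask = math.ceil(mask_ratio * len(tokens))
    # Sort positions by descending scarcity; the sort is stable,
    # so ties keep their original left-to-right order.
    order = sorted(range(len(tokens)), key=lambda i: -scarcity[i])
    masked_positions = set(order[:n_mask])
    return [MASK if i in masked_positions else t for i, t in enumerate(tokens)]


tokens = [7, 7, 3, 7, 9, 3]
counts = Counter(tokens)  # toy frequencies taken from the sequence itself
print(scarcity_mask(tokens, counts, 0.5))
```

With a masking ratio of 0.5, the single occurrence of token 9 and both occurrences of token 3 are masked, while the most frequent token 7 stays visible. In practice the frequencies would presumably come from the training corpus rather than from one sequence.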

Paper Structure

This paper contains 8 sections, 5 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: A comparison of DNS OVL scores on a real-world test set against model size for several methods.
  • Figure 2: Training pipeline and model design of MAGE. Target audio is first converted into a sequence of tokens using a Neural Encodec. These tokens are then masked according to their distribution, following the coarse-to-fine masking strategy described in Sec. \ref{sec:coarse-to-fine}. In addition, speaker identity is extracted by a Band-Aware Encoder and a pretrained Speaker Encoder, enabling the model to capture acoustic characteristics. The model is optimized using a cross-entropy loss applied only to the masked tokens.
  • Figure 3: Ablation study on the number of inference steps for CTF and CTF + Corrector. Overall performance is measured using DNSMOS-OVL on the Real Recording DNS dataset.
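
The corrector step compared in Figure 3 is described in the abstract as detecting low-confidence predictions and re-masking them for refinement. The sketch below shows only that re-masking step, under stated assumptions: the paper's corrector is a lightweight learned module, whereas here the per-token confidences are simply passed in, and the `threshold` value, the `MASK` id, and the function name are all hypothetical.

```python
MASK = -1  # hypothetical mask token id (assumption, not from the paper)


def remask_low_confidence(tokens, confidences, threshold=0.5):
    """Return a copy of tokens with low-confidence positions re-masked.

    Positions whose confidence is strictly below `threshold` are replaced
    by MASK so that a later decoding step can predict them again.
    """
    if len(tokens) != len(confidences):
        raise ValueError("tokens and confidences must have the same length")
    return [MASK if c < threshold else t for t, c in zip(tokens, confidences)]


predicted = [5, 8, 2, 8]
confidence = [0.9, 0.3, 0.75, 0.5]  # toy values standing in for a corrector's output
print(remask_low_confidence(predicted, confidence))
```

Only the second position falls below the threshold in this toy input, so it alone is returned to the masked state for another refinement pass.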