Mel-Refine: A Plug-and-Play Approach to Refine Mel-Spectrogram in Audio Generation

Hongming Guo; Ruibo Fu; Yizhong Geng; Shuai Liu; Shuchen Shi; Tao Wang; Chunyu Qiang; Chenxing Li; Ya Li; Zhengqi Wen; Yukun Liu; Xuefei Liu

TL;DR

Problem: Mel-spectrogram-based TTA models often struggle to produce audio with rich texture and detail. Approach: analyze the roles of the U-Net components and propose Mel-Refine, a training-free, inference-time adjustment that uses Fourier-domain amplification and structure-aware backbone scaling to enhance texture. Contributions: (i) an analysis of how high- and low-frequency U-Net components affect texture and denoising; (ii) a plug-and-play Mel-Refine method applicable to diffusion-based TTA models, validated on Tango, Tango2, and MusTango using AudioCaps and MusicBench, with objective and subjective gains. Significance: practical, model-agnostic improvements in audio quality for diffusion-based TTA systems, with no additional training.

Abstract

Text-to-audio (TTA) models can generate diverse audio from textual prompts. However, most mainstream TTA models, which predominantly rely on Mel-spectrograms, still struggle to produce audio with rich content. The intricate detail and texture that such audio requires in the Mel-spectrogram often exceed the models' capacity, leading to outputs that are blurred or lack coherence. In this paper, we first investigate the critical role of the U-Net in Mel-spectrogram generation. Our analysis shows that, within the U-Net structure, high-frequency components in the skip connections and the backbone influence texture and detail, while low-frequency components in the backbone are critical for the diffusion denoising process. We then propose "Mel-Refine", a plug-and-play approach that enhances Mel-spectrogram texture and detail by adjusting the weights of these components during inference. Our method requires no additional training or fine-tuning and is fully compatible with any diffusion-based TTA architecture. Experimental results show that our approach boosts the performance metrics of the latest TTA model, Tango2, by 25%, demonstrating its effectiveness.
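
To make the idea of adjusting component weights during inference more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a U-Net decoder stage that concatenates backbone and skip features along the channel axis, and it scales the high-frequency Fourier components of the skip features while rescaling the backbone features. The function names (`scale_high_frequencies`, `refine_features`) and the parameters `cutoff`, `skip_gain`, and `backbone_gain` are hypothetical and chosen only for illustration; the abstract does not specify the actual masks, scaling factors, or the layers they are applied to.

```python
import numpy as np


def scale_high_frequencies(feature, cutoff=0.25, gain=1.2):
    """Scale the high-frequency Fourier components of a feature map.

    feature: real array of shape (channels, height, width).
    cutoff:  hypothetical threshold in cycles per sample (0 to 0.5);
             components above it on either spatial axis count as "high".
    gain:    hypothetical multiplier applied to those components.
    """
    _, h, w = feature.shape
    spectrum = np.fft.fft2(feature, axes=(-2, -1))

    # Absolute frequencies along each spatial axis; the mask is symmetric
    # in +/- frequency, so the inverse transform stays real up to rounding.
    fy = np.abs(np.fft.fftfreq(h))[:, None]
    fx = np.abs(np.fft.fftfreq(w))[None, :]
    mask = np.where((fy > cutoff) | (fx > cutoff), gain, 1.0)

    return np.fft.ifft2(spectrum * mask, axes=(-2, -1)).real


def refine_features(backbone, skip, backbone_gain=1.1, skip_gain=1.2, cutoff=0.25):
    """Illustrative inference-time re-weighting of one decoder stage.

    Assumption: backbone features are rescaled uniformly, skip features get
    their high frequencies amplified, and both are then concatenated along
    the channel axis as in a standard U-Net decoder.
    """
    scaled_backbone = backbone * backbone_gain
    refined_skip = scale_high_frequencies(skip, cutoff=cutoff, gain=skip_gain)
    return np.concatenate([scaled_backbone, refined_skip], axis=0)
```

Because a step like this only re-weights existing activations, it could in principle be attached to a pretrained U-Net at inference time without touching the model weights, which is the sense in which the abstract describes the approach as plug-and-play.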
