Back to Ear: Perceptually Driven High Fidelity Music Reconstruction

Kangdi Wang; Zhiyue Wu; Dinghao Zhou; Rui Lin; Junyu Dai; Tao Jiang

TL;DR

εar-VAE targets perceptually faithful music reconstruction by integrating a K-weighting perceptual filter and phase-derivative losses into a VAE framework, improving phase coherence and stereo spatial fidelity. The architecture combines a convolutional encoder–decoder with transformer-based bottlenecks and a Multi-Resolution STFT discriminator. Its spectral supervision strategy uses all four Mid/Side/Left/Right (MSLR) components for magnitude while supervising phase only on the Left/Right (LR) components. Two novel losses, a Correlation Loss and a Phase Loss, together with perceptual weighting yield state-of-the-art reconstruction among open-source models at 44.1 kHz, particularly for high-frequency harmonics and spatial cues. Evaluation on MuChin and in-house sets with MS-STFT, MS-Mel, SI-SDR, and the new ICPC/CCPC scores shows clear gains over EnCodec, DAC, AGC, and SAO. These results advance high-fidelity neural audio codecs and enable more controllable, realistic music generation.

Abstract

Variational Autoencoders (VAEs) are essential for large-scale audio tasks such as diffusion-based generation. However, existing open-source models often neglect auditory perception during training, which leads to weaknesses in phase accuracy and stereophonic spatial representation. To address these challenges, we propose εar-VAE, an open-source music signal reconstruction model that rethinks and optimizes the VAE training paradigm. Our contributions are threefold: (i) a K-weighting perceptual filter applied prior to loss calculation to align the objective with auditory perception; (ii) two novel phase losses: a Correlation Loss for stereo coherence, and a Phase Loss that uses the derivatives of phase (Instantaneous Frequency and Group Delay) for precision; (iii) a new spectral supervision paradigm in which magnitude is supervised by all four Mid/Side/Left/Right components, while phase is supervised only by the LR components. Experiments show that εar-VAE at 44.1 kHz substantially outperforms leading open-source models across diverse metrics, with particular strength in reconstructing high-frequency harmonics and spatial characteristics.
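The quantities named in contributions (ii) and (iii) can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the 1/2 scaling in the Mid/Side conversion, the STFT settings, the L1 aggregation, and every function name below are assumptions chosen for illustration. It shows only how Mid/Side signals, an inter-channel correlation, and the two phase derivatives (Instantaneous Frequency along time, Group Delay along frequency) might be computed.

```python
import numpy as np

def to_mid_side(left, right):
    """Mid/Side from Left/Right; the 1/2 scaling is one common convention (assumed)."""
    return 0.5 * (left + right), 0.5 * (left - right)

def stft(x, n_fft=1024, hop=256):
    """Minimal Hann-windowed STFT, shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)

def wrap(phase):
    """Wrap phase values into [-pi, pi)."""
    return (phase + np.pi) % (2 * np.pi) - np.pi

def phase_derivatives(spec):
    """Finite-difference phase derivatives of a complex spectrogram."""
    phase = np.angle(spec)
    inst_freq = wrap(np.diff(phase, axis=0))      # derivative along time
    group_delay = wrap(-np.diff(phase, axis=1))   # negative derivative along frequency
    return inst_freq, group_delay

def phase_derivative_loss(ref, est):
    """Illustrative L1 distance between wrapped phase derivatives (assumed form)."""
    if_ref, gd_ref = phase_derivatives(stft(ref))
    if_est, gd_est = phase_derivatives(stft(est))
    if_err = np.mean(np.abs(wrap(if_ref - if_est)))
    gd_err = np.mean(np.abs(wrap(gd_ref - gd_est)))
    return float(if_err + gd_err)

def channel_correlation(left, right, eps=1e-8):
    """Pearson correlation between two channels, in [-1, 1]."""
    l = left - left.mean()
    r = right - right.mean()
    return float(np.sum(l * r) / (np.sqrt(np.sum(l**2) * np.sum(r**2)) + eps))
```

In a training setup along the lines the abstract describes, a term like `phase_derivative_loss` would presumably be applied to the Left and Right channels only, while magnitude terms would also see the outputs of `to_mid_side`; the sketch leaves that wiring, and the K-weighting pre-filter, out.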
