Phase Repair for Time-Domain Convolutional Neural Networks in Music Super-Resolution

Yenan Zhang, Guilly Kolkman, Hiroshi Watanabe

TL;DR

This work is the first to demonstrate, via a subjective experiment, that the artifacts in TD-CNN outputs are caused by phase distortion. It proposes Time-Domain Phase Repair (TD-PR), which uses a neural vocoder pre-trained on wide-band data to repair the phase components in the waveform outputs of TD-CNNs.

Abstract

Audio Super-Resolution (SR) is an important topic as low-resolution recordings are ubiquitous in daily life. In this paper, we focus on the music SR task, which is challenging due to the wide frequency response and dynamic range of music. Many models are designed in the time domain to jointly process the magnitude and phase of audio signals. However, prior work shows that approaches using a Time-Domain Convolutional Neural Network (TD-CNN) tend to produce annoying artifacts in their waveform outputs, and the cause of these artifacts has yet to be identified. To the best of our knowledge, this work is the first to demonstrate, via a subjective experiment, that the artifacts in TD-CNNs are caused by phase distortion. We further propose Time-Domain Phase Repair (TD-PR), which uses a neural vocoder pre-trained on wide-band data to repair the phase components in the waveform outputs of TD-CNNs. Although the vocoder and the TD-CNNs are trained independently, the proposed TD-PR obtains a better mean opinion score, significantly improving the perceptual quality of the TD-CNN baselines. Since TD-PR repairs only the phase components of the waveforms, the improved perceptual quality in turn indicates that phase distortion is the cause of the annoying artifacts of TD-CNNs. Moreover, a single pre-trained vocoder can be applied directly to arbitrary TD-CNNs without additional adaptation. We therefore apply TD-PR to three TD-CNNs that differ in architecture and parameter count, and observe consistent improvements on all three baselines. Audio samples are available on the demo page.

Paper Structure (15 sections, 5 equations, 5 figures, 1 table)

Figures (5)

  • Figure 1: Overview of the proposed TD-PR: The TD-CNN is trained to perform super-resolution for various narrow-band inputs. The neural vocoder takes only the magnitude of the TD-CNN's output as input and re-synthesizes another waveform that contains repaired phase components. The distorted phase components in the TD-CNN's output are then replaced by those from the vocoder.
  • Figure 2: Results of the preliminary AB listening test: the TD-CNN w/ GT-phase was voted to have fewer artifacts in 95.38% of cases.
  • Figure 3: Results of MOS listening test: The box plot of the ratings across input, TD-CNN, TD-PR and GT. TD-PR is applied to three different TD-CNN baselines.
  • Figure 4: Visualization of a set of phase spectrograms: (a) low-resolution input; (b) ground truth; (c-1) SEANet; (c-2) SEANet w/ TD-PR (proposed); (d-1) AudioUNet; (d-2) AudioUNet w/ TD-PR (proposed); (e-1) Demucs; (e-2) Demucs w/ TD-PR (proposed).
  • Figure 5: Visualization of a set of magnitude spectrograms: (a) low-resolution input; (b) ground truth; (c-1) SEANet; (c-2) SEANet w/ TD-PR (proposed); (d-1) AudioUNet; (d-2) AudioUNet w/ TD-PR (proposed); (e-1) Demucs; (e-2) Demucs w/ TD-PR (proposed).
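
The phase-replacement step described in the Figure 1 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `replace_phase`, the use of SciPy's STFT, and the sample rate and window length are all assumptions made for illustration, and the neural vocoder itself is not modeled here. The sketch only shows how the magnitude of one waveform (standing in for the TD-CNN output) might be combined with the phase of another (standing in for the vocoder's re-synthesized waveform).

```python
import numpy as np
from scipy.signal import stft, istft


def replace_phase(mag_source, phase_source, fs=44100, nperseg=1024):
    """Combine the STFT magnitude of one waveform with the STFT phase of another.

    mag_source   : 1-D array, e.g. a TD-CNN output (hypothetical stand-in)
    phase_source : 1-D array, e.g. a vocoder output (hypothetical stand-in)
    fs, nperseg  : assumed analysis settings, not taken from the paper
    """
    # Trim both signals to a common length so their STFT frames line up.
    n = min(len(mag_source), len(phase_source))

    # SciPy defaults: Hann window with 50% overlap, which is invertible.
    _, _, spec_mag = stft(mag_source[:n], fs=fs, nperseg=nperseg)
    _, _, spec_phase = stft(phase_source[:n], fs=fs, nperseg=nperseg)

    # Keep the magnitude of the first signal and the phase of the second.
    combined = np.abs(spec_mag) * np.exp(1j * np.angle(spec_phase))

    # Back to the time domain; drop the padding added by the STFT.
    _, out = istft(combined, fs=fs, nperseg=nperseg)
    return out[:n]
```

Under the same assumptions, passing a ground-truth waveform as `phase_source` would give a signal analogous to the "TD-CNN w/ GT-phase" condition in Figure 2, while passing a vocoder output would correspond to the repair step in Figure 1.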