APCodec: A Neural Audio Codec with Parallel Amplitude and Phase Spectrum Encoding and Decoding

Yang Ai, Xiao-Hang Jiang, Ye-Xin Lu, Hui-Peng Du, Zhen-Hua Ling

TL;DR

APCodec is a novel neural audio codec targeting high waveform sampling rates and low bitrates. It integrates the strengths of parametric codecs and waveform codecs and incorporates a knowledge distillation training strategy to enhance the quality of decoded audio.

Abstract

This paper introduces APCodec, a novel neural audio codec targeting high waveform sampling rates and low bitrates, which integrates the strengths of parametric codecs and waveform codecs. Like parametric codecs, APCodec treats the amplitude and phase spectra as audio parametric features and encodes and decodes them in parallel. It is composed of an encoder and a decoder, both with a modified ConvNeXt v2 network as the backbone, connected by a quantizer based on the residual vector quantization (RVQ) mechanism. The encoder compresses the audio amplitude and phase spectra in parallel and merges them into a continuous latent code at a reduced temporal resolution, which the quantizer then quantizes. Finally, the decoder reconstructs the audio amplitude and phase spectra in parallel, and the decoded waveform is obtained by inverse short-time Fourier transform (ISTFT). To ensure the fidelity of the decoded audio, as waveform codecs do, spectral-level loss, quantization loss, and generative adversarial network (GAN) based loss are jointly employed for training APCodec. To support low-latency streamable inference, we employ feed-forward layers and causal deconvolutional layers in APCodec and incorporate a knowledge distillation training strategy to enhance the quality of decoded audio. Experimental results confirm that APCodec can encode 48 kHz audio at a bitrate of just 6 kbps with no significant degradation in the quality of the decoded audio. At the same bitrate, APCodec also demonstrates superior decoded audio quality and faster generation speed compared with well-known codecs such as Encodec, AudioDec, and DAC.
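
The residual vector quantization step and the bitrate it implies can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the hand-written toy codebooks, and the example configuration (150 frames per second, 4 quantizers, 1024 entries per codebook) are assumptions chosen only because that combination works out to 6 kbps; the abstract itself does not state these values.

```python
import math

def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2 distance)."""
    best_idx, best_dist = 0, float("inf")
    for idx, entry in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(entry, vec))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx

def rvq_encode(codebooks, vec):
    """Quantize vec stage by stage; each stage quantizes the previous stage's residual."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices

def rvq_decode(codebooks, indices):
    """Reconstruct the vector as the sum of the selected entries from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, indices):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

def bitrate_bps(frame_rate_hz, num_quantizers, codebook_size):
    """Bits per second when every frame sends one index per quantizer."""
    return frame_rate_hz * num_quantizers * math.log2(codebook_size)

# Toy two-stage codebooks (hypothetical values, two entries each).
toy_codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],      # coarse stage
    [[0.0, 0.0], [0.25, -0.25]],   # refinement stage
]
codes = rvq_encode(toy_codebooks, [1.2, 0.8])
approx = rvq_decode(toy_codebooks, codes)
```

In a trained codec the codebooks are learned and the vectors are the encoder's latent frames; the sketch only shows why adding quantizer stages refines the approximation while raising the bitrate linearly.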

Paper Structure (21 sections, 23 equations, 3 figures, 4 tables)

This paper contains 21 sections, 23 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Details of the model structure of the proposed APCodec. Here, Conv1D, DeConv1D, Concat, $\Phi$, STFT and ISTFT represent the 1D convolutional layer, 1D deconvolutional layer, concatenation, phase calculation formula, short-time Fourier transform and inverse short-time Fourier transform, respectively. For waveforms, the content after @ represents the sampling rate, while for spectra and codes, the content after @ represents the frame rate (taking a sampling rate of 48 kHz and a bitrate of 6 kbps as an example).
  • Figure 2: Details of the modified ConvNeXt v2 block. Here, Conv1D, GELU and GRN represent the 1D convolutional layer, Gaussian error linear unit and global response normalization, respectively.
  • Figure 3: Details of the training losses of the proposed APCodec. Here, VQ, Conv2D and LReLU represent the vector quantizer, 2D convolutional layer and leaky rectified linear unit, respectively. MSE, MAE, AW-IP, AW-GD and AW-IAF represent mean square error, mean absolute error, anti-wrapping instantaneous phase, anti-wrapping group delay and anti-wrapping instantaneous angular frequency, respectively. STFT and ISTFT represent the short-time Fourier transform and inverse short-time Fourier transform, respectively. The structure of the encoder, quantizer, and decoder is simplified.
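
The "anti-wrapping" idea behind the phase losses named in Figure 3 can be sketched as follows. The caption does not give the formula, so this assumes the commonly used form f(x) = |x - 2π·round(x / 2π)|, which measures how far a phase difference is from the nearest multiple of 2π; the function names and the flat-list input format are illustrative, not the paper's code.

```python
import math

TWO_PI = 2.0 * math.pi

def anti_wrap(x):
    """Distance from x to the nearest integer multiple of 2*pi (assumed form)."""
    return abs(x - TWO_PI * round(x / TWO_PI))

def aw_ip_loss(pred_phase, true_phase):
    """Mean anti-wrapped error between two equal-length lists of phase values (radians)."""
    total = sum(anti_wrap(p - t) for p, t in zip(pred_phase, true_phase))
    return total / len(pred_phase)

# Two phases close to +pi and -pi differ by almost 2*pi numerically,
# yet they are only 0.2 rad apart on the unit circle.
loss = aw_ip_loss([math.pi - 0.1], [-math.pi + 0.1])
```

Under the same assumption, the group delay and instantaneous angular frequency variants would apply this function to phase differences taken along the frequency axis and the time axis, respectively.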