HILCodec: High-Fidelity and Lightweight Neural Audio Codec

Sunghwan Ahn; Beom Jun Woo; Min Hyun Han; Chanyeong Moon; Nam Soo Kim

TL;DR

HILCodec is a lightweight, real-time streaming neural audio codec that demonstrates state-of-the-art quality across various bitrates and audio types.

Abstract

The recent advancement of end-to-end neural audio codecs enables compressing audio at very low bitrates while reconstructing the output audio with high fidelity. Nonetheless, such improvements often come at the cost of increased model complexity. In this paper, we identify and address the problems of existing neural audio codecs. We show that the performance of the SEANet-based codec does not increase consistently as the network depth increases. We analyze the root cause of such a phenomenon and suggest a variance-constrained design. Also, we reveal various distortions in previous waveform domain discriminators and propose a novel distortion-free discriminator. The resulting model, HILCodec, is a real-time streaming audio codec that demonstrates state-of-the-art quality across various bitrates and audio types.
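The depth-related variance issue mentioned in the abstract can be illustrated with a toy experiment. The sketch below is not the SEANet architecture or the paper's variance-constrained block: it assumes a stack of residual blocks whose branch is a single random linear layer with variance-preserving initialization, and it uses a generic rescaling by $1/\sqrt{2}$ after each block as a stand-in for a variance constraint. The function name and all parameters are hypothetical.

```python
import numpy as np

def residual_stack_variance(depth, width=128, n=2048, constrained=False, seed=0):
    """Average channel variance after each residual block at initialization.

    Toy model (an assumption, not the paper's network): each block computes
    x + x @ W, where W is a random matrix scaled so the branch preserves variance.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, width))      # unit-variance input
    variances = [x.var(axis=0).mean()]
    for _ in range(depth):
        w = rng.standard_normal((width, width)) / np.sqrt(width)
        x = x + x @ w                        # skip path + residual branch
        if constrained:
            x = x / np.sqrt(2.0)             # generic rescaling, assumed for illustration
        variances.append(x.var(axis=0).mean())
    return variances

if __name__ == "__main__":
    plain = residual_stack_variance(8)
    scaled = residual_stack_variance(8, constrained=True)
    for d, (a, b) in enumerate(zip(plain, scaled)):
        print(f"block {d}: unconstrained={a:10.2f}  rescaled={b:6.2f}")
```

In this toy setting the skip path and the branch each contribute roughly the same variance, so the unconstrained signal variance approximately doubles per block, while the rescaled version stays close to its input variance.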

Paper Structure (44 sections, 18 equations, 7 figures, 5 tables)

Introduction
Related Works
Neural Vocoder
End-to-end Neural Audio Codec
Generator
Encoder-Decoder
Residual Vector Quantizer
$L_2$-normalization
Spectrogram Block
Variance-Constrained Design
Variance Explosion Problem of SEANet
Variance-Constrained Residual Block
Other Building Blocks
Input/Output Normalization
Zero-Initialized Residual Branch
...and 29 more sections

Figures (7)

Figure 1: HILCodec model architecture. Conv(C, K, S), DConv(C, K, S), and DConvT(C, K, S) represent a convolution, a depthwise convolution, and a depthwise transposed convolution respectively, each with output channels of C, a kernel size of K, and a stride of S. 1x1Conv(C) denotes a pointwise convolution with output channels of C.
Figure 2: Comparison of the average channel variance after initialization according to network depth. On the x-axis, each number represents the network depth of the encoder and the decoder, and the terms “In”, “Q”, and “Out” denote the input waveform, the output of the residual vector quantizer, and the output waveform, respectively. The logarithmic y-scale on the left corresponds to the baseline model, while the linear y-scale on the right corresponds to the variance-constrained model. The solid lines represent the residual blocks.
Figure 3: (a)-(c): Comparison of different discriminators. (d)-(e): Magnitude responses of the filters they use. (a) AvgPool denotes average pooling. (b) $z^{-n}$ denotes an $n$-sample delay. (a)-(c) $\downarrow N$ denotes downsampling by a factor of $N$, and $D(\theta)$ denotes a sub-discriminator with a parameter $\theta$. MSD uses distinct sub-discriminators for different hierarchies while MPD and MFBD use shared parameters across the sub-bands.
Figure 4: MUSHRA scores of various codecs with 95% confidence intervals. (a) Low bitrate, trained and evaluated on general audio. (b) High bitrate, trained and evaluated on general audio. (c) Low bitrate, trained and evaluated on clean speech.
Figure 5: MUSHRA scores for different audio types with 95% confidence intervals. (d) "w/o Arch" denotes HILCodec without $L_2$-normalization and spectrogram blocks, "w/o VCD" denotes HILCodec without the variance-constrained design, and "w/o VCD, w/ BN" denotes HILCodec without the variance-constrained design and with batch normalization layers.
...and 2 more figures
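The two input decompositions named in the Figure 3 caption, average pooling followed by downsampling and an $n$-sample delay followed by downsampling by $N$, can be sketched in a few lines. This is a simplified illustration, not the discriminators' actual preprocessing: it assumes non-overlapping pooling windows, drops trailing samples that do not fill a window, and ignores the boundary handling and ordering that a true delay-then-downsample chain would produce. Both function names are hypothetical.

```python
import numpy as np

def avgpool_downsample(x, factor):
    """Average over non-overlapping windows of length `factor` (one output per window)."""
    usable = len(x) - len(x) % factor
    return x[:usable].reshape(-1, factor).mean(axis=1)

def polyphase_split(x, factor):
    """Split x into `factor` sub-sequences; row n holds x[n], x[n + factor], ..."""
    usable = len(x) - len(x) % factor
    return x[:usable].reshape(-1, factor).T

if __name__ == "__main__":
    x = np.arange(8.0)
    print(avgpool_downsample(x, 2))   # one smoothed, lower-rate signal
    print(polyphase_split(x, 4))      # four sub-sequences, one per phase offset
```

The first function yields a single lower-rate signal per hierarchy level, while the second yields $N$ sub-sequences that together retain every input sample, which is what allows one sub-discriminator to be shared across them.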
