Efficient Evaluation of Quantization-Effects in Neural Codecs
Wolfgang Mack, Ahmed Mustafa, Rafał Łaganowski, Samer Hijazy
TL;DR
The paper introduces an efficient framework for evaluating quantization effects in neural codecs by using surrogate data with a fixed bit budget and a lightweight surrogate codec to mimic non-linearities. It formalizes encoder/quantizer/decoder interactions, analyzes the challenges of tracking gradients through the quantizer $\mathcal{Q}$, and surveys approaches such as the straight-through estimator (STE), soft quantization, and noise emulation. A Modified Straight-Through Estimator (mSTE) is proposed to stabilize training by tying quantization noise to the computational graph via a noise-scale factor, reducing the encoder growth and divergence observed with the standard STE. The framework enables rapid prototyping (training takes under 1 hour on a GPU using less than 400 MB of memory) and is validated against an internal audio codec and the descript-audio-codec, demonstrating improved stability and performance. This work accelerates the analysis of quantization in neural codecs and provides a practical path for future exploration of gradient estimators and architectural variations.
Abstract
Neural codecs, comprising an encoder, quantizer, and decoder, enable signal transmission at exceptionally low bitrates. Training these systems requires techniques such as the straight-through estimator, soft-to-hard annealing, or statistical quantizer emulation to allow a non-zero gradient across the quantizer. Evaluating the effect of quantization in neural codecs, such as the influence of gradient-passing techniques on the whole system, is often costly and time-consuming due to training demands and the lack of affordable and reliable metrics. This paper proposes an efficient evaluation framework for neural codecs that uses simulated data with a defined number of bits and low-complexity neural encoders/decoders to emulate the non-linear behavior of larger networks. Our system is highly efficient in terms of training time and computational and hardware requirements, allowing us to uncover distinct behaviors in neural codecs. Based on our findings, we propose a modification that stabilizes training with the straight-through estimator. We validate our findings against an internal neural audio codec and against the state-of-the-art descript-audio-codec.
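To illustrate why a gradient-passing technique is needed at all, the following is a minimal sketch of the standard straight-through estimator on a scalar toy codec. It is not the paper's surrogate codec or its modified estimator: the function name `ste_gradients`, the scalar weights `w` and `v`, the rounding quantizer, and the squared-error loss are all assumptions chosen for illustration. The backward pass is written out by hand so that the single place where the STE intervenes is visible.

```python
def ste_gradients(w, v, x):
    """Loss and STE gradients for a toy codec y = v * round(w * x).

    Hypothetical scalar example: w is the encoder weight, v the decoder
    weight, and rounding to the nearest integer plays the role of Q.
    """
    # Forward pass: encoder, quantizer, decoder.
    z = w * x              # encoder output (latent)
    q = float(round(z))    # quantizer output (Python rounds ties to even)
    y = v * q              # decoder output
    err = y - x
    loss = err ** 2        # squared reconstruction error

    # Backward pass, written out by hand.
    grad_y = 2.0 * err
    grad_v = grad_y * q    # decoder gradient uses the quantized latent
    grad_q = grad_y * v
    # The true derivative dq/dz is zero almost everywhere, which would
    # leave the encoder without a training signal. The STE replaces it
    # with 1, so the gradient passes through the quantizer unchanged.
    grad_z = grad_q * 1.0
    grad_w = grad_z * x
    return loss, grad_w, grad_v


loss, grad_w, grad_v = ste_gradients(w=1.0, v=1.0, x=2.4)
print(loss, grad_w, grad_v)
```

In an automatic-differentiation framework the same behavior is commonly obtained by computing the quantized latent as $z + \mathrm{sg}(\mathcal{Q}(z) - z)$, where $\mathrm{sg}$ denotes a stop-gradient operation; that notation is introduced here for illustration and is not taken from the text above.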
