NoLACE: Improving Low-Complexity Speech Codec Enhancement Through Adaptive Temporal Shaping
Jan Büthe, Ahmed Mustafa, Jean-Marc Valin, Karim Helwani, Michael M. Goodwin
TL;DR
NoLACE tackles the challenge of enhancing low-bitrate speech codec output with a causal, low-complexity approach by introducing an adaptive temporal shaping module to the LACE framework. The method combines AdaShape with multi-stage adaptive convolutions to provide nonlinearity and higher temporal resolution, improving Opus performance at 6, 9, and 12 kb/s while preserving phase and remaining suitable for real-time devices. In extensive evaluations, NoLACE outperformed LACE in listening tests and maintained or improved ASR performance at low bitrates, with results approaching those of non-causal LPCNet resynthesis at higher bitrates. The approach offers a practical path to enhancing existing codecs with minimal decoding overhead and potential applicability to other codecs with pitch information and differentiable DSP blocks.
Abstract
Speech codec enhancement methods are designed to remove distortions added by speech codecs. While classical methods are very low in complexity and add zero delay, their effectiveness is rather limited. In contrast, DNN-based methods deliver higher quality, but they are typically high in complexity and/or require delay. The recently proposed Linear Adaptive Coding Enhancer (LACE) addresses this problem by combining DNNs with classical long-term/short-term postfiltering, resulting in a causal low-complexity model. A shortcoming of the LACE model, however, is that quality quickly saturates when the model size is scaled up. To mitigate this problem, we propose a novel adaptive temporal shaping module that adds high temporal resolution to the LACE model, resulting in the Non-Linear Adaptive Coding Enhancer (NoLACE). We adapt NoLACE to enhance the Opus codec and show that NoLACE significantly outperforms both the Opus baseline and an enlarged LACE model at 6, 9 and 12 kb/s. We also show that LACE and NoLACE are well-behaved when used with an ASR system.
