NoLACE: Improving Low-Complexity Speech Codec Enhancement Through Adaptive Temporal Shaping

Jan Büthe, Ahmed Mustafa, Jean-Marc Valin, Karim Helwani, Michael M. Goodwin

TL;DR

NoLACE tackles the challenge of enhancing low-bitrate speech codec output with a causal, low-complexity approach by introducing an adaptive temporal shaping module to the LACE framework. The method combines AdaShape with multi-stage adaptive convolutions to provide nonlinearity and higher temporal resolution, improving Opus performance at 6, 9, and 12 kb/s while preserving phase and remaining suitable for real-time devices. In extensive evaluations, NoLACE outperformed LACE in listening tests and maintained or improved ASR performance at low bitrates, with results approaching those of non-causal LPCNet resynthesis at higher bitrates. The approach offers a practical path to enhancing existing codecs with minimal decoding overhead and potential applicability to other codecs with pitch information and differentiable DSP blocks.

Abstract

Speech codec enhancement methods are designed to remove distortions added by speech codecs. While classical methods are very low in complexity and add zero delay, their effectiveness is rather limited. In contrast, DNN-based methods deliver higher quality, but they are typically high in complexity and/or require delay. The recently proposed Linear Adaptive Coding Enhancer (LACE) addresses this problem by combining DNNs with classical long-term/short-term postfiltering, resulting in a causal low-complexity model. A shortcoming of the LACE model is, however, that quality quickly saturates when the model size is scaled up. To mitigate this problem, we propose a novel adaptive temporal shaping module that adds high temporal resolution to the LACE model, resulting in the Non-Linear Adaptive Coding Enhancer (NoLACE). We adapt NoLACE to enhance the Opus codec and show that NoLACE significantly outperforms both the Opus baseline and an enlarged LACE model at 6, 9 and 12 kb/s. We also show that LACE and NoLACE are well-behaved when used with an ASR system.
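
As a rough illustration of the classical long-term postfiltering mentioned in the abstract, the sketch below applies a fixed feed-forward comb filter at a given pitch lag. It is not the LACE or NoLACE implementation: the function name `comb_postfilter` and its parameters `lag` and `gain` are assumptions made for this example, and in LACE the filter coefficients are predicted by a DNN rather than set by hand.

```python
def comb_postfilter(x, lag, gain):
    """Feed-forward comb filter: y[n] = (x[n] + gain * x[n - lag]) / (1 + gain).

    Hypothetical helper for illustration only. `lag` is the pitch period in
    samples and `gain` >= 0 controls how strongly the periodic part is
    emphasized. Only past samples are used, so the filter is causal.
    """
    if lag <= 0 or gain < 0:
        raise ValueError("lag must be positive and gain non-negative")
    norm = 1.0 + gain  # keeps components with period `lag` at unit gain
    y = []
    for n, sample in enumerate(x):
        past = x[n - lag] if n >= lag else 0.0  # zero history before the start
        y.append((sample + gain * past) / norm)
    return y


# A signal that repeats every 4 samples passes unchanged after the first period.
periodic = [0.0, 1.0, 0.0, -1.0] * 4
filtered = comb_postfilter(periodic, lag=4, gain=0.5)
```

Components that repeat at the pitch lag are preserved, while components that change sign from one period to the next are attenuated, which is how a comb filter of this kind suppresses noise between pitch harmonics.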

Paper Structure

This paper contains 15 sections, 8 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: High-level overview of the NoLACE model. The feature encoder, which transforms the input features into a latent feature vector $\varphi_n^{(1)}$, is depicted on the left, and the number of channels is indicated with $\#c=$. The signal processing unit on the right first applies a series of two comb-filtering operations and a spectral shaping operation before entering a select-shape-mix iteration involving the proposed adaptive temporal shaping modules (AdaShape).
  • Figure 2: Adaptive temporal shaping module. Shapes are given in channels-last format, $N$ denotes the frame size, and $\mu$ denotes the frame-wise mean value.
  • Figure 3: P.808 results. The clean signal has a MOS of $4.06\pm 0.025$. LACE consistently outperforms the baseline, and NoLACE consistently outperforms LACE at all bitrates. At 6 kb/s, NoLACE achieves $92 \%$ of the MOS improvement of the LPCNet resynthesis method, which requires 25 ms delay and 5x the complexity of NoLACE.
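
The temporal shaping idea behind Figure 2 can be sketched as frame-wise processing that multiplies each sample by a positive gain. The code below is a minimal sketch under stated assumptions, not the AdaShape module itself: the pooling size, the log-envelope feature, and the names `frame_log_envelope`, `shape_signal`, and `flatten_predictor` are all hypothetical, and `flatten_predictor` merely stands in for the learned network that would predict the gains in NoLACE.

```python
import math


def frame_log_envelope(frame, pool=4, eps=1e-6):
    """Mean-removed log envelope of one frame, one value per `pool` samples.

    Assumes len(frame) is a multiple of `pool`.
    """
    pooled = [
        sum(abs(s) for s in frame[i:i + pool]) / pool
        for i in range(0, len(frame), pool)
    ]
    log_env = [math.log(p + eps) for p in pooled]
    mu = sum(log_env) / len(log_env)  # frame-wise mean
    return [v - mu for v in log_env]


def shape_signal(x, frame_size, predict_log_gain, pool=4):
    """Scale every sample by exp(log_gain), frame by frame, without lookahead.

    Trailing samples that do not fill a whole frame are dropped.
    """
    y = []
    for start in range(0, len(x) - frame_size + 1, frame_size):
        frame = x[start:start + frame_size]
        env = frame_log_envelope(frame, pool)
        log_gain = predict_log_gain(env, frame_size)  # one value per sample
        y.extend(s * math.exp(g) for s, g in zip(frame, log_gain))
    return y


def flatten_predictor(env, frame_size, strength=0.5):
    """Hypothetical stand-in for a learned predictor: partly flattens the envelope."""
    pool = frame_size // len(env)
    return [-strength * e for e in env for _ in range(pool)]


# Each frame has a loud first half and a quiet second half (10:1 amplitude ratio).
x = ([1.0] * 40 + [0.1] * 40) * 2
y = shape_signal(x, frame_size=80, predict_log_gain=flatten_predictor)
```

Because the gains are applied per sample through an exponential, they are always positive, so the operation changes the temporal envelope without flipping the sign of any sample. In this toy example the stand-in predictor with `strength=0.5` reduces the loud-to-quiet amplitude ratio within each frame from 10 to roughly the square root of 10.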