DDD: A Perceptually Superior Low-Response-Time DNN-based Declipper

Jayeon Yi, Junghyun Koo, Kyogu Lee

TL;DR

Clipping distorts audio and degrades real-time speech processing. The authors present DDD, a Demucs-based generator trained with adversarial discriminators to improve perceptual quality while keeping latency low enough for real-time speech declipping. In subjective tests, DDD outperforms A-SPADE and T-UNet, and qualitative analysis shows better reconstruction of high frequencies and of the rugged waveform contours of natural speech. In streaming simulations, DDD achieves sub-decisecond mean response times (under 100 ms); the low latency is attributed to limited lookahead, and the discriminators are used only in training, so they add no inference cost. Exact declipping, however, remains an open problem.

Abstract

Clipping is a common nonlinear distortion that occurs whenever the input or output of an audio system exceeds the supported range. This phenomenon undermines not only the perception of speech quality but also downstream processes utilizing the disrupted signal. Therefore, a real-time-capable, robust, and low-response-time method for speech declipping (SD) is desired. In this work, we introduce DDD (Demucs-Discriminator-Declipper), a real-time-capable speech-declipping deep neural network (DNN) that requires less response time by design. We first observe that a previously untested real-time-capable DNN model, Demucs, exhibits reasonable declipping performance. Then we utilize adversarial learning objectives to increase the perceptual quality of output speech without additional inference overhead. Subjective evaluations on harshly clipped speech show that DDD outperforms the baselines by a wide margin in terms of speech quality. We perform detailed waveform and spectral analyses to gain insight into the output behavior of DDD in comparison to the baselines. Finally, our streaming simulations also show that DDD is capable of sub-decisecond mean response times, outperforming the state-of-the-art DNN approach by a factor of six.
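
To make the distortion concrete, the following is a minimal sketch of hard clipping and of choosing a clipping threshold that produces a target SNR, such as the 1 dB condition named in the test sets. It is not taken from the paper: the function names (`hard_clip`, `snr_db`, `threshold_for_snr`), the SNR definition (clean signal power over the power of the clipping error), the bisection search, and the sine-wave test signal are all assumptions made for illustration.

```python
import numpy as np

def hard_clip(x, threshold):
    """Clamp every sample to the range [-threshold, threshold]."""
    return np.clip(x, -threshold, threshold)

def snr_db(clean, degraded):
    """SNR in dB, treating (clean - degraded) as the noise (assumed definition)."""
    noise_power = np.sum((clean - degraded) ** 2)
    if noise_power == 0.0:
        return float("inf")  # nothing was clipped
    return 10.0 * np.log10(np.sum(clean ** 2) / noise_power)

def threshold_for_snr(clean, target_db, iters=60):
    """Bisect for the clipping threshold whose output has `target_db` SNR.

    SNR grows monotonically with the threshold (0 dB at threshold 0,
    infinite at the signal peak), so bisection applies for targets > 0 dB.
    """
    lo, hi = 0.0, float(np.max(np.abs(clean)))
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if snr_db(clean, hard_clip(clean, mid)) < target_db:
            lo = mid  # clipping too harsh: raise the threshold
        else:
            hi = mid  # clipping too mild: lower the threshold
    return 0.5 * (lo + hi)

# Hypothetical stand-in for speech: one second of a 220 Hz sine at 16 kHz.
t = np.arange(16000) / 16000.0
clean = 0.5 * np.sin(2.0 * np.pi * 220.0 * t)

th = threshold_for_snr(clean, target_db=1.0)
clipped = hard_clip(clean, th)
achieved = snr_db(clean, clipped)
```

At a 1 dB target the threshold sits far below the signal peak, so most of the waveform is flattened, which is what "harshly clipped" means here.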

Paper Structure (14 sections, 4 equations, 3 figures, 1 table)

Figures (3)

  • Figure 1: The architecture used to train DDD. During training, the output of our lightweight generator (green) and the original signal (black) are fed to the discriminators, which provide an adversarial training objective that enhances the perceptual quality of the restored speech. The discriminators are dropped at inference and incur no overhead.
  • Figure 2: Violin plots of subjective evaluation results with VBDM-1dB-Testset (left) and DNS-1dB-Testset (right).
  • Figure 3: Reconstructions of a hard-clipped signal (SNR = 1 dB). The outputs of DDD, T-UNet, and DD are shown alongside the original clean speech. DDD-declipped speech shows elements of natural speech that neither the baseline T-UNet nor DD exhibits. The DD- and baseline-declipped waveforms typically fail to recreate (a) the "spiky" contours of clean speech and (b) some higher-order formants.
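
The training setup in Figure 1 can be illustrated with a small sketch of adversarial loss terms. This summary does not state which adversarial loss the paper uses, so the least-squares GAN form below is an assumption, as are the function names and the toy score arrays. Each function takes a list with one array of discriminator scores per discriminator.

```python
import numpy as np

def discriminator_loss(scores_real, scores_fake):
    """Least-squares GAN loss, summed over discriminators (assumed form).

    Each discriminator is pushed to score clean speech as 1
    and generator output as 0.
    """
    return sum(
        np.mean((real - 1.0) ** 2) + np.mean(fake ** 2)
        for real, fake in zip(scores_real, scores_fake)
    )

def generator_adv_loss(scores_fake):
    """Adversarial term for the generator: push its output toward a score of 1."""
    return sum(np.mean((fake - 1.0) ** 2) for fake in scores_fake)

# Toy scores from two hypothetical discriminators that separate
# real from generated audio perfectly.
scores_real = [np.array([1.0, 1.0]), np.array([1.0])]
scores_fake = [np.array([0.0, 0.0]), np.array([0.0])]

d_loss = discriminator_loss(scores_real, scores_fake)
g_loss = generator_adv_loss(scores_fake)
```

Both terms are needed only to compute gradients during training. At inference only the generator runs, which is why the discriminators add no overhead.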