UBGAN: Enhancing Coded Speech with Blind and Guided Bandwidth Extension

Kishan Gupta, Srikanth Korse, Andreas Brendel, Nicola Pia, Guillaume Fuchs

TL;DR

UBGAN addresses the practical need to improve perceptual quality of coded speech by extending wideband signals to super-wideband in a modular, codec-agnostic manner. It introduces a subband PQMF-based GAN framework with two variants: blind-UBGAN (no side-information) and guided-UBGAN (0.2 kbps side-information from a learned encoder), enabling robust, low-latency bandwidth extension across conventional and neural codecs. The approach achieves higher objective and competitive subjective quality compared to baselines, while keeping complexity low enough for real-time deployment. This work demonstrates the value of modular BWE for adaptable speech coding and highlights the trade-offs between guided information and core-codec quality, suggesting avenues for broadening to general audio applications.

Abstract

In practical applications of speech codecs, factors such as the quality of the radio connection, hardware limitations, and the required user experience necessitate trade-offs between achievable perceptual quality, bitrate, and computational complexity. Most conventional and neural speech codecs operate on wideband (WB) speech signals to achieve this compromise. To further enhance the perceptual quality of coded speech, bandwidth extension (BWE) of the transmitted speech is an attractive and popular technique in conventional speech coding. In contrast, neural speech codecs are typically trained end-to-end for a specific set of requirements and are often not easily adaptable. In particular, they are typically trained to operate at a single fixed sampling rate. With the Universal Bandwidth Extension Generative Adversarial Network (UBGAN), we propose a modular and lightweight GAN-based solution that increases the operational flexibility of a wide range of conventional and neural codecs. Our model operates in the subband domain and extends the audio bandwidth of WB signals from 8 kHz to 16 kHz, resulting in super-wideband (SWB) signals. We further introduce two variants, guided-UBGAN and blind-UBGAN: the guided version transmits a quantized learned representation as side information at a very low bitrate in addition to the bitrate of the codec, while blind-UBGAN operates without such side information. Our subjective assessments demonstrate the advantage of UBGAN applied to WB codecs and highlight the generalization capacity of our proposed method across multiple codecs and bitrates.
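
To put the figures above in perspective, the short sketch below works through two pieces of arithmetic: the sampling rate implied by an audio bandwidth (the Nyquist relation) and the bit budget per frame that the 0.2 kbps side-information rate quoted in the TL;DR would leave. The frame durations of 10 ms and 20 ms are assumptions chosen for illustration only; the frame size actually used by guided-UBGAN is not stated here, and the function names are hypothetical.

```python
def min_sampling_rate_hz(bandwidth_hz: float) -> float:
    """Minimum sampling rate needed to represent a given audio bandwidth (Nyquist)."""
    return 2.0 * bandwidth_hz


def bits_per_frame(bitrate_bps: float, frame_ms: float) -> float:
    """Bits available in each frame at a given bitrate.

    frame_ms is an assumed frame duration, not a value taken from the paper.
    """
    return bitrate_bps * frame_ms / 1000.0


if __name__ == "__main__":
    # WB: 8 kHz of audio bandwidth; SWB: 16 kHz of audio bandwidth.
    print("WB sampling rate (Hz):", min_sampling_rate_hz(8_000))
    print("SWB sampling rate (Hz):", min_sampling_rate_hz(16_000))

    # 0.2 kbps = 200 bits per second of side information.
    for frame_ms in (10, 20):  # assumed frame durations
        print(f"{frame_ms} ms frames:", bits_per_frame(200, frame_ms), "bits per frame")
```

Under these assumed frame sizes the guided variant would have only a handful of bits per frame, which illustrates why the side information has to be a compact quantized representation rather than a description of the missing band.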

Paper Structure

This paper contains 14 sections, 3 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: High-level schematics of UBGAN. Dashed line represents residual connection from downsample blocks to upsample blocks.
  • Figure 2: Description of the downsample and upsample blocks of UBGAN.
  • Figure 3: P.808 DCR scores with 27 listeners for WB codecs with blind-UBGAN and guided-UBGAN.
  • Figure 4: P.808 DCR scores with 21 listeners for semi-SWB/SWB codecs with guided-UBGAN and the AP-BWE baseline.