Psychoacoustic Challenges Of Speech Enhancement On VoIP Platforms

Joseph Konan, Shikhar Agnihotri, Ojas Bhargave, Shuo Han, Yunyang Zeng, Ankit Shah, Bhiksha Raj

TL;DR

The paper addresses how modern VoIP platforms' sender-side denoising alters acoustic quality and intelligibility as measured by $PESQ$ and $STOI$. It introduces a novel application of Blinder-Oaxaca decomposition to disentangle endowment, coefficient, and interaction effects on these psychoacoustic metrics across platform $G$, receiver $C$, and denoising $D$ configurations, using $X_i$ features from openSMILE to model $Y_{PESQ}$ and $Y_{STOI}$. Leveraging the DNS 2020 dataset and the VoIP-DNS-Tiny extension, the study demonstrates substantial platform- and configuration-dependent deviations in perceptual quality and intelligibility, with cloud vs cellular contexts showing distinct patterns. The work provides a rigorous benchmarking and analytical framework for out-of-domain speech enhancement in VoIP, guiding future research toward more robust, context-aware designs.

Abstract

Within the ambit of VoIP (Voice over Internet Protocol) telecommunications, the complexities introduced by acoustic transformations merit rigorous analysis. This research, rooted in the exploration of proprietary sender-side denoising effects, meticulously evaluates platforms such as Google Meet and Zoom. The study draws upon the Deep Noise Suppression (DNS) 2020 dataset, ensuring a structured examination tailored to various denoising settings and receiver interfaces. A methodological novelty is introduced via Blinder-Oaxaca decomposition, traditionally an econometric tool, repurposed herein to analyze acoustic-phonetic perturbations within VoIP systems. To further ground the implications of these transformations, psychoacoustic metrics, specifically PESQ and STOI, were used to explain perceptual quality and intelligibility. Cumulatively, the insights garnered underscore the intricate landscape of VoIP-influenced acoustic dynamics. In addition to the primary findings, a multitude of metrics are reported, extending the research purview. Moreover, out-of-domain benchmarking for both time and time-frequency domain speech enhancement models is included, thereby enhancing the depth and applicability of this inquiry.
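
To make the endowment, coefficient, and interaction terms concrete, the following is a minimal sketch of the standard threefold Blinder-Oaxaca decomposition, not the authors' pipeline. It assumes two hypothetical groups A and B (for example, two platform configurations), a per-group ordinary least squares fit with an intercept, and group B as the reference; the feature matrices and scores are synthetic stand-ins for openSMILE features and PESQ or STOI values. Under these assumptions the gap in mean score splits as $\Delta \bar{Y} = (\bar{X}_A - \bar{X}_B)^\top \beta_B + \bar{X}_B^\top (\beta_A - \beta_B) + (\bar{X}_A - \bar{X}_B)^\top (\beta_A - \beta_B)$.

```python
import numpy as np


def fit_ols(X, y):
    """Fit OLS with an intercept; return (mean design row, coefficients)."""
    Xd = np.column_stack([np.ones(len(X)), X])  # prepend intercept column
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return Xd.mean(axis=0), beta


def threefold_oaxaca(X_a, y_a, X_b, y_b):
    """Threefold decomposition of mean(y_a) - mean(y_b), group B as reference."""
    xbar_a, beta_a = fit_ols(X_a, y_a)
    xbar_b, beta_b = fit_ols(X_b, y_b)
    d_x = xbar_a - xbar_b        # difference in mean features
    d_beta = beta_a - beta_b     # difference in fitted coefficients
    return {
        "gap": float(y_a.mean() - y_b.mean()),
        "endowment": float(d_x @ beta_b),      # explained by feature levels
        "coefficient": float(xbar_b @ d_beta),  # explained by feature returns
        "interaction": float(d_x @ d_beta),     # joint contribution
    }


if __name__ == "__main__":
    # Synthetic data only: hypothetical features and scores for two groups.
    rng = np.random.default_rng(0)
    X_a = rng.normal(0.5, 1.0, size=(200, 4))
    X_b = rng.normal(0.0, 1.0, size=(200, 4))
    y_a = 2.0 + X_a @ np.array([0.3, -0.1, 0.2, 0.0]) + rng.normal(0, 0.1, 200)
    y_b = 1.5 + X_b @ np.array([0.1, -0.2, 0.2, 0.1]) + rng.normal(0, 0.1, 200)
    for name, value in threefold_oaxaca(X_a, y_a, X_b, y_b).items():
        print(f"{name}: {value:.4f}")
```

Because each regression includes an intercept, the fitted line passes through the group means, so the three terms sum to the observed gap up to floating-point error. Choosing group A as the reference instead would shift weight between the endowment and coefficient terms, which is why the reference group should be stated alongside any reported decomposition.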

Paper Structure (8 sections, 7 equations, 6 tables)