From the perspective of perceptual speech quality: The robustness of frequency bands to noise

Junyi Fan, Donald S. Williamson

TL;DR

This study investigates how real-world noise affects perceptual speech quality across 32 frequency bands, using a MUSHRA-inspired subjective test to quantify band-level robustness. Stimuli are constructed by filtering speech and noise into bands spanning 100–7500 Hz and reconstructing signals from target-band and non-target-band combinations at multiple SNRs, yielding per-band robustness indices $B_{norm}(i,r)$. Results show that mid-frequency bands are generally less robust to noise in terms of perceptual quality, while low- and high-frequency bands tend to be more robust, with significant differences across SNRs and noise types. Among objective metrics, ESTOI aligns more closely with the subjective results than PESQ or STOI. The findings inform band-aware strategies for speech enhancement, with implications for telecommunications and hearing-impaired applications. They also highlight the limitations of current objective metrics in predicting perceptual speech quality under realistic noisy conditions, pointing to the need for better quality metrics and more efficient evaluation methods.

Abstract

Speech quality is one of the main foci of speech-related research, where it is frequently studied alongside speech intelligibility, another essential measure. However, whereas band-level perceptual speech intelligibility has been studied frequently, band-level speech quality has not been thoroughly analyzed. In this paper, a Multiple Stimuli with Hidden Reference and Anchor (MUSHRA) inspired approach is proposed to study the individual robustness of frequency bands to noise, with perceptual speech quality as the measure. Speech signals were filtered into thirty-two frequency bands and corrupted by real-world noise at different signal-to-noise ratios. Robustness-to-noise indices of the individual frequency bands were calculated from the human-rated perceptual quality scores assigned to the reconstructed noisy speech signals. Trends in the results suggest that the mid-frequency region is less robust to noise in terms of perceptual speech quality. These findings suggest that future research aiming to improve speech quality should pay more attention to the mid-frequency region of speech signals.
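
The band-level stimulus construction described above can be illustrated with a minimal sketch. It is not the authors' implementation: the logarithmic band spacing, the Butterworth filters, the filter order, and all function names (`band_edges`, `split_into_bands`, `scale_noise_to_snr`, `build_stimulus`) are assumptions made for illustration. Only the band count (32) and the frequency range (100–7500 Hz) come from the summary. As a further simplification, non-target bands are left clean here, which may differ from the combinations used in the paper.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt


def band_edges(n_bands=32, f_lo=100.0, f_hi=7500.0):
    """Return n_bands + 1 band edges in Hz.

    ASSUMPTION: logarithmic spacing; the paper's actual spacing may differ.
    """
    return np.geomspace(f_lo, f_hi, n_bands + 1)


def split_into_bands(x, fs, edges, order=2):
    """Filter x into adjacent bands; returns an array of shape (n_bands, len(x)).

    ASSUMPTION: zero-phase Butterworth band-pass filters of a hypothetical order.
    """
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(order, [lo, hi], btype="bandpass", fs=fs, output="sos")
        bands.append(sosfiltfilt(sos, x))
    return np.stack(bands)


def scale_noise_to_snr(speech_band, noise_band, snr_db):
    """Scale noise_band so the speech-to-noise power ratio equals snr_db."""
    p_speech = np.mean(speech_band ** 2)
    p_noise = np.mean(noise_band ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return gain * noise_band


def build_stimulus(speech, noise, fs, target, snr_db, n_bands=32):
    """Rebuild speech from its bands and add noise to the target band only.

    SIMPLIFICATION: non-target bands stay clean in this sketch.
    """
    edges = band_edges(n_bands)
    speech_bands = split_into_bands(speech, fs, edges)
    noise_bands = split_into_bands(noise, fs, edges)
    noisy_target = scale_noise_to_snr(
        speech_bands[target], noise_bands[target], snr_db
    )
    return speech_bands.sum(axis=0) + noisy_target
```

Calling `build_stimulus(speech, noise, fs=16000, target=10, snr_db=5.0)` on two equal-length arrays would return one reconstructed signal whose band at index 10 carries noise at 5 dB SNR. The sampling rate must exceed twice the upper band edge for the filters to be valid.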

Paper Structure

This paper contains 14 sections, 4 equations, 7 figures.

Figures (7)

  • Figure 1: A depiction of the frequency responses to extract bands 5 through 9 given a 90-second white-noise input signal.
  • Figure 2: Mean perceptual speech quality scores (circle markers) based on the listening test as a function of the SNR of the target bands in the preliminary experiment. Scores are calculated by averaging across all 17 band combinations. A second-order regression fit is shown as the dashed line. Boxplots are also shown to indicate the 25th, 50th (median), and 75th percentiles at each SNR condition.
  • Figure 3: Mean perceptual speech quality scores as a function of the center frequencies of the 32 bands at all 6 SNR conditions in the preliminary experiment. Scores are calculated based on the approach introduced in Section \ref{subsec:2:4}. The circle markers denote the actual scores. The smoothed curve is based on the B-spline interpolation of the actual scores with a degree of 2. The dotted lines from the bottom to the top each represent the 25th, 50th (median), and 75th percentile of the 32 scores, respectively.
  • Figure 4: Mean perceptual speech quality scores (circle markers) based on the listening test as a function of the SNR of the target bands in the primary experiment. Scores are calculated by averaging across all 32 band combinations.
  • Figure 5: Mean perceptual speech quality scores of the 32 bands at all 3 SNR conditions in the primary experiment.
  • ...and 2 more figures
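
The curve-fitting steps named in the captions for Figures 2 and 3 (a second-order regression over SNR, a degree-2 B-spline through the per-band scores, and the 25th/50th/75th percentile lines) can be sketched as follows. This is only an illustration of those standard operations using NumPy and SciPy. The function names are hypothetical, and the paper's plotting code and data are not reproduced here.

```python
import numpy as np
from scipy.interpolate import make_interp_spline


def quadratic_fit(snr_db, mean_scores):
    """Least-squares second-order polynomial fit; returns a callable np.poly1d."""
    return np.poly1d(np.polyfit(snr_db, mean_scores, deg=2))


def smooth_band_scores(center_freqs_hz, scores, n_points=200):
    """Degree-2 interpolating B-spline through the per-band scores.

    center_freqs_hz must be strictly increasing. Returns a dense frequency
    grid and the spline evaluated on it.
    """
    spline = make_interp_spline(center_freqs_hz, scores, k=2)
    grid = np.linspace(center_freqs_hz[0], center_freqs_hz[-1], n_points)
    return grid, spline(grid)


def score_percentiles(scores):
    """25th, 50th (median), and 75th percentiles of the per-band scores."""
    return np.percentile(scores, [25, 50, 75])
```

Because the spline is interpolating rather than smoothing, the curve passes through every measured score, so the circle markers in such a plot would lie exactly on it.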