When De-noising Hurts: A Systematic Study of Speech Enhancement Effects on Modern Medical ASR Systems

Sujal Chondhekar, Vasanth Murukuri, Rushabh Vasani, Sanika Goyal, Rajshree Badami, Anushree Rana, Sanjana SN, Karthik Pandia, Sulabh Katiyar, Neha Jagadeesh, Sankalp Gulati

TL;DR

This work tests the common belief that speech enhancement uniformly improves ASR in noisy clinical settings. By systematically evaluating MetricGAN+-voicebank denoising across four modern ASR systems, using 500 English medical recordings under nine noise conditions and semWER as the metric, the authors find that denoising consistently degrades performance, with Gaussian noise causing the largest drops and several cases of dramatic failure. The study highlights that modern end-to-end ASR models already exhibit substantial noise robustness and may rely on acoustic cues that aggressive enhancement degrades. Practically, the findings advise against denoising by default in medical transcription pipelines and motivate ASR-aware enhancement or joint optimization approaches tailored to robust recognition in real-world clinical environments.

Abstract

Speech enhancement methods are commonly believed to improve the performance of automatic speech recognition (ASR) in noisy environments. However, the effectiveness of these techniques cannot be taken for granted for modern large-scale ASR models trained on diverse, noisy data. We present a systematic evaluation of MetricGAN-plus-voicebank denoising on four state-of-the-art ASR systems (OpenAI Whisper, NVIDIA Parakeet, Google Gemini Flash 2.0, and Parrotlet-a) using 500 medical speech recordings under nine noise conditions. ASR performance is measured using semantic WER (semWER), a normalized word error rate (WER) metric that accounts for domain-specific normalizations. Our results reveal a counterintuitive finding: speech enhancement preprocessing degrades ASR performance across all noise conditions and models. Original noisy audio achieves lower semWER than enhanced audio in all 40 tested configurations (4 models × 10 conditions), with absolute semWER increases ranging from 1.1% to 46.6%. These findings suggest that modern ASR models possess sufficient internal noise robustness and that traditional speech enhancement may remove acoustic features critical for ASR. For practitioners deploying medical scribe systems in noisy clinical environments, our results indicate that preprocessing audio with noise reduction techniques might be not only computationally wasteful but also harmful to transcription accuracy.
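To make the metric concrete, here is a minimal sketch of a normalized WER and of the before/after-enhancement difference the abstract reports. It is an illustration under stated assumptions, not the authors' semWER implementation: the `EQUIVALENTS` table, the tokenization rule, and the function names `normalize`, `normalized_wer`, and `delta_wer` are all hypothetical, and the paper's actual domain-specific normalizations are not given in this summary.

```python
import re

# Hypothetical equivalence table: the paper's real semWER normalizations
# are not listed here, so these entries are illustrative assumptions.
EQUIVALENTS = {"milligrams": "mg", "milligram": "mg", "mgs": "mg"}


def normalize(text):
    """Lowercase, drop punctuation, and map assumed-equivalent tokens to one form."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [EQUIVALENTS.get(tok, tok) for tok in tokens]


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_wer(reference, hypothesis):
    """WER in percent, computed after normalizing both transcripts."""
    ref_tokens = normalize(reference)
    if not ref_tokens:
        raise ValueError("reference transcript has no tokens")
    return 100.0 * edit_distance(ref_tokens, normalize(hypothesis)) / len(ref_tokens)


def delta_wer(reference, hyp_noisy, hyp_enhanced):
    """Absolute change in WER after enhancement; positive means enhancement hurt."""
    return normalized_wer(reference, hyp_enhanced) - normalized_wer(reference, hyp_noisy)


reference = "Start metformin 500 mg twice daily."
hyp_noisy = "start Metformin 500 milligrams twice daily"
hyp_enhanced = "start metformin 500 twice daily"
print(round(delta_wer(reference, hyp_noisy, hyp_enhanced), 2))
```

In this toy case the transcript from the noisy audio differs from the reference only in casing, punctuation, and a unit spelling, so it scores 0% after normalization, while the transcript from the enhanced audio drops one of six reference words. A positive difference of this kind is what the abstract reports in every tested configuration.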

Paper Structure

This paper contains 20 sections, 2 equations, 8 figures, 1 table.

Figures (8)

  • Figure 1: semWER (%) (lower is better) performance of all evaluated ASR models under various noisy conditions. Parrotlet-a achieves the lowest error rates across all conditions.
  • Figure 2: Performance of four ASR models in various conditions. We observe degradation in the performance of these models beyond certain SNR levels.
  • Figure 3: semWER (%) performance of all ASR models after enhancement with the SpeechBrain denoiser.
  • Figure 4: Change in Semantic Word Error Rate ($\Delta$semWER) after denoising across ASR models and noise conditions. We observe performance degradation due to denoising. Results demonstrate significant variation in denoising effectiveness across models, with Whisper showing the most substantial sensitivity to enhancement.
  • Figure 5: Whisper semWER (%) and $\Delta$semWER. Enhancement consistently increases semWER across all conditions, with extreme degradation under Gaussian noise.
  • ...and 3 more figures
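The captions above refer to SNR levels and to Gaussian noise. As a rough illustration of what such a condition means, the following sketch adds white Gaussian noise to a signal at a target SNR. This summary does not describe how the authors mixed noise, so the procedure, the function names `add_gaussian_noise` and `measured_snr_db`, and the synthetic sine-wave input are all assumptions made for illustration.

```python
import math
import random


def mean_power(samples):
    """Average squared amplitude of a sequence of samples."""
    if not samples:
        raise ValueError("signal is empty")
    return sum(s * s for s in samples) / len(samples)


def add_gaussian_noise(signal, snr_db, seed=0):
    """Return signal plus white Gaussian noise scaled to the requested SNR (dB).

    Assumed procedure (not taken from the paper): the noise variance is set to
    signal_power / 10^(snr_db / 10).
    """
    rng = random.Random(seed)
    noise_power = mean_power(signal) / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in signal]


def measured_snr_db(clean, noisy):
    """Empirical SNR in dB, given the clean signal and its noisy version."""
    noise = [n - c for c, n in zip(clean, noisy)]
    return 10 * math.log10(mean_power(clean) / mean_power(noise))


# Synthetic stand-in for speech: one second of a 440 Hz tone at 16 kHz.
sample_rate = 16000
clean = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(sample_rate)]
noisy = add_gaussian_noise(clean, snr_db=10)
print(round(measured_snr_db(clean, noisy), 1))
```

Lower `snr_db` values give noisier audio. The measured SNR is close to the target but not exactly equal to it, because the noise is a finite random sample.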