Enhancing Automatic Speech Recognition Through Integrated Noise Detection Architecture
Karamvir Singh
TL;DR
The paper addresses the degradation of automatic speech recognition (ASR) performance in noisy environments by integrating a noise-detection pathway into wav2vec2, forming a dual-head architecture that jointly optimizes transcription (via CTC loss) and noise classification (via cross-entropy loss). It systematically evaluates architectural variants on speech and environmental audio datasets, demonstrating near-perfect noise detection with Configurations B–D while preserving or improving transcription accuracy relative to a strong baseline. Key findings are that explicit noise handling enables effective joint optimization, that trainable loss balancing offers limited but positive gains, and that fusion strategies provide only marginal benefits. The work advances practical robust ASR by embedding signal-quality awareness into the recognition model, with implications for multilingual and adaptive-inference applications in challenging acoustic conditions.
Abstract
This research presents a novel approach to enhancing automatic speech recognition systems by integrating noise detection directly into the recognition architecture. Building on the wav2vec2 framework, the proposed method adds a dedicated noise identification module that operates concurrently with speech transcription. Experimental validation on publicly available speech and environmental audio datasets demonstrates substantial improvements in transcription quality and noise discrimination. The enhanced system achieves lower word error rate and character error rate and higher noise detection accuracy than conventional architectures. Results indicate that jointly optimizing the transcription and noise classification objectives yields more reliable speech recognition in challenging acoustic conditions.
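
To make the joint objective concrete, the following is a minimal, framework-free sketch of how a CTC transcription loss and a cross-entropy noise-classification loss could be combined into one training signal. It is not the paper's implementation: the function names, the scalar `noise_weight`, and the simple weighted-sum form are assumptions made for illustration, and a real system would use a deep learning framework's CTC loss on wav2vec2 outputs rather than this pure-Python forward algorithm.

```python
import math

NEG_INF = float("-inf")


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) that tolerates -inf entries."""
    m = max(values)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(v - m) for v in values))


def ctc_loss(log_probs, target, blank=0):
    """Negative log-likelihood of `target` under CTC (forward algorithm).

    log_probs: list of frames, each a list of per-symbol log-probabilities.
    target:    list of symbol indices (without blanks).
    """
    # Interleave blanks: [a, b] -> [blank, a, blank, b, blank]
    ext = [blank]
    for label in target:
        ext += [label, blank]

    alpha = [NEG_INF] * len(ext)
    alpha[0] = log_probs[0][blank]
    if len(ext) > 1:
        alpha[1] = log_probs[0][ext[1]]

    for frame in log_probs[1:]:
        new_alpha = []
        for s, label in enumerate(ext):
            candidates = [alpha[s]]                 # stay on the same symbol
            if s >= 1:
                candidates.append(alpha[s - 1])     # advance by one position
            if s >= 2 and label != blank and label != ext[s - 2]:
                candidates.append(alpha[s - 2])     # skip the blank in between
            new_alpha.append(logsumexp(candidates) + frame[label])
        alpha = new_alpha

    # Valid alignments end on the last symbol or the trailing blank.
    return -logsumexp(alpha[-2:])


def cross_entropy(logits, label):
    """Cross-entropy of one example from unnormalized class logits."""
    return logsumexp(logits) - logits[label]


def joint_loss(ctc, ce, noise_weight=1.0):
    """Hypothetical weighted sum of the two head losses (assumed form)."""
    return ctc + noise_weight * ce


# Toy example: 2 frames, vocabulary {0: blank, 1: "a"}, uniform predictions.
frames = [[math.log(0.5), math.log(0.5)]] * 2
ctc = ctc_loss(frames, [1])          # 3 of the 4 alignments collapse to [1]
ce = cross_entropy([0.0, 0.0], 1)    # noise head: clean (0) vs. noisy (1)
total = joint_loss(ctc, ce, noise_weight=0.5)
```

In this sketch `noise_weight` is a fixed constant; a trainable balancing scheme, as mentioned in the TL;DR, would instead treat the weighting as a learned parameter, but its exact form is not specified in this section.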
