Visual-Aware Speech Recognition for Noisy Scenarios

Lakshmipathi Balaji, Karan Singla

TL;DR

The paper tackles transcription accuracy in noisy environments by leveraging environmental visual cues beyond lip movements. It introduces a scalable Visual-Aware Noisy Speech (VANS) data pipeline and a finetuning method that connects pretrained audio and visual encoders through multi-head attention, training to output both the transcript and a noise-label token via CTC loss. The key contributions are the scalable AVSR dataset creation and the visually aware finetuning approach, which together improve transcription accuracy and noise-label prediction, especially across varied SNRs. The findings demonstrate that aligning visual context with noise sources enhances AVSR performance and generalizes across datasets, offering practical benefits for robust speech understanding in real-world noisy scenarios.

Abstract

Humans can use visual cues, such as lip movements and visual scenes, to enhance auditory perception, particularly in noisy environments. However, current Automatic Speech Recognition (ASR) and Audio-Visual Speech Recognition (AVSR) models often struggle in noisy scenarios. To address this, we propose a model that improves transcription by correlating noise sources with visual cues. Unlike works that rely on lip motion and require the speaker to be visible, we exploit broader visual information from the environment. This allows our model to naturally filter speech from noise and improve transcription, much like humans do in noisy scenarios. Our method repurposes pretrained speech and visual encoders, linking them with multi-headed attention. This approach enables the transcription of speech and the prediction of noise labels in video inputs. We introduce a scalable pipeline for building audio-visual datasets in which visual cues correlate with noise in the audio. We show significant improvements over existing audio-only models in noisy scenarios. Results also highlight that visual cues play a vital role in improving transcription accuracy.
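The idea of linking two pretrained encoders with multi-headed attention can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the feature sizes, the random weights, the choice to concatenate audio and visual frames along the time axis, and the names `multi_head_self_attention`, `audio_feats`, and `visual_feats` are all assumptions made for illustration, with random arrays standing in for the encoder outputs.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(tokens, w_q, w_k, w_v, w_o, num_heads):
    """Scaled dot-product self-attention over `tokens` of shape (T, D).

    All weight matrices have shape (D, D); D must be divisible by num_heads.
    Returns the attended tokens (T, D) and attention weights (H, T, T).
    """
    t, d = tokens.shape
    head_dim = d // num_heads

    def split(x):
        # (T, D) -> (H, T, head_dim)
        return x.reshape(t, num_heads, head_dim).transpose(1, 0, 2)

    q, k, v = split(tokens @ w_q), split(tokens @ w_k), split(tokens @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)  # (H, T, T)
    weights = softmax(scores, axis=-1)
    context = weights @ v                                   # (H, T, head_dim)
    merged = context.transpose(1, 0, 2).reshape(t, d)       # (T, D)
    return merged @ w_o, weights

rng = np.random.default_rng(0)
d_model, num_heads = 16, 4

# Hypothetical stand-ins for pretrained encoder outputs.
audio_feats = rng.normal(size=(50, d_model))   # 50 speech frames
visual_feats = rng.normal(size=(8, d_model))   # 8 video frames

# Assumption: fuse by concatenating both sequences along time, so every
# speech frame can attend to every visual frame and vice versa.
fused_in = np.concatenate([audio_feats, visual_feats], axis=0)

w_q, w_k, w_v, w_o = (
    rng.normal(scale=d_model ** -0.5, size=(d_model, d_model)) for _ in range(4)
)
fused, attn = multi_head_self_attention(fused_in, w_q, w_k, w_v, w_o, num_heads)

# Visually informed speech frames that a decoder could consume.
audio_out = fused[: audio_feats.shape[0]]
```

In this sketch each speech frame's output mixes in information from the visual frames through the attention weights, which is the property a downstream decoder would rely on to predict both the transcript and a noise label.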

Paper Structure

This paper contains 17 sections, 2 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: A visualization of our architecture. Speech and Visual representations are first obtained from their respective encoders, then aligned and enhanced via a Transformer-based Multi-Head Self-Attention mechanism. The output is then decoded using a convolutional decoder for simultaneous transcript and noise label prediction.
  • Figure 2: Model performance comparison across SNR levels on the test set of the proposed VANS dataset, highlighting AV-UNI-SNR's robustness in lower-SNR environments.
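For readers unfamiliar with SNR-controlled evaluation, the following sketch shows the standard way to mix a noise clip into clean speech at a target signal-to-noise ratio. It is an illustration only: this page does not describe how the VANS pipeline performs its mixing, and the function name `mix_at_snr` and the looping of short noise clips are assumptions.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Mix `noise` into `speech` so the result has the requested SNR in dB.

    SNR is defined as 10 * log10(speech_power / noise_power).
    Returns the mixture and the scaled noise that was added.
    """
    speech = np.asarray(speech, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)

    # Assumption: loop the noise clip if it is shorter than the speech.
    if len(noise) < len(speech):
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    if noise_power == 0:
        raise ValueError("noise clip is silent; cannot set an SNR")

    # Choose the gain so that speech_power / (gain^2 * noise_power)
    # equals 10^(snr_db / 10).
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    scaled_noise = gain * noise
    return speech + scaled_noise, scaled_noise

# Usage with synthetic signals standing in for real audio.
rng = np.random.default_rng(1)
clean = rng.normal(size=16000)
babble = rng.normal(size=4000)
noisy, added = mix_at_snr(clean, babble, snr_db=5.0)
```

Lower `snr_db` values make the noise louder relative to the speech, which corresponds to the harder end of the SNR range in a comparison like Figure 2.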