RT-LA-VocE: Real-Time Low-SNR Audio-Visual Speech Enhancement

Honglie Chen, Rodrigo Mira, Stavros Petridis, Maja Pantic

TL;DR

RT-LA-VocE tackles real-time, low-SNR audio-visual speech enhancement (AVSE) by redesigning LA-VocE into a fully causal pipeline that operates on 40 ms input frames. It replaces the non-causal components with a causal video encoder, a causal 1D audio encoder, an Emformer for temporal modeling, and C-HiFi-GAN for waveform synthesis, achieving an end-to-end processing latency of 28.15 ms per frame. The approach yields state-of-the-art results among causal AVSE models on AVSpeech, remains competitive with non-causal LA-VocE in offline settings, and outperforms audio-only baselines across noise conditions. This enables low-delay live enhancement for streaming video applications and demonstrates that server-side deployment of real-time AVSE is practical.

Abstract

In this paper, we aim to generate clean speech frame by frame from a live video stream and a noisy audio stream without relying on future inputs. To this end, we propose RT-LA-VocE, which completely re-designs every component of LA-VocE, a state-of-the-art non-causal audio-visual speech enhancement model, to perform causal real-time inference with a 40 ms input frame. We do so by devising new visual and audio encoders that rely solely on past frames, replacing the Transformer encoder with the Emformer, and designing a new causal neural vocoder, C-HiFi-GAN. On the popular AVSpeech dataset, we show that our algorithm achieves state-of-the-art results in all real-time scenarios. More importantly, each component is carefully tuned to minimize the algorithmic latency to the theoretical minimum (40 ms) while maintaining a low end-to-end processing latency of 28.15 ms per frame, enabling real-time frame-by-frame enhancement with minimal delay.
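The two latency figures in the abstract can be turned into a rough real-time budget. The sketch below is only arithmetic on the quoted numbers (40 ms frame, 28.15 ms processing per frame). The function names are illustrative, and the worst-case delay formula assumes that the delay is simply the time spent buffering one frame plus the time spent processing it, which the abstract does not state explicitly.

```python
# Figures quoted in the abstract.
FRAME_MS = 40.0        # input frame length, also the stated algorithmic latency
PROCESSING_MS = 28.15  # reported end-to-end processing latency per frame


def real_time_factor(processing_ms: float, frame_ms: float) -> float:
    """Processing time divided by the duration of input it covers.

    A value below 1.0 means each frame is finished before the next one arrives.
    """
    return processing_ms / frame_ms


def headroom_ms(processing_ms: float, frame_ms: float) -> float:
    """Idle time left in each frame period after processing."""
    return frame_ms - processing_ms


def worst_case_delay_ms(processing_ms: float, frame_ms: float) -> float:
    """Assumption: delay = time to buffer one frame + time to process it."""
    return frame_ms + processing_ms


rtf = real_time_factor(PROCESSING_MS, FRAME_MS)
headroom = headroom_ms(PROCESSING_MS, FRAME_MS)
delay = worst_case_delay_ms(PROCESSING_MS, FRAME_MS)
print(f"real-time factor: {rtf:.5f}")
print(f"headroom per frame: {headroom:.2f} ms")
print(f"buffering + processing delay: {delay:.2f} ms")
```

Under these assumptions the real-time factor is roughly 0.70, so the model keeps up with the stream with a little under 12 ms to spare in each 40 ms period.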

Paper Structure

This paper contains 14 sections, 4 equations, 2 figures, and 4 tables.

Figures (2)

  • Figure 1: RT-LA-VocE's real-time audio-visual speech enhancement approach. The model processes 40 ms frames in real time.
  • Figure 2: Detailed overview of RT-LA-VocE's inference pipeline for each time-step $t$ (40 ms). RT-LA-VocE receives five video frames, which are passed through our ResNet-based visual encoder, and two raw audio frames, which are encoded via our causal 1D ResNet-18. The resulting features are concatenated channel-wise and fed into the Emformer, which models the temporal dynamics across the current and previous time-steps. This is followed by a linear layer that predicts four enhanced spectrogram frames. Finally, these frames are combined with past predictions and fed into C-HiFi-GAN, which generates the corresponding waveform.
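
The per-time-step data flow described in the Figure 2 caption can be sketched as a small streaming loop. This is a minimal illustration, not the authors' implementation: every function body is a placeholder that returns zeros, and the feature sizes, the spectrogram history length, and the number of output samples are hypothetical values chosen only so the sketch runs. Only the counts of five video frames, two audio frames, and four spectrogram frames per time-step come from the caption.

```python
from collections import deque

# Counts stated in the Figure 2 caption.
VIDEO_FRAMES_PER_STEP = 5
AUDIO_FRAMES_PER_STEP = 2
SPEC_FRAMES_PER_STEP = 4

# Hypothetical sizes, chosen only so the sketch runs.
VISUAL_DIM, AUDIO_DIM, N_MELS = 3, 2, 4
HISTORY_STEPS = 3      # time-steps of spectrogram frames kept for the vocoder
SAMPLES_PER_STEP = 8   # waveform samples emitted per time-step


def encode_video(frames):
    """Placeholder for the causal visual encoder."""
    assert len(frames) == VIDEO_FRAMES_PER_STEP
    return [0.0] * VISUAL_DIM


def encode_audio(frames):
    """Placeholder for the causal 1D audio encoder."""
    assert len(frames) == AUDIO_FRAMES_PER_STEP
    return [0.0] * AUDIO_DIM


class StreamingEnhancer:
    """Keeps only past state, so each step never looks at future input."""

    def __init__(self):
        # Stand-in for the temporal model's memory of earlier time-steps.
        self.temporal_state = deque(maxlen=HISTORY_STEPS)
        # Past predicted spectrogram frames, reused as vocoder context.
        self.spec_history = deque(maxlen=HISTORY_STEPS * SPEC_FRAMES_PER_STEP)

    def step(self, video_frames, audio_frames):
        # Channel-wise concatenation of visual and audio features.
        fused = encode_video(video_frames) + encode_audio(audio_frames)
        # Temporal modeling placeholder: current features join the stored past.
        self.temporal_state.append(fused)
        # Linear-layer placeholder: four spectrogram frames per time-step.
        new_spec = [[0.0] * N_MELS for _ in range(SPEC_FRAMES_PER_STEP)]
        # Combine the new frames with past predictions before vocoding.
        self.spec_history.extend(new_spec)
        vocoder_input = list(self.spec_history)
        # Vocoder placeholder: one waveform chunk per time-step.
        waveform = [0.0] * SAMPLES_PER_STEP
        return waveform, len(vocoder_input)


enhancer = StreamingEnhancer()
context_sizes = []
for t in range(5):
    video = [f"v{t}-{i}" for i in range(VIDEO_FRAMES_PER_STEP)]
    audio = [f"a{t}-{i}" for i in range(AUDIO_FRAMES_PER_STEP)]
    chunk, context = enhancer.step(video, audio)
    context_sizes.append(context)
print(context_sizes)
```

The bounded `deque` buffers are a design choice of this sketch: they make the causal constraint explicit, because the only inputs to a step are the current frames and a fixed amount of stored history.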