Audio-Visual Speech Enhancement: Architectural Design and Deployment Strategies

Anis Hamadouche, Haifeng Luo, Mathini Sellathurai, Tharm Ratnarajah

TL;DR

This work evaluates audio-visual speech enhancement (AVSE) under real-world network conditions, comparing cloud, edge, and on-device deployments. The system uses a CNN for spectral features, lip-region visual cues, and an LSTM for temporal fusion, and is implemented in a 5G-enabled cloud pipeline. Cloud deployment offers the highest enhancement quality, while edge deployment provides a better balance between latency and intelligibility under 5G and Wi‑Fi 6. Compression and lighter models can enable real-time operation, but at the cost of speech quality. The study provides deployment guidelines for AVSE in applications such as assistive hearing, telepresence, and industrial communications, and suggests future work on adaptive chunking and edge collaboration.

Abstract

This paper introduces a new AI-based Audio-Visual Speech Enhancement (AVSE) system and presents a comparative performance analysis of different deployment architectures. The proposed AVSE system employs convolutional neural networks (CNNs) for spectral feature extraction and long short-term memory (LSTM) networks for temporal modeling, enabling robust speech enhancement through multimodal fusion of audio and visual cues. Multiple deployment scenarios are investigated, including cloud-based, edge-assisted, and standalone device implementations. Their performance is evaluated in terms of speech quality improvement, latency, and computational overhead. Real-world experiments are conducted across various network conditions, including Ethernet, Wi-Fi, 4G, and 5G, to analyze the trade-offs between processing delay, communication latency, and perceptual speech quality. The results show that while cloud deployment achieves the highest enhancement quality, edge-assisted architectures offer the best balance between latency and intelligibility, meeting real-time requirements under 5G and Wi-Fi 6 conditions. These findings provide practical guidelines for selecting and optimizing AVSE deployment architectures in diverse applications, including assistive hearing devices, telepresence, and industrial communications.
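
The trade-off between processing delay and communication latency described above can be illustrated with a minimal sketch. It assumes that the latency experienced by the listener is the sum of the chunk buffering time, the network round trip, and the model processing time; this decomposition, the `Deployment` class, and every number below are hypothetical placeholders for illustration, not the paper's model or measurements.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    name: str
    round_trip_ms: float   # network round trip per chunk (0 when on-device)
    processing_ms: float   # model inference time per chunk


def end_to_end_ms(d: Deployment, chunk_ms: float) -> float:
    # Assumed decomposition: wait for a full chunk, send it, process it.
    return chunk_ms + d.round_trip_ms + d.processing_ms


def meets_budget(d: Deployment, chunk_ms: float, budget_ms: float) -> bool:
    return end_to_end_ms(d, chunk_ms) <= budget_ms


# Hypothetical values, chosen only to show the shape of the comparison.
deployments = [
    Deployment("cloud", round_trip_ms=80.0, processing_ms=15.0),
    Deployment("edge", round_trip_ms=20.0, processing_ms=30.0),
    Deployment("device", round_trip_ms=0.0, processing_ms=120.0),
]

for d in deployments:
    total = end_to_end_ms(d, chunk_ms=40.0)
    ok = meets_budget(d, chunk_ms=40.0, budget_ms=100.0)
    print(f"{d.name}: {total:.0f} ms, within budget: {ok}")
```

Under these placeholder values only the edge configuration fits the budget, since the cloud path is dominated by the network round trip and the on-device path by processing time. Real figures depend on the network, the hardware, and the chunk size.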

Paper Structure

This paper contains 7 sections, 3 equations, 8 figures, 1 table.

Figures (8)

  • Figure 1: A block diagram of enabling real-world COG-MHEAR service on terminal devices.
  • Figure 2: A schematic describing the latency experienced by this service.
  • Figure 3: Network latency measured in real-world environments (round-trip latency for transferring $\sim 0.3$ MB of data).
  • Figure 4: Audio-video data size versus compression factors.
  • Figure 5: Algorithm processing latency versus input chunk size.
  • ...and 3 more figures