AUREXA-SE: Audio-Visual Unified Representation Exchange Architecture with Cross-Attention and Squeezeformer for Speech Enhancement

M. Sajid, Deepanshu Gupta, Yash Modi, Sanskriti Jain, Harshith Jai Surya Ganji, A. Rahaman, Harshvardhan Choudhary, Nasir Saleem, Amir Hussain, M. Tanveer

TL;DR

AUREXA-SE addresses robust audio-visual speech enhancement by integrating raw waveform audio with visual lip cues through bidirectional cross-attention, followed by Squeezeformer-based temporal modeling and a U-Net waveform decoder. The framework employs a U-Net–based 1D audio encoder and a Swin Transformer V2 visual encoder to jointly learn rich cross-modal representations that are refined over time. On AVSE-4, it achieves PESQ 1.325, STOI 0.514, and SI-SDR −4.312 dB after 20 epochs (~50 hours) of training with 54.2M parameters, outperforming noisy inputs and a baseline AVSE model. The approach demonstrates effective cross-modal fusion and efficient temporal processing, offering a practical end-to-end AVSE solution with publicly available code.
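
For readers unfamiliar with the SI-SDR figure quoted above, the following is a minimal sketch of the standard scale-invariant signal-to-distortion ratio, written with NumPy. It is not taken from the AUREXA-SE code base; the function name `si_sdr`, the `eps` stabiliser, and the synthetic test signals are assumptions made for illustration. It shows why the metric can be negative: the value drops below 0 dB whenever the residual energy exceeds the energy of the reference-aligned component.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB between two 1-D signals of equal length."""
    # Remove the mean so a constant offset does not count as distortion.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to find the optimal scaling.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    residual = estimate - target
    ratio = (np.dot(target, target) + eps) / (np.dot(residual, residual) + eps)
    return 10.0 * np.log10(ratio)


# Synthetic illustration only (hypothetical signals, not challenge data).
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
clean = np.sin(2 * np.pi * 220.0 * t)
noise = rng.standard_normal(clean.shape)

mild = si_sdr(clean + 0.1 * noise, clean)     # light corruption
heavy = si_sdr(clean + 10.0 * noise, clean)   # noise dominates the mixture
rescaled = si_sdr(3.0 * (clean + 0.1 * noise), clean)
```

In this sketch `heavy` comes out negative because the noise energy dominates, and `rescaled` matches `mild` because the projection step cancels any overall gain applied to the estimate.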

Abstract

In this paper, we propose AUREXA-SE (Audio-Visual Unified Representation Exchange Architecture with Cross-Attention and Squeezeformer for Speech Enhancement), a progressive bimodal framework tailored for audio-visual speech enhancement (AVSE). AUREXA-SE jointly leverages raw audio waveforms and visual cues by employing a U-Net-based 1D convolutional encoder for audio and a Swin Transformer V2 for efficient and expressive visual feature extraction. Central to the architecture is a novel bidirectional cross-attention mechanism, which facilitates deep contextual fusion between modalities, enabling rich and complementary representation learning. To capture temporal dependencies within the fused embeddings, a stack of lightweight Squeezeformer blocks combining convolutional and attention modules is introduced. The enhanced embeddings are then decoded via a U-Net-style decoder for direct waveform reconstruction, ensuring perceptually consistent and intelligible speech output. Experimental evaluations demonstrate the effectiveness of AUREXA-SE, achieving significant performance improvements over noisy baselines, with STOI of 0.516, PESQ of 1.323, and SI-SDR of -4.322 dB. The source code of AUREXA-SE is available at https://github.com/mtanveer1/AVSEC-4-Challenge-2025.
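
The bidirectional cross-attention idea described in the abstract can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it uses a single attention head, random projection matrices, a plain residual connection, and no normalisation layers, and every name in it (`cross_attention`, `bidirectional_cross_attention`, `params`, the sequence lengths and feature size) is hypothetical. The point is only to show the two directions of exchange: audio frames query the visual sequence, and visual frames query the audio sequence.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(query_seq, context_seq, w_q, w_k, w_v):
    """Single-head attention where queries and keys/values come from different sequences."""
    q = query_seq @ w_q        # (T_query, d)
    k = context_seq @ w_k      # (T_context, d)
    v = context_seq @ w_v      # (T_context, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)   # each query row sums to 1
    return weights @ v, weights


def bidirectional_cross_attention(audio, video, params):
    """Each modality attends to the other; a residual add is assumed here."""
    audio_ctx, a_w = cross_attention(audio, video, *params["audio_from_video"])
    video_ctx, v_w = cross_attention(video, audio, *params["video_from_audio"])
    return audio + audio_ctx, video + video_ctx, a_w, v_w


# Hypothetical sizes: 50 audio frames, 25 video frames, 16-dim features.
rng = np.random.default_rng(0)
d = 16
audio = rng.standard_normal((50, d))
video = rng.standard_normal((25, d))
params = {
    name: tuple(rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    for name in ("audio_from_video", "video_from_audio")
}

fused_audio, fused_video, audio_weights, video_weights = (
    bidirectional_cross_attention(audio, video, params)
)
```

Each stream keeps its own time resolution here (`fused_audio` has one row per audio frame, `fused_video` one per video frame); how the two streams are merged before the Squeezeformer stack is not specified in this excerpt, so that step is left out.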

Paper Structure

This paper contains 18 sections, 7 figures, 1 table.

Figures (7)

  • Figure 1: Architecture of proposed AUREXA-SE framework
  • Figure 2: Detailed blueprint of data processing pipeline
  • Figure 3: Detailed blueprint of attention and decoder mechanism
  • Figure 4: Workflow of a Squeezeformer block
  • Figure 5: Validation performance of the AUREXA-SE model across 20 epochs. The graph illustrates trends in SDR, PESQ, STOI, and validation loss.
  • ...and 2 more figures