IIANet: An Intra- and Inter-Modality Attention Network for Audio-Visual Speech Separation

Kai Li, Runxuan Yang, Fuchun Sun, Xiaolin Hu

TL;DR

IIANet tackles audio-visual speech separation by introducing a hierarchical, brain-inspired attention framework that separately handles intra-modal and inter-modal interactions across multiple temporal scales. The separation network leverages global guidance and selective sigmoid gating to orchestrate InterA blocks at the top, middle, and bottom, combined with IntraA within modalities, enabling efficient, multi-scale fusion. Empirical results on LRS2, LRS3, and VoxCeleb2 show state-of-the-art separation quality, with IIANet-fast delivering substantial reductions in MACs and CPU time while preserving or improving performance. This work highlights the promise of attention-driven multimodal fusion for robust speech separation in noisy environments and points toward efficient real-time AV processing in practical applications.

Abstract

Recent research has made significant progress in designing fusion modules for audio-visual speech separation. However, existing methods predominantly perform multi-modal fusion at a single temporal scale of auditory and visual features and do not employ selective attention mechanisms, in sharp contrast with the brain. To address this issue, we propose a novel model called Intra- and Inter-Attention Network (IIANet), which leverages the attention mechanism for efficient audio-visual feature fusion. IIANet consists of two types of attention blocks: intra-attention (IntraA) and inter-attention (InterA) blocks, where the InterA blocks are distributed at the top, middle and bottom of IIANet. Inspired by the way the human brain selectively focuses on relevant content at various temporal scales, these blocks maintain the ability to learn modality-specific features and enable the extraction of different semantics from audio-visual features. Comprehensive experiments on three standard audio-visual separation benchmarks (LRS2, LRS3, and VoxCeleb2) demonstrate the effectiveness of IIANet, which outperforms previous state-of-the-art methods while maintaining comparable inference time. In particular, the fast version of IIANet (IIANet-fast) has only 7% of CTCNet's MACs and is 40% faster than CTCNet on CPUs while achieving better separation quality, showing the great potential of the attention mechanism for efficient and effective multimodal fusion.

Paper Structure

This paper contains 28 sections, 10 equations, 6 figures, 11 tables.

Figures (6)

  • Figure 1: The overall pipeline of IIANet. (A) IIANet consists of four main components: audio encoder, video encoder, separation network, and audio decoder. The red and blue tildes indicate that the same module is repeated several times. (B) The separation network contains two types of attention blocks: IntraA and InterA (InterA-T, InterA-M, InterA-B) blocks. The dashed lines indicate the use of global features $\mathbf{S}_G$ and $\mathbf{V}_G$ as top-down attention modulation for multi-scale features $\mathbf{S}_i$ and $\mathbf{V}_i$. Each block has its own parameters, which are shared across cycles.
  • Figure 2: Flow diagram of the IntraA and InterA blocks in IIANet: (A) InterA-T block, (B) IntraA block, (C) InterA-M block and (D) InterA-B block, where $\odot$ denotes the element-wise product and $\sigma$ denotes the sigmoid function.
  • Figure 3: The overall pipeline and architecture of IIANet (audio-only).
  • Figure 4: The architecture of IIANet's control models. (A) Control 1. It is obtained by removing the IntraA and InterA blocks of IIANet. (B) Control 2. It is obtained by removing InterA blocks of IIANet.
  • Figure 5: Sample visualizations of faces with different orientations in the LRS2 dataset.
  • ...and 1 more figures
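
The captions for Figures 1 and 2 describe global features modulating multi-scale features through a sigmoid ($\sigma$) and an element-wise product ($\odot$). The following is a minimal sketch of that general gating idea only, not the authors' block design: the function names, the nearest-neighbour upsampling, and the plain-list feature layout (time steps by channels) are all assumptions made for illustration, and the real blocks presumably include learned layers that are omitted here.

```python
import math


def sigmoid(x):
    """Logistic function, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def upsample_nearest(global_feat, target_len):
    """Stretch a short sequence to target_len by repeating frames.

    Assumption: nearest-neighbour repetition stands in for whatever
    interpolation the real model uses to align temporal scales.
    """
    src_len = len(global_feat)
    return [global_feat[(t * src_len) // target_len] for t in range(target_len)]


def gated_modulation(local_feat, global_feat):
    """Hypothetical top-down gate: sigma(global) (element-wise product) local.

    local_feat:  list of T frames, each a list of C channel values.
    global_feat: list of T_g frames (T_g <= T), each a list of C values.
    Returns a T x C list in which each local value is scaled by a gate
    in (0, 1) derived from the time-aligned global feature.
    """
    guide = upsample_nearest(global_feat, len(local_feat))
    return [
        [sigmoid(g) * x for x, g in zip(frame, guide_frame)]
        for frame, guide_frame in zip(local_feat, guide)
    ]


# Toy input: 4 local frames with 2 channels, guided by 2 global frames.
local = [[1.0, 2.0] for _ in range(4)]
global_guide = [[0.0, 0.0], [100.0, -100.0]]
out = gated_modulation(local, global_guide)
```

In this toy input, a global value of 0 gives a gate of 0.5, so the first two frames are halved; a strongly positive value passes a channel almost unchanged, and a strongly negative one suppresses it. That behaviour is the sense in which a sigmoid gate is "selective".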