TDFNet: An Efficient Audio-Visual Speech Separation Model with Top-down Fusion

Samuel Pegg; Kai Li; Xiaolin Hu

TL;DR

TDFNet introduces an efficient audio-visual speech separation model that leverages top-down fusion and TDANet-inspired blocks to fuse multi-scale audio-visual features. The architecture comprises a video encoder, an audio encoder with bottlenecked channels, a refinement module with iterative audio-visual fusion, a mask generator, and a decoder, significantly reducing MACs and parameters while achieving state-of-the-art performance on LRS2-2Mix. Ablation studies demonstrate the benefits of using GRU in the audio sub-network and MHSA in the video sub-network, as well as the advantages of parameter sharing strategies. Overall, TDFNet delivers substantial performance gains over CTCNet with only about 30% of MACs and 60% of parameters, offering a practical, low-latency AVSS solution.

Abstract

Audio-visual speech separation has gained significant traction in recent years due to its potential applications in various fields such as speech recognition, diarization, scene analysis and assistive technologies. Designing a lightweight audio-visual speech separation network is important for low-latency applications, but existing methods often require higher computational costs and more parameters to achieve better separation performance. In this paper, we present an audio-visual speech separation model called Top-Down-Fusion Net (TDFNet), a state-of-the-art (SOTA) model for audio-visual speech separation, which builds upon the architecture of TDANet, an audio-only speech separation method. TDANet serves as the architectural foundation for the auditory and visual networks within TDFNet, offering an efficient model with fewer parameters. On the LRS2-2Mix dataset, TDFNet achieves a performance increase of up to 10% across all performance metrics compared with the previous SOTA method CTCNet. Remarkably, these results are achieved using fewer parameters and only 28% of the multiply-accumulate operations (MACs) of CTCNet. In essence, our method presents a highly effective and efficient solution to the challenges of speech separation within the audio-visual domain, making significant strides in harnessing visual information optimally.

Paper Structure

This paper contains 24 sections, 24 equations, 6 figures, 6 tables.

Introduction
Related Work
Audio-Only Speech Separation
Audio-Visual Speech Separation
TDFNet
Video Encoder
Audio Encoder
Refinement Module
Audio and Video Sub-Networks
Cross-Modal-Fusion Sub-network
Mask Generator
Decoder
Audio and Video Sub-Network Structure
Experimental Procedures
Dataset
...and 9 more sections

Figures (6)

Figure 1: Audio-visual speech separation process. From top to bottom: video frames and mixed inputs, cutting out lip regions from the video, TDFNet speech separation results.
Figure 2: TDFNet separation pipeline. The audio and video inputs $\pmb{x}$ and $\pmb{y}$ are encoded by $E_a$ and $E_v$ respectively to produce the feature maps $\pmb{a}$ and $\pmb{v}$, which are sent to the refinement module $R$ to be fused and then further processed. The mask generator $M$ then takes these refined features $\pmb{r}$ and generates masks $\pmb{m}_i$, which are multiplied by the encoded audio input $\pmb{a}$ in turn to produce a separation. Finally, the decoder decodes each of the separated encoded audios. The figure above uses $n_{spk}=2$ speakers.
Figure 3: The $j$th TDFNet Block. (a) The internals of a single TDFNet block. (b) The stacked TDFNet blocks with residual connection from the first iteration.
Figure 4: The core architecture of the audio and video sub-networks. The input is either the audio or the visual features, reduced to the hidden dimension $D$. The bottom-up down-sampling in the green block uses consecutive convolutions with stride 2 to compress the data to increasingly small temporal resolutions. The recurrent operator in the pink block fuses the information to formulate a global perspective. The top-down fusion in the blue block combines the global information at different temporal resolutions, and then fuses them all together into a single feature map.
Figure 5: Attention mechanism. This is a diagrammatic view of Equation \ref{eq:inter_processing}.
...and 1 more figures
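
The masking step described in the Figure 2 caption can be illustrated with a small sketch. It is not the authors' implementation: the encoded mixture $\pmb{a}$ and the masks $\pmb{m}_i$ are represented as plain nested lists (channels by time frames), the mask values are made up for the example, and `toy_decoder` is a hypothetical stand-in for the real decoder, used only to show where decoding happens.

```python
from typing import Callable, List

FeatureMap = List[List[float]]  # channels x time frames


def apply_masks(a: FeatureMap, masks: List[FeatureMap]) -> List[FeatureMap]:
    """Multiply the encoded mixture `a` element-wise by each speaker mask."""
    separated = []
    for m in masks:
        same_shape = len(m) == len(a) and all(
            len(m_row) == len(a_row) for m_row, a_row in zip(m, a)
        )
        if not same_shape:
            raise ValueError("each mask must have the same shape as `a`")
        separated.append(
            [[mv * av for mv, av in zip(m_row, a_row)] for m_row, a_row in zip(m, a)]
        )
    return separated


def separate(
    a: FeatureMap,
    masks: List[FeatureMap],
    decoder: Callable[[FeatureMap], List[float]],
) -> List[List[float]]:
    """Mask the encoded audio once per speaker, then decode each result."""
    return [decoder(s) for s in apply_masks(a, masks)]


def toy_decoder(features: FeatureMap) -> List[float]:
    # Hypothetical stand-in for the decoder: sums the channels of each frame.
    return [sum(frame) for frame in zip(*features)]


# Toy example with n_spk = 2 and illustrative (made-up) values.
a = [[1.0, 2.0], [3.0, 4.0]]
m1 = [[0.25, 0.75], [0.5, 1.0]]
m2 = [[0.75, 0.25], [0.5, 0.0]]
outputs = separate(a, [m1, m2], toy_decoder)
```

Because the two toy masks sum to one at every position and the stand-in decoder is linear, the two decoded outputs add back up to the decoded mixture. That property belongs to this example only and is not a claim about the masks TDFNet learns.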
