Reading to Listen at the Cocktail Party: Multi-Modal Speech Separation
Akam Rahimi, Triantafyllos Afouras, Andrew Zisserman
TL;DR
This work introduces VoiceFormer, a Transformer-based multi-modal framework for speech separation and enhancement that can condition on asynchronous cues from audio, lip video, and text. It operates directly on raw waveforms with a U-Net encoder-decoder and fuses audio with visual and textual information in a cross-modality Transformer bottleneck, without requiring strict synchronization between the streams. Key contributions are text-conditioned speech enhancement, robustness to audio-visual misalignment, and state-of-the-art results on the LRS2 and LRS3 benchmarks, where it outperforms both audio-only and prior audio-visual methods. The approach has practical implications for speech separation in noisy, multi-speaker environments and opens avenues for text-guided isolation in applications such as subtitling, accessibility, and teleconferencing. The paper also acknowledges that text conditioning requires the text of the target utterance and that the technology carries potential societal risks.
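To illustrate why attention-based fusion does not need frame-aligned streams, here is a minimal NumPy sketch of single-head cross-attention in which audio features query a concatenated sequence of video and text embeddings. It is not the VoiceFormer implementation: the function names, feature dimension, sequence lengths, and random weights are all assumptions chosen for illustration, and details such as multiple heads, positional information, and the U-Net encoder and decoder are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(audio, cues, w_q, w_k, w_v):
    """Single-head cross-attention: audio frames query the conditioning cues.

    audio: (T_audio, d) features, cues: (T_cues, d) features.
    The two sequence lengths are independent of each other.
    """
    q = audio @ w_q                              # (T_audio, d)
    k = cues @ w_k                               # (T_cues, d)
    v = cues @ w_v                               # (T_cues, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])      # (T_audio, T_cues)
    weights = softmax(scores, axis=-1)           # each row sums to 1
    return audio + weights @ v, weights          # residual connection


rng = np.random.default_rng(0)
d = 16                                           # hypothetical feature size
audio = rng.standard_normal((50, d))             # hypothetical audio bottleneck features
video = rng.standard_normal((25, d))             # hypothetical lip-video embeddings
text = rng.standard_normal((12, d))              # hypothetical text token embeddings

# Cues of different lengths and rates are simply stacked along the time axis.
cues = np.concatenate([video, text], axis=0)     # (37, d)
w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

fused, weights = cross_attention(audio, cues, w_q, w_k, w_v)
print(fused.shape, weights.shape)                # (50, 16) (50, 37)
```

Because every audio frame computes its own weighting over all cue tokens, the cues can have any length or temporal offset relative to the audio; in a trained model the attention weights, rather than a fixed frame alignment, determine which cue tokens inform each audio frame.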
Abstract
The goal of this paper is speech separation and enhancement in multi-speaker and noisy environments using a combination of different modalities. Previous works have shown good performance when conditioning on temporal or static visual evidence such as synchronised lip movements or face identity. In this paper, we present a unified framework for multi-modal speech separation and enhancement based on synchronous or asynchronous cues. To that end we make the following contributions: (i) we design a modern Transformer-based architecture tailored to fuse different modalities to solve the speech separation task in the raw waveform domain; (ii) we propose conditioning on the textual content of a sentence alone or in combination with visual information; (iii) we demonstrate the robustness of our model to audio-visual synchronisation offsets; and, (iv) we obtain state-of-the-art performance on the well-established benchmark datasets LRS2 and LRS3.
