Reading to Listen at the Cocktail Party: Multi-Modal Speech Separation

Akam Rahimi, Triantafyllos Afouras, Andrew Zisserman

TL;DR

This work introduces VoiceFormer, a Transformer-based multi-modal framework for speech separation and enhancement that can condition on asynchronous cues from audio, lip-video, and text. By operating directly on raw waveforms with a U-Net encoder-decoder and a cross-modality Transformer bottleneck, it fuses audio with visual and textual information without requiring strict synchronization. Key contributions include text-conditioned speech enhancement, robustness to audio-visual misalignment, and state-of-the-art results on the LRS2 and LRS3 benchmarks, outperforming both audio-only and prior audio-visual methods. The approach has practical implications for robust speech separation in noisy, multi-speaker environments and opens avenues for text-guided isolation in applications like subtitles, accessibility, and teleconferencing. Its limitations include the need for the text of the target utterance and potential societal risks.

Abstract

The goal of this paper is speech separation and enhancement in multi-speaker and noisy environments using a combination of different modalities. Previous works have shown good performance when conditioning on temporal or static visual evidence such as synchronised lip movements or face identity. In this paper, we present a unified framework for multi-modal speech separation and enhancement based on synchronous or asynchronous cues. To that end we make the following contributions: (i) we design a modern Transformer-based architecture tailored to fuse different modalities to solve the speech separation task in the raw waveform domain; (ii) we propose conditioning on the textual content of a sentence alone or in combination with visual information; (iii) we demonstrate the robustness of our model to audio-visual synchronisation offsets; and, (iv) we obtain state-of-the-art performance on the well-established benchmark datasets LRS2 and LRS3.
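
The fusion idea in contribution (i) can be illustrated with a minimal sketch. It assumes, for illustration only, that the token sequences of the available modalities are concatenated along the time axis and passed through self-attention, so that sequences of different lengths and rates need no explicit alignment. The names `fuse`, `self_attention`, and `rand_tokens`, the single head, the identity projections, and the random embeddings are all hypothetical; the actual model uses learned layers whose details are not given in this summary.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: identity query/key/value projections, to keep the sketch short.
    Returns the attended tokens and the attention weights (one row per query).
    """
    dim = len(tokens[0])
    weights = []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * value[d] for w, value in zip(row, tokens)) for d in range(dim)]
        for row in weights
    ]
    return outputs, weights


def fuse(audio, video=None, text=None):
    """Attend over audio tokens plus whichever conditioning tokens are given.

    Only the rows belonging to the audio tokens are returned, since those are
    what a decoder would turn back into a waveform.
    """
    tokens = list(audio)
    for extra in (video, text):
        if extra:
            tokens += list(extra)
    refined, weights = self_attention(tokens)
    return refined[: len(audio)], weights[: len(audio)]


# Hypothetical sequence lengths: each modality runs at its own rate.
DIM = 8
rng = random.Random(0)


def rand_tokens(count):
    return [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(count)]


audio = rand_tokens(12)   # audio embeddings from the encoder
video = rand_tokens(5)    # lip-video embeddings
text = rand_tokens(7)     # phoneme embeddings

refined_all, attn_all = fuse(audio, video, text)   # video + text conditioning
refined_text, attn_text = fuse(audio, text=text)   # text-only conditioning
```

Each audio token receives one attention weight per token in the concatenated sequence, which is the kind of cross-modal map that Figure 3 visualises. Dropping a modality only shortens the sequence, so the same function covers video-only, text-only, and combined conditioning.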

Paper Structure (14 sections, 4 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: We propose VoiceFormer, a framework for multi-modal speech separation and enhancement, which isolates speech according to either the text content of the target speaker's utterance, their lip movements, or both. Our framework allows conditioning on cues from multiple modalities, without requiring them to be temporally synchronised or have a common temporal rate. This gives it multiple advantages, such as robustness to temporal misalignments between the inputs.
  • Figure 2: Overview of the proposed multi-modal speech enhancement with transformers (VoiceFormer) architecture. It consists of a U-Net style encoder-decoder for the audio stream, with the bottleneck layers conditioned on a Transformer that can ingest textual and visual modalities. The U-Net encoder ingests the raw audio waveform of the target speaker with noise (background or other speakers) and produces a sequence of audio embeddings. The multi-layer Transformer conditions on the audio embeddings, the phoneme sequence extracted from the text being spoken, and/or the visual embeddings from the video of the target speaker. The U-Net decoder takes the sequence of refined audio embeddings output by the Transformer and produces the clean audio waveform of the target speaker (with the noise removed). In both training and inference, the conditioning can include video, text, or both.
  • Figure 3: Attention map visualisations of the first Transformer layer. The visualisations show the average score of the attention heads in the first multi-head attention layer of the Transformer. Brighter colours indicate higher scores, and brighter pixels on the same row indicate correspondence between modalities. Left: audio and video correlation. Right: audio and text correlation. Higher scores are given to the audio token and its corresponding token in the other modality at each timestep. This indicates that the model is able to fuse the mixed/noisy audio stream with the conditioning vectors from different modalities without explicit alignment between the signals and without requiring them to operate at the same temporal rate.
  • Figure 4: Experiments with missing information.
  • Figure 5: Robustness to audio-visual misalignment. We compare our proposed model with a baseline using an LSTM bottleneck in the audio-visual speaker separation setting. It is clear that while the LSTM baseline struggles when the video and audio streams are misaligned, VoiceFormer is robust to synchronisation offsets. A five-frame offset corresponds to 200 ms.
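
The offset arithmetic in the Figure 5 caption can be checked with a short sketch. A five-frame offset of 200 ms implies 40 ms per frame, i.e. a video rate of 25 frames per second. The helper names `frames_to_ms` and `shift_frames` are hypothetical, and padding by repeating the edge frame is an assumption made here to keep the sequence length fixed; the summary does not state how the misaligned inputs were constructed.

```python
def frames_to_ms(num_frames, fps=25.0):
    """Convert a frame offset to milliseconds (25 fps is implied by the caption)."""
    return 1000.0 * num_frames / fps


def shift_frames(frames, offset):
    """Simulate audio-visual misalignment by shifting a frame sequence.

    offset > 0 delays the video, offset < 0 advances it. The length is kept
    constant by repeating the first or last frame (an assumed padding scheme).
    """
    frames = list(frames)
    count = len(frames)
    if offset == 0 or count == 0:
        return frames
    if offset > 0:
        pad = min(offset, count)
        return ([frames[0]] * pad + frames)[:count]
    pad = min(-offset, count)
    return (frames + [frames[-1]] * pad)[pad:]


frame_ids = list(range(10))           # stand-in for ten video frames
delayed = shift_frames(frame_ids, 5)  # video lags the audio by five frames
advanced = shift_frames(frame_ids, -2)
offset_ms = frames_to_ms(5)
```

Feeding `delayed` or `advanced` frames alongside unshifted audio is one simple way to probe how a separation model degrades as the offset grows.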