Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation

Ariel Ephrat, Inbar Mosseri, Oran Lang, Tali Dekel, Kevin Wilson, Avinatan Hassidim, William T. Freeman, Michael Rubinstein

TL;DR

This work tackles the cocktail party problem in video by introducing a speaker-independent audio-visual speech separation model that uses face embeddings as visual cues to isolate target speakers. It pairs a dilated convolutional audio stream with a visual stream and fuses them through a BLSTM to predict per-speaker complex spectrogram masks, trained on a new large AV dataset, AVSpeech. The method outperforms audio-only baselines across synthetic multi-speaker scenarios and demonstrates robust performance in real-world videos, while enabling applications in video transcription and post-processing. The AVSpeech dataset and extensive ablation analyses establish the value of visual information for both separation quality and speaker-to-face association, marking a significant step toward practical AV speech separation.
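As a rough illustration of what predicting a complex spectrogram mask means, the sketch below multiplies a mixture spectrogram element-wise by a complex mask and converts the result back to a waveform. It is not the paper's network: the mask here is an oracle computed from the known clean source, the framing is a simplified non-overlapping FFT rather than a full STFT, and all names (`frame_spectrogram`, `apply_complex_mask`, `spectrogram_to_waveform`) and signal parameters are assumptions made for the example.

```python
import numpy as np

def frame_spectrogram(signal, frame_len=256):
    """Split a 1-D signal into non-overlapping frames and FFT each frame.

    Simplification of a real STFT: rectangular window, no overlap.
    """
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    return np.fft.rfft(frames, axis=1)  # shape: (n_frames, frame_len // 2 + 1)

def apply_complex_mask(mixture_spec, mask):
    """Element-wise complex multiplication; the mask changes magnitude and phase."""
    return mixture_spec * mask

def spectrogram_to_waveform(spec, frame_len=256):
    """Invert frame_spectrogram by inverse-FFT of each frame and concatenation."""
    return np.fft.irfft(spec, n=frame_len, axis=1).reshape(-1)

# Toy two-source mixture (hypothetical signals, not data from the paper).
rng = np.random.default_rng(0)
t = np.arange(4096) / 16000.0
speaker_a = np.sin(2 * np.pi * 220 * t)
speaker_b = 0.5 * rng.standard_normal(t.shape)
mixture = speaker_a + speaker_b

mix_spec = frame_spectrogram(mixture)

# Oracle mask: the ratio of the clean source spectrogram to the mixture.
# A trained model would have to predict this mask from the mixture and the face.
eps = 1e-8
ideal_mask_a = frame_spectrogram(speaker_a) / (mix_spec + eps)

recovered_a = spectrogram_to_waveform(apply_complex_mask(mix_spec, ideal_mask_a))
```

With the oracle mask, `recovered_a` matches `speaker_a` up to numerical error, which shows why a complex mask (unlike a magnitude-only one) can in principle restore the phase of the target speaker as well as its magnitude.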

Abstract

We present a joint audio-visual model for isolating a single speech signal from a mixture of sounds such as other speakers and background noise. Solving this task using only audio as input is extremely challenging and does not provide an association of the separated speech signals with speakers in the video. In this paper, we present a deep network-based model that incorporates both visual and auditory signals to solve this task. The visual features are used to "focus" the audio on desired speakers in a scene and to improve the speech separation quality. To train our joint audio-visual model, we introduce AVSpeech, a new dataset comprised of thousands of hours of video segments from the Web. We demonstrate the applicability of our method to classic speech separation tasks, as well as real-world scenarios involving heated interviews, noisy bars, and screaming children, only requiring the user to specify the face of the person in the video whose speech they want to isolate. Our method shows a clear advantage over state-of-the-art audio-only speech separation in cases of mixed speech. In addition, our model, which is speaker-independent (trained once, applicable to any speaker), produces better results than recent audio-visual speech separation methods that are speaker-dependent (require training a separate model for each speaker of interest).

Paper Structure

This paper contains 38 sections, 3 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: AVSpeech dataset: We first gathered a large collection of 290,000 high-quality, online public videos of talks and lectures (a). From these videos we extracted segments with clean speech (e.g. no mixed music, audience sounds or other speakers), and with the speaker visible in the frame (see the dataset section and the dataset-pipeline figure for details of the processing). This resulted in 4700 hours of video clips, each of a single person talking with no background interference (b). This data spans a wide variety of people, languages, and face poses, with distributions shown in (c) (age and head angles estimated with automatic classifiers; language based on YouTube metadata). For a detailed list of video sources in our dataset please refer to the project web page.
  • Figure 2: Video and audio processing for dataset creation: (a) We use face detection and tracking to extract speech segment candidates from videos and reject frames in which faces are blurred or not sufficiently frontal-facing. (b) We discard segments with noisy speech by estimating speech SNR (see the dataset section). The plot is intended to show the accuracy of our speech SNR estimator (and thus the quality of the dataset). We compare true speech SNR with our predicted SNR for synthetic mixtures of clean speech and non-speech noise at known SNR levels. Predicted SNR values (in dB) are averaged over 60 generated mixtures per SNR bin, with error bars representing 1 std. We discard segments for which the predicted speech SNR is below 17 dB (marked by the gray dotted line in the plot).
  • Figure 3: Our model's multi-stream neural network-based architecture: The visual streams take as input thumbnails of detected faces in each frame in the video, and the audio stream takes as input the video's soundtrack, containing a mixture of speech and background noise. The visual streams extract face embeddings for each thumbnail using a pretrained face recognition model, then learn a visual feature using a dilated convolutional NN. The audio stream first computes the STFT of the input signal to obtain a spectrogram, and then learns an audio representation using a similar dilated convolutional NN. A joint, audio-visual representation is then created by concatenating the learned visual and audio features, and is subsequently further processed using a bidirectional LSTM and three fully connected layers. The network outputs a complex spectrogram mask for each speaker, which is multiplied by the noisy input, and converted back to waveforms to obtain an isolated speech signal for each speaker.
  • Figure 4: Input SDR vs. output SDR improvement: A scatter plot showing separation performance (SDR improvement) as a function of original (noisy) SDR for the task of separating two clean speakers (2S clean). Each point corresponds to a single, 3-second audio-visual sample from the test set.
  • Figure 5: Example of input and output audio: The top row shows the audio spectrogram for one segment in our training data, involving two speakers and background noise (a), together with the ground truth, separate spectrograms of each speaker (b, c). In the bottom row we show our results: the masks our method estimates for that segment, superimposed on one spectrogram with a different color for each speaker (d), and the corresponding output spectrograms for each speaker (e, f).
  • ...and 3 more figures
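
The Figure 2 caption mentions synthetic mixtures of clean speech and non-speech noise at known SNR levels, and a 17 dB cutoff on predicted SNR. The sketch below shows one common way such a mixture could be built and how the cutoff could be applied. It does not reproduce the paper's SNR estimator; the function names (`mix_at_snr`, `measure_snr_db`, `keep_segment`) and the toy signals are assumptions for illustration, and only the 17 dB threshold comes from the caption.

```python
import numpy as np

SNR_THRESHOLD_DB = 17.0  # cutoff stated in the Figure 2 caption

def measure_snr_db(speech, noise):
    """SNR in dB: 10 * log10(mean speech power / mean noise power)."""
    return 10.0 * np.log10(np.mean(speech ** 2) / np.mean(noise ** 2))

def mix_at_snr(speech, noise, target_snr_db):
    """Rescale the noise so the mixture has the requested speech SNR.

    Returns the mixture and the rescaled noise.
    """
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    target_noise_power = speech_power / (10.0 ** (target_snr_db / 10.0))
    scaled_noise = noise * np.sqrt(target_noise_power / noise_power)
    return speech + scaled_noise, scaled_noise

def keep_segment(predicted_snr_db, threshold_db=SNR_THRESHOLD_DB):
    """Keep a segment unless its predicted speech SNR is below the threshold."""
    return predicted_snr_db >= threshold_db

# Toy signals standing in for clean speech and non-speech noise (hypothetical).
rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 200 * np.arange(16000) / 16000)
noise = rng.standard_normal(16000)

mixture, scaled_noise = mix_at_snr(speech, noise, target_snr_db=10.0)
```

Because the true SNR of each mixture is known by construction, an estimator's predictions can be compared against it per SNR bin, which is the kind of comparison the caption describes.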