Robust Multi-Modal Speech In-Painting: A Sequence-to-Sequence Approach

Mahsa Kadkhodaei Elyaderani, Shahram Shirani

TL;DR

This work tackles speech in-painting under joint audio-visual corruption by introducing a distortion-aware AV seq2seq model with a lip-reading encoder and an audio-visual decoder. It combines pre-processing, post-processing, and a hybrid CTC+MSE loss to reconstruct missing Mel-spectrogram segments, while employing data augmentation and multi-task learning to predict spoken sentences. The proposed AV-MTL-CS2S achieves state-of-the-art results on the Grid Corpus with about 12M parameters, outperforming transformer baselines by 38.8% in quality and 7.14% in intelligibility, and showing robustness to both acoustic and visual distortions. The approach holds practical impact for robust speech reconstruction in challenging environments and motivates future work on multi-speaker data and more diverse noise conditions.
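As a rough illustration of what a hybrid CTC+MSE objective can look like, the sketch below adds a mean squared error on the Mel-spectrogram to a weighted CTC negative log-likelihood on the transcription. It is a minimal pure-Python sketch under stated assumptions, not the authors' implementation: the function names, the weight `lam`, and the choice to average the MSE over all frames (rather than only over the missing segment) are assumptions that this summary does not specify.

```python
import math

NEG_INF = float("-inf")


def logsumexp(*xs):
    """Numerically stable log(sum(exp(x))) over the arguments."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))


def ctc_nll(log_probs, target, blank=0):
    """CTC negative log-likelihood via the forward algorithm.

    log_probs: T x V nested list of per-frame log-probabilities.
    target:    list of label indices without blanks.
    """
    # Extended label sequence with blanks interleaved: [b, y1, b, y2, ..., b]
    ext = [blank]
    for y in target:
        ext += [y, blank]
    S = len(ext)

    # Initialisation: a path may start in the first blank or the first label.
    alpha = [NEG_INF] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, len(log_probs)):
        new = [NEG_INF] * S
        for s in range(S):
            cands = [alpha[s]]                  # stay on the same symbol
            if s >= 1:
                cands.append(alpha[s - 1])      # advance by one symbol
            # Skip the blank only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(alpha[s - 2])
            new[s] = logsumexp(*cands) + log_probs[t][ext[s]]
        alpha = new

    # A valid path ends in the last label or the trailing blank.
    total = logsumexp(alpha[-1], alpha[-2]) if S > 1 else alpha[-1]
    return -total


def mse(pred, ref):
    """Mean squared error between two equally shaped T x F spectrograms."""
    n = sum(len(row) for row in ref)
    sq = sum((p - r) ** 2 for pr, rr in zip(pred, ref) for p, r in zip(pr, rr))
    return sq / n


def hybrid_loss(pred_mel, ref_mel, log_probs, target, lam=0.5):
    """Reconstruction MSE plus lam times the CTC loss (lam is hypothetical)."""
    return mse(pred_mel, ref_mel) + lam * ctc_nll(log_probs, target)


# Toy usage: 2 frames, 2 Mel bins, vocabulary {0: blank, 1: a single label}.
uniform = [[math.log(0.5), math.log(0.5)] for _ in range(2)]
loss = hybrid_loss([[1.0, 2.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]],
                   uniform, target=[1], lam=0.5)
```

In a multi-task setup of this kind, the weight trades off spectrogram reconstruction against lip-reading accuracy. A real training pipeline would normally rely on a framework's built-in, differentiable CTC loss rather than a hand-written forward pass.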

Abstract

The process of reconstructing missing parts of speech audio from context is called speech in-painting. Human perception of speech is inherently multi-modal, involving both audio and visual (AV) cues. In this paper, we introduce and study a sequence-to-sequence (seq2seq) speech in-painting model that incorporates AV features. Our approach extends AV speech in-painting techniques to scenarios where both audio and visual data may be jointly corrupted. To achieve this, we employ a multi-modal training paradigm that boosts the robustness of our model across various conditions involving acoustic and visual distortions. This makes our distortion-aware model a plausible solution for real-world challenging environments. We compare our method with existing transformer-based and recurrent neural network-based models, which attempt to reconstruct missing speech gaps ranging from a few milliseconds to over a second. Our experimental results demonstrate that our novel seq2seq architecture outperforms the state-of-the-art transformer solution by 38.8% in terms of enhancing speech quality and 7.14% in terms of improving speech intelligibility. We exploit a multi-task learning framework that simultaneously performs lip-reading (transcribing video components to text) while reconstructing missing parts of the associated speech.

Paper Structure (30 sections, 7 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: An illustration of the seq2seq model for speech in-painting proposed in this paper. The encoder, or lip-reader (left), takes cropped video frames and outputs the corresponding transcriptions. The decoder, or speech in-painter (right), uses both spectral and visual features from the encoder to restore distorted spectrograms.
  • Figure 2: Effect of gap size on reconstructed audio quality, intelligibility, and spectrogram quality in the informed case for the unseen speakers' test set. In every panel, the horizontal axis shows the gap duration in milliseconds, and the vertical axis shows the models' performance on the corresponding evaluation metric.
  • Figure 3: Qualitative results of in-painted Mel-spectrograms for the AV-SI (Morrone et al.), AV-MTL-S2S (the authors' earlier model), and AV-MTL-CS2S models. The distorted regions of interest are marked with red boxes and zoomed in for better visualization. The first two rows show the first example, in the informed case, and the last two rows show the second example, in the uninformed case.
  • Figure 4: Qualitative results of in-painted Mel-spectrograms for the AUG-AV-SI (Morrone et al.), AUG-AV-MTL-S2S (the authors' earlier model), and AUG-AV-MTL-CS2S models. The distorted regions of interest are marked with red boxes and zoomed in for better visualization. The first two rows show the first example, with additive white noise; the third and fourth rows show the second example, with added environmental sounds; and the last two rows show synchronously masked AV data.