Robust Multi-Modal Speech In-Painting: A Sequence-to-Sequence Approach
Mahsa Kadkhodaei Elyaderani, Shahram Shirani
TL;DR
This work tackles speech in-painting under joint audio-visual corruption by introducing a distortion-aware AV seq2seq model with a lip-reading encoder and an audio-visual decoder. It combines pre-processing, post-processing, and a hybrid CTC+MSE loss to reconstruct missing Mel-spectrogram segments, and uses data augmentation and multi-task learning to additionally predict the spoken sentences. The proposed model, AV-MTL-CS2S, achieves state-of-the-art results on the Grid Corpus with about 12M parameters, outperforming transformer baselines by 38.8% in speech quality and 7.14% in speech intelligibility, and remaining robust to both acoustic and visual distortions. The approach is of practical value for robust speech reconstruction in challenging environments and motivates future work on multi-speaker data and more diverse noise conditions.
Abstract
Speech in-painting is the process of reconstructing missing parts of speech audio from the surrounding context. Human perception of speech is inherently multi-modal, involving both audio and visual (AV) cues. In this paper, we introduce and study a sequence-to-sequence (seq2seq) speech in-painting model that incorporates AV features. Our approach extends AV speech in-painting techniques to scenarios where the audio and visual data may be jointly corrupted. To achieve this, we employ a multi-modal training paradigm that boosts the robustness of our model across various conditions involving acoustic and visual distortions. This makes our distortion-aware model a plausible solution for challenging real-world environments. We compare our method with existing transformer-based and recurrent neural network-based models that reconstruct speech gaps ranging from a few milliseconds to over a second. Our experimental results demonstrate that our novel seq2seq architecture outperforms the state-of-the-art transformer solution by 38.8% in terms of enhancing speech quality and 7.14% in terms of improving speech intelligibility. We also exploit a multi-task learning framework that simultaneously performs lip-reading (transcribing video components to text) while reconstructing the missing parts of the associated speech.
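As a rough illustration of the multi-task objective described above, the following minimal sketch combines a mean-squared-error term on the missing Mel-spectrogram frames with a CTC term on the lip-reading transcript, in the spirit of the hybrid CTC+MSE loss named in the summary. It is not the authors' implementation: the function names (`masked_mse`, `ctc_nll`, `hybrid_loss`), the weighting factor `ctc_weight`, the restriction of the MSE to the gap frames, and the use of plain Python lists in probability space are all assumptions made for readability. A practical system would use a deep-learning framework and log-space arithmetic to avoid underflow on long sequences.

```python
import math


def masked_mse(pred, target, mask):
    """Mean squared error over the frames where mask is 1 (assumed: the gap)."""
    total, count = 0.0, 0
    for pred_frame, target_frame, is_gap in zip(pred, target, mask):
        if is_gap:
            for p, t in zip(pred_frame, target_frame):
                total += (p - t) ** 2
                count += 1
    return total / count if count else 0.0


def ctc_nll(probs, labels, blank=0):
    """CTC negative log-likelihood via the forward algorithm.

    probs[t][k] is the per-frame probability of symbol k; labels is the
    target symbol sequence without blanks. Computed in probability space,
    which is only adequate for short toy sequences.
    """
    # Interleave blanks: [blank, l1, blank, l2, ..., blank]
    ext = [blank]
    for label in labels:
        ext += [label, blank]
    size = len(ext)

    alpha = [0.0] * size
    alpha[0] = probs[0][ext[0]]
    if size > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, len(probs)):
        new_alpha = [0.0] * size
        for s in range(size):
            acc = alpha[s]
            if s >= 1:
                acc += alpha[s - 1]
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                acc += alpha[s - 2]
            new_alpha[s] = acc * probs[t][ext[s]]
        alpha = new_alpha

    likelihood = alpha[-1] + (alpha[-2] if size > 1 else 0.0)
    return -math.log(likelihood) if likelihood > 0.0 else math.inf


def hybrid_loss(pred, target, mask, probs, labels, ctc_weight=0.5):
    """Weighted sum of reconstruction and transcription losses.

    ctc_weight is a hypothetical hyperparameter, not a value from the paper.
    """
    return masked_mse(pred, target, mask) + ctc_weight * ctc_nll(probs, labels)
```

In this sketch the reconstruction term only penalizes frames inside the gap, while the CTC term ties the shared representation to the spoken sentence; how the two terms are actually balanced in the proposed model is not specified in this section.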
