Sequence-to-Sequence Multi-Modal Speech In-Painting
Mahsa Kadkhodaei Elyaderani, Shahram Shirani
TL;DR
This work tackles speech in-painting under long-duration distortions by incorporating visual information from lip movements into a sequence-to-sequence framework. An encoder acts as a lip-reader, converting mouth-region cues into latent representations, while a decoder reconstructs the missing audio spectrogram segments conditioned on both the visual features and the degraded audio. The model is trained end-to-end with a joint loss, $\text{loss} = \text{MSE} + \lambda \cdot \text{CTC}$. On the Grid Corpus, the approach, particularly the AV-MTL-S2S variant, achieves higher PESQ and STOI than audio-only baselines and surpasses or matches prior multi-modal methods for distortions between 300 and 1500 ms, demonstrating the value of audio-visual fusion and multi-task learning in speech in-painting. The results highlight the practical potential of multi-modal speech restoration in communication systems facing transmission or recording impairments; future work points to transformer-based architectures and cross-attention mechanisms to further improve performance and robustness.
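As a rough illustration of how a joint objective of the form MSE + λ · CTC can be assembled, the following is a minimal pure-Python sketch. It is not the authors' implementation: the function names (`mse`, `ctc_nll`, `joint_loss`), the default weight `lam=0.5`, and the choice of label alphabet are all assumptions made for the example. The CTC term is computed with the standard forward algorithm in the probability domain, which is easy to follow but underflows on long sequences; practical implementations work in the log domain.

```python
import math

def mse(pred, target):
    """Mean squared error between two equally shaped 2-D lists (frames x bins)."""
    total, count = 0.0, 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            total += (p - t) ** 2
            count += 1
    return total / count

def ctc_nll(probs, labels, blank=0):
    """Negative log-likelihood of `labels` under CTC (forward algorithm).

    probs:  per-frame probability distributions over symbols, shape T x V
    labels: target symbol ids, without blanks
    """
    # Extended target: blank, l1, blank, l2, ..., blank
    ext = [blank]
    for label in labels:
        ext += [label, blank]
    size = len(ext)

    # Initialisation: a path may start with a blank or with the first label.
    alpha = [0.0] * size
    alpha[0] = probs[0][ext[0]]
    if size > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, len(probs)):
        new_alpha = [0.0] * size
        for s in range(size):
            acc = alpha[s]                      # stay on the same symbol
            if s >= 1:
                acc += alpha[s - 1]             # advance by one position
            # Skip over a blank only between two different non-blank labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                acc += alpha[s - 2]
            new_alpha[s] = acc * probs[t][ext[s]]
        alpha = new_alpha

    # Valid paths end on the last label or on the trailing blank.
    likelihood = alpha[-1] + (alpha[-2] if size > 1 else 0.0)
    return -math.log(likelihood)

def joint_loss(pred_spec, true_spec, probs, labels, lam=0.5):
    """Reconstruction loss plus weighted CTC loss; `lam` is a hypothetical weight."""
    return mse(pred_spec, true_spec) + lam * ctc_nll(probs, labels)
```

In this sketch the MSE term scores the reconstructed spectrogram against the clean one, while the CTC term scores the recognition output against a label sequence, so `lam` trades off reconstruction quality against the auxiliary recognition task.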
Abstract
Speech in-painting is the task of regenerating missing audio content using reliable context information. Despite various recent studies on multi-modal perception in audio in-painting, there is still a need for an effective fusion of visual and auditory information in speech in-painting. In this paper, we introduce a novel sequence-to-sequence model that leverages visual information to in-paint audio signals via an encoder-decoder architecture. The encoder plays the role of a lip-reader for facial recordings, and the decoder takes both the encoder outputs and the distorted audio spectrograms to restore the original speech. Our model outperforms an audio-only speech in-painting model and achieves results comparable to a recent multi-modal speech in-painter in terms of speech quality and intelligibility metrics for distortions of 300 ms to 1500 ms duration, which demonstrates the effectiveness of the introduced multi-modality in speech in-painting.
