Sequence-to-Sequence Multi-Modal Speech In-Painting

Mahsa Kadkhodaei Elyaderani, Shahram Shirani

TL;DR

This work tackles speech in-painting under long-duration distortions by incorporating visual information from lip movements into a sequence-to-sequence framework. An encoder acts as a lip-reader, converting mouth-region cues into latent representations, while a decoder reconstructs the missing audio spectrogram segments conditioned on both the visual features and the degraded audio. The model is trained end-to-end with a joint loss, $loss = \text{MSE} + \lambda \cdot \text{CTC}$, which adds a weighted connectionist temporal classification (CTC) term to the mean squared error (MSE). On the Grid Corpus, the approach, particularly the AV-MTL-S2S variant, achieves higher PESQ and STOI than audio-only baselines and matches or surpasses prior multi-modal methods for distortions between 300 and 1500 ms, indicating the value of audio-visual fusion and multi-task learning in speech in-painting. The results highlight the practical potential of multi-modal speech restoration in communication systems facing transmission or recording impairments. Future work points to transformer-based architectures and cross-attention mechanisms to further improve performance and robustness.
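The joint loss named above can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the function names (`ctc_nll`, `joint_loss`), the default weight `lam=0.1`, the choice of blank index 0, and the decision to average the MSE over the whole spectrogram are all assumptions made for illustration. The CTC term is computed with the standard forward algorithm over per-frame log-probabilities.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of log-values."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def ctc_nll(log_probs, target, blank=0):
    """CTC negative log-likelihood via the standard forward algorithm.

    log_probs: list of T frames, each a list of per-class log-probabilities.
    target:    list of label indices without blanks.
    blank:     index of the blank symbol (assumed to be 0 here).
    """
    # Extended target with blanks interleaved: [blank, y1, blank, y2, ..., blank]
    ext = [blank]
    for y in target:
        ext += [y, blank]
    T, S = len(log_probs), len(ext)

    # alpha[s] holds the log-probability of all partial alignments ending in ext[s].
    alpha = [-math.inf] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, T):
        new_alpha = [-math.inf] * S
        for s in range(S):
            candidates = [alpha[s]]
            if s >= 1:
                candidates.append(alpha[s - 1])
            # Skipping over a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                candidates.append(alpha[s - 2])
            new_alpha[s] = logsumexp(candidates) + log_probs[t][ext[s]]
        alpha = new_alpha

    # Valid alignments end on the last label or on the trailing blank.
    final = [alpha[S - 1]]
    if S > 1:
        final.append(alpha[S - 2])
    return -logsumexp(final)


def mse(pred, true):
    """Mean squared error over two equally shaped 2-D lists (spectrograms)."""
    sq = [(p - t) ** 2 for pr, tr in zip(pred, true) for p, t in zip(pr, tr)]
    return sum(sq) / len(sq)


def joint_loss(pred_spec, true_spec, log_probs, target, lam=0.1):
    """loss = MSE + lambda * CTC; lam=0.1 is a placeholder, not the paper's value."""
    return mse(pred_spec, true_spec) + lam * ctc_nll(log_probs, target)
```

With two frames, two classes (blank and one label), and uniform frame probabilities, three of the four possible alignments collapse to the single-label target, so `ctc_nll` evaluates to $-\log 0.75$. In a multi-task setup of this kind, the weight $\lambda$ controls how strongly the transcription objective shapes the shared representation relative to spectrogram reconstruction.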

Abstract

Speech in-painting is the task of regenerating missing audio content using reliable context information. Despite various recent studies in multi-modal perception of audio in-painting, there is still a need for an effective fusion of visual and auditory information in speech in-painting. In this paper, we introduce a novel sequence-to-sequence model that leverages visual information to in-paint audio signals via an encoder-decoder architecture. The encoder plays the role of a lip-reader for facial recordings, and the decoder takes both the encoder outputs and the distorted audio spectrograms to restore the original speech. Our model outperforms an audio-only speech in-painting model and achieves results comparable to a recent multi-modal speech in-painter in terms of speech quality and intelligibility metrics for distortions of 300 ms to 1500 ms duration, which demonstrates the effectiveness of the introduced multi-modality in speech in-painting.

Paper Structure

This paper contains 10 sections, 4 equations, 2 figures, and 1 table.

Figures (2)

  • Figure 1: An illustration of the proposed sequence-to-sequence model for speech in-painting. The encoder (on the left) takes the motion vectors from cropped video frames, and the decoder (on the right) takes both the spectrograms and the visual features from the encoder. The model outputs in-painted spectrograms plus the corresponding transcriptions.
  • Figure 2: Qualitative results of in-painting distorted spectrograms for different methods. The masked areas of interest are marked with red boxes and zoomed in for better visualization. The first two rows correspond to the first example and the last two rows to the second example.