Efficient Audiovisual Speech Processing via MUTUD: Multimodal Training and Unimodal Deployment

Joanna Hong, Sanjeel Parekh, Honglie Chen, Jacob Donley, Ke Tan, Buye Xu, Anurag Kumar

TL;DR

The paper tackles the practical challenge of deploying multimodal speech systems in resource-constrained settings by introducing MUTUD, a framework for Multimodal Training and Unimodal Deployment. It combines a Temporally Aligned Modality feature Estimation (TAME) module with modality-specific codebooks to recall missing modalities from available ones during inference, enabling unimodal deployment with performance close to multimodal models. Across AVSE, AVSR, and AV-ASD, MUTUD achieves substantial efficiency gains (fewer parameters and MACs) while narrowing the performance gap to audiovisual systems, and it generalizes to audio-only data. The approach is generic, scalable to more modalities, and promising for on-device audiovisual speech processing in real-world scenarios.

Abstract

Building reliable speech systems often requires combining multiple modalities, such as audio and visual cues. While such multimodal solutions frequently improve performance and may even be critical in certain cases, they come with several constraints, including increased sensory requirements, computational cost, and the need for modality synchronization. These challenges limit the direct use of multimodal solutions in real-world applications. In this work, we develop approaches where learning happens with all available modalities but deployment or inference is done with just one modality or a reduced set of modalities. To do so, we propose a Multimodal Training and Unimodal Deployment (MUTUD) framework, which includes a Temporally Aligned Modality feature Estimation (TAME) module that can estimate information from a missing modality using the modalities present during inference. This approach integrates information across modalities and uses the modalities that are available to compensate for those that are absent at inference time. We apply MUTUD to various audiovisual speech tasks and show that it can considerably reduce the performance gap between multimodal models and their unimodal counterparts. MUTUD achieves this while reducing model size and compute compared to multimodal models, in some cases by almost 80%.
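To make the idea of recalling a missing modality from modality-specific codebooks more concrete, the following is a minimal sketch and not the paper's implementation. It assumes (hypothetically) that code `i` of an audio codebook is paired with code `i` of a video codebook, that the audio feature is matched to audio codes by scaled dot-product attention, and that the video estimate is the resulting weighted sum of video codes. The function name `estimate_video_feature` and all toy values are illustrative only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def estimate_video_feature(audio_feat, audio_codebook, video_codebook):
    """Estimate a video feature from an audio feature via paired codebooks.

    Assumption (not taken from the paper): audio code i and video code i
    form a pair, and the estimate is an attention-weighted sum of video codes.
    """
    scale = math.sqrt(len(audio_feat))
    # Similarity of the audio feature to every audio code.
    weights = softmax([dot(audio_feat, code) / scale for code in audio_codebook])
    # Read out the paired video codes with the same weights.
    dim = len(video_codebook[0])
    return [
        sum(w * code[d] for w, code in zip(weights, video_codebook))
        for d in range(dim)
    ]


# Toy codebooks: two codes each, 2-D audio codes and 3-D video codes.
audio_codebook = [[1.0, 0.0], [0.0, 1.0]]
video_codebook = [[10.0, 0.0, 0.0], [0.0, 10.0, 0.0]]

# An audio feature close to the first audio code recalls mostly the first video code.
audio_feat = [8.0, 0.0]
video_hat = estimate_video_feature(audio_feat, audio_codebook, video_codebook)
print(video_hat)
```

In this toy setting no video input is needed at inference time: only the audio feature and the two learned codebooks are used, which mirrors the deployment scenario the abstract describes.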

Paper Structure

This paper contains 22 sections, 9 equations, 6 figures, 9 tables.

Figures (6)

  • Figure 1: (a) The left panel shows a comparison between conventional audiovisual speech processing and MUTUD. The TAME module enables audiovisual learning without video processing during prediction. (b) The upper half of the right panel illustrates MUTUD for an AVSE model. After training, the video encoder is discarded. (c) The bottom half of the right panel shows the estimation of video representations using TAME. The illustration is for $t=0$ in Eq. \ref{eq:vid_real_ret}.
  • Figure 2: MUTUD bridges the gap between Audiovisual and Audio-only models. Performance (in %) of different methods relative to the gain Audiovisual brings in average intelligibility (STOI) of noisy speech samples. MUTUD is able to recover most of the performance gains of the Audiovisual model across different SNRs. For example, at -5dB SNR, Audio-only is at 86.0% of the Audiovisual model, whereas MUTUD is at 93.4% of the Audiovisual model.
  • Figure 3: Cosine similarity (red) and $\ell_2$ distance (blue) between video features and estimated video features, video and audio features, and estimated video and audio features for different SNRs.
  • Figure 4: t-SNE visualization of the estimated video features $\hat{F}_{v}$, the actual video features $F_{v}$, and the audio features $F_{a}$ for SNRs ranging from 5dB to -15dB.
  • Figure 5: Visualization of the learned audio and video codebooks ($\bm{C}^{a}$ and $\bm{C}^{v}$). The plots show the mean of each code in all $K$ (= 4) codebooks.
  • ...and 1 more figure
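
The two measures named in the Figure 3 caption, cosine similarity and $\ell_2$ distance, can be computed for a pair of feature vectors as in the sketch below. The vectors `f_v` and `f_v_hat` are made-up toy values standing in for an actual and an estimated video feature; they are not data from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (both must be non-zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def l2_distance(u, v):
    """Euclidean (l2) distance between u and v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical toy features, for illustration only.
f_v = [3.0, 4.0]      # stands in for an actual video feature
f_v_hat = [4.0, 3.0]  # stands in for an estimated video feature

cos = cosine_similarity(f_v, f_v_hat)   # 24 / 25 = 0.96
dist = l2_distance(f_v, f_v_hat)        # sqrt(2)
print(cos, dist)
```

A higher cosine similarity together with a lower $\ell_2$ distance indicates that two features are close in both direction and magnitude, which is the kind of comparison the caption describes across SNRs.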