Missingness-resilient Video-enhanced Multimodal Disfluency Detection

Payal Mohapatra, Shamika Likhite, Subrata Biswas, Bashima Islam, Qi Zhu

TL;DR

This work tackles disfluency detection by moving beyond audio to leverage video cues through a missingness-resilient multimodal framework. It curates a 3.3-hour audiovisual dataset from FluencyBank, and introduces a unified weight-sharing fusion network that projects audio and full-face video into a common latent space, with dynamic modality weighting and a dropout mechanism to handle missing video. Across five disfluency tasks, the approach achieves about a 10 percentage point gain over audio-only baselines when both modalities are present, and maintains a 7-point advantage when video is missing in half of the samples, demonstrating strong robustness and generalization, including zero-shot transfer to another dataset. The work provides open-source code and a practical end-to-end solution that emphasizes the value of full facial context over lip-only cues for paralinguistic disfluency detection, with implications for accessibility and real-world speech processing systems.

Abstract

Most existing speech disfluency detection techniques rely only on acoustic data. In this work, we present a practical multimodal disfluency detection approach that leverages available video data together with audio. We curate an audiovisual dataset and propose a novel fusion technique with unified weight-sharing modality-agnostic encoders to learn the temporal and semantic context. Our resilient design accommodates real-world scenarios where the video modality may sometimes be missing during inference. We also present alternative fusion strategies for when both modalities are guaranteed to be complete. In experiments across five disfluency-detection tasks, our unified multimodal approach significantly outperforms audio-only unimodal methods, yielding an average absolute improvement of 10 percentage points when both video and audio modalities are always available, and 7 percentage points even when the video modality is missing in half of the samples.
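
The idea of a weight-sharing encoder that tolerates missing video can be illustrated with a minimal sketch. This is not the authors' implementation: the names `shared_encoder`, `fuse`, and `p_drop` are hypothetical, both modalities are assumed to be already projected to the same input dimension, the single linear-plus-ReLU layer stands in for the real encoder, and the equal-weight average stands in for the dynamic modality weighting mentioned in the TL;DR.

```python
import random

def shared_encoder(x, weights):
    """Apply one weight matrix, shared by both modalities, followed by ReLU."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]

def fuse(audio_feat, video_feat, weights, training=False, p_drop=0.5, rng=random):
    """Fuse audio with optional video in a common latent space.

    video_feat may be None when video is missing at inference time.
    During training, available video is dropped with probability p_drop
    (assumed value) so the model also sees the audio-only case.
    Returns (latent_vector, video_was_used).
    """
    z_audio = shared_encoder(audio_feat, weights)
    use_video = video_feat is not None
    if training and use_video and rng.random() < p_drop:
        use_video = False  # modality dropout: pretend video is missing
    if not use_video:
        return z_audio, False
    z_video = shared_encoder(video_feat, weights)  # same weights as audio
    # Equal-weight average; a placeholder for learned dynamic weighting.
    fused = [(a + v) / 2 for a, v in zip(z_audio, z_video)]
    return fused, True

if __name__ == "__main__":
    w = [[1.0, 0.0], [0.0, 1.0]]
    print(fuse([1.0, -2.0], [3.0, 4.0], w))  # both modalities present
    print(fuse([1.0, -2.0], None, w))        # video missing at inference
```

Because the audio-only path and the fused path return vectors of the same size, a downstream classifier could consume either one unchanged, which is the property that makes a missing video stream tolerable in this sketch.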

Paper Structure

This paper contains 11 sections, 1 equation, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Illustration of our multimodal learning framework for speech disfluency detection. (I) Unified Modality Fusion Network resilient to missing video modalities, (II) Modality-specific early fusion, and (III) Modality-specific late fusion.
  • Figure 2: Performance of our DAV-unified approach on varying availability of visual modality during inference.
  • Figure 3: Performance on zero-shot transfer to a different acoustic dataset (SEP28k).
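
The early and late fusion variants named in the Figure 1 caption can be contrasted with a small sketch in the generic, textbook sense of those terms. The paper's modality-specific variants may differ in where and how they fuse; the function names, the logistic classifier, and the probability averaging below are all assumptions made for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def linear(x, w, b):
    """Dot product of features and weights plus a bias."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def early_fusion(audio_feat, video_feat, w, b):
    """Concatenate the modality features, then apply one classifier."""
    return sigmoid(linear(audio_feat + video_feat, w, b))

def late_fusion(audio_feat, video_feat, w_a, b_a, w_v, b_v):
    """Score each modality separately, then average the probabilities."""
    p_audio = sigmoid(linear(audio_feat, w_a, b_a))
    p_video = sigmoid(linear(video_feat, w_v, b_v))
    return (p_audio + p_video) / 2

if __name__ == "__main__":
    audio, video = [0.2, -0.1], [0.5]
    print(early_fusion(audio, video, [1.0, 1.0, 1.0], 0.0))
    print(late_fusion(audio, video, [1.0, 1.0], 0.0, [1.0], 0.0))
```

Both variants in this sketch need a real video feature vector for every sample, which matches the caption's framing that only the unified network (I) is designed to be resilient to missing video.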