When AVSR Meets Video Conferencing: Dataset, Degradation, and the Hidden Mechanism Behind Performance Collapse

Yihuan Huang, Jun Xue, Liu Jiajun, Daixian Li, Tong Zhang, Zhuolin Yi, Yanzhen Ren, Kai Li

Abstract

Audio-Visual Speech Recognition (AVSR) has achieved remarkable progress in offline conditions, yet its robustness in real-world video conferencing (VC) remains largely unexplored. This paper presents the first systematic evaluation of state-of-the-art AVSR models across mainstream VC platforms, revealing severe performance degradation caused by transmission distortions and spontaneous human hyper-expression. To address this gap, we construct MLD-VC, the first multimodal dataset tailored for VC, comprising 31 speakers, 22.79 hours of audio-visual data, and explicit use of the Lombard effect to enhance human hyper-expression. Through comprehensive analysis, we find that speech enhancement algorithms are the primary source of distribution shift, which alters the first and second formants of audio. Interestingly, we find that the distribution shift induced by the Lombard effect closely resembles that introduced by speech enhancement, which explains why models trained on Lombard data exhibit greater robustness in VC. Fine-tuning AVSR models on MLD-VC mitigates this issue, achieving an average 17.5% reduction in CER across several VC platforms. Our findings and dataset provide a foundation for developing more robust and generalizable AVSR systems in real-world video conferencing. MLD-VC is available at https://huggingface.co/datasets/nccm2p2/MLD-VC.
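
For readers unfamiliar with the metric, the following is a minimal sketch of how a character error rate (CER) and a reduction in CER are typically computed. It is not the authors' evaluation code: the function names (`edit_distance`, `cer`, `relative_reduction`) and the example numbers are hypothetical, and the abstract does not state whether the reported 17.5% is an absolute or a relative reduction, so both readings are shown.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference characters to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def absolute_reduction(baseline: float, finetuned: float) -> float:
    """Drop in CER, in the same units as the inputs."""
    return baseline - finetuned


def relative_reduction(baseline: float, finetuned: float) -> float:
    """Drop in CER as a fraction of the baseline CER."""
    return (baseline - finetuned) / baseline


if __name__ == "__main__":
    # Hypothetical numbers chosen only to illustrate the arithmetic.
    baseline_cer, finetuned_cer = 0.40, 0.33
    print(cer("abcd", "abed"))
    print(absolute_reduction(baseline_cer, finetuned_cer))
    print(relative_reduction(baseline_cer, finetuned_cer))
```

Under these assumed values, the same pair of error rates corresponds to a 7-point absolute reduction and a 17.5% relative reduction, which is why the reporting convention matters when comparing results.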

Paper Structure (48 sections, 8 figures, 7 tables)

Figures (8)

  • Figure 1: Compared to the face-to-face scenario, we identify two key factors affecting AVSR in video conferencing: transmission distortions in online communication and hyper-expression in a hindered communication environment.
  • Figure 2: Probability density curves of five acoustic features (F0, F1, F2, Loudness, and AlphaRatio) across five subsets in the proposed MLD-VC under Plain (left column) and Lombard (right column). Across features (column-wise), both Plain and Lombard show noticeable shifts toward higher frequencies in F1 and F2 under video conferencing conditions. Across conditions (row-wise), Lombard also exhibits overall higher F1 and F2 than Plain.
  • Figure 3: Illustration of how video conferencing platforms affect speech formants. We simulate platform processing by applying OPUS compression and three typical speech enhancement algorithms. The four subplots show F1 and F2 distributions before and after processing. Only the enhancement stages introduce noticeable spectral shifts and formant distortions, which explain the frequency bias observed in real video conferencing recordings.
  • Figure 4: Picture of the recording environment.
  • Figure 5: Duration distribution across subsets of the proposed MLD-VC dataset. “Offline” refers to the subset of recorded content that was captured before video conferencing.
  • ...and 3 more figures