The Multimodal Information Based Speech Processing (MISP) 2025 Challenge: Audio-Visual Diarization and Recognition

Ming Gao, Shilong Wu, Hang Chen, Jun Du, Chin-Hui Lee, Shinji Watanabe, Jingdong Chen, Siniscalchi Sabato Marco, Odette Scharenborg

TL;DR

The paper presents the MISP 2025 Challenge, a multimodal benchmark for meeting transcription that integrates video with audio to tackle AVSD, AVSR, and AVDR. It introduces the MISP-Meeting dataset, provides robust baselines across all tasks, and analyzes participant approaches, reporting strong gains over baselines in DER, CER, and cpCER. Key findings show DER around 8%, CER near 9.5%, and cpCER around 11.6% for top systems, underscoring the value and challenges of multimodal fusion in real-world meetings. The work advances multimodal speech processing by highlighting effective fusion strategies and outlining directions for improved fusion, domain adaptation, and dataset growth.

Abstract

Meetings are a valuable yet challenging scenario for speech applications due to complex acoustic conditions. This paper summarizes the outcomes of the MISP 2025 Challenge, hosted at Interspeech 2025, which focuses on multi-modal, multi-device meeting transcription by incorporating the video modality alongside audio. The tasks include Audio-Visual Speaker Diarization (AVSD), Audio-Visual Speech Recognition (AVSR), and Audio-Visual Diarization and Recognition (AVDR). We present the challenge's objectives, tasks, dataset, baseline systems, and solutions proposed by participants. The best-performing systems achieved significant improvements over the baseline: the top AVSD model achieved a Diarization Error Rate (DER) of 8.09%, improving by 7.43%; the top AVSR system achieved a Character Error Rate (CER) of 9.48%, improving by 10.62%; and the best AVDR system achieved a concatenated minimum-permutation Character Error Rate (cpCER) of 11.56%, improving by 72.49%.
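To make the two recognition metrics named in the abstract concrete, the following is a minimal Python sketch of CER and cpCER. It is not the challenge's official scoring tool: the function names (`edit_distance`, `cer`, `cp_cer`), the dictionary input format, and the toy transcripts are all assumptions made for illustration. CER is taken here as the character-level edit distance divided by the reference length. cpCER is taken as the same ratio computed after concatenating each speaker's utterances and choosing the assignment of hypothesis speakers to reference speakers that gives the fewest errors.

```python
from itertools import permutations


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: edit distance divided by reference length."""
    return edit_distance(ref, hyp) / len(ref)


def cp_cer(ref_by_spk: dict, hyp_by_spk: dict) -> float:
    """Concatenated minimum-permutation CER (illustrative sketch).

    Each argument maps a speaker label to that speaker's utterances in
    time order. Speaker labels need not match between the two inputs.
    """
    refs = ["".join(utts) for utts in ref_by_spk.values()]
    hyps = ["".join(utts) for utts in hyp_by_spk.values()]
    # Pad with empty streams so both sides have the same number of speakers.
    n = max(len(refs), len(hyps))
    refs += [""] * (n - len(refs))
    hyps += [""] * (n - len(hyps))
    total_ref_chars = sum(len(r) for r in refs)
    # Brute-force search over speaker assignments; fine for a handful of speakers.
    best_errors = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best_errors / total_ref_chars


# Hypothetical toy transcripts, not taken from the MISP-Meeting dataset.
reference = {"A": ["你好", "世界"], "B": ["早上好"]}
hypothesis = {"spk1": ["早上好"], "spk2": ["你好", "世间"]}
score = cp_cer(reference, hypothesis)
```

In this toy case the best assignment pairs `spk2` with `A` and `spk1` with `B`, leaving a single substituted character out of seven reference characters. The brute-force permutation search grows factorially with the number of speakers, so a practical scorer would typically solve the assignment with a polynomial-time method instead.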

Paper Structure

This paper contains 12 sections, 2 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Statistics of meeting rooms.
  • Figure 2: Example of recording venue, and used devices.