M$^3$AV: A Multimodal, Multigenre, and Multipurpose Audio-Visual Academic Lecture Dataset

Zhe Chen, Heyang Liu, Wenyi Yu, Guangzhi Sun, Hongcheng Liu, Ji Wu, Chao Zhang, Yu Wang, Yanfeng Wang

TL;DR

This paper introduces M$^3$AV, a large multimodal dataset of ~367 hours of open academic lectures with aligned slides, speech, OCR, and papers to support both multimodal content recognition and higher-level knowledge understanding. It details the data creation pipeline, including candidate generation for transcription, slide annotation, and rule-based merging to produce high-quality aligned annotations. The authors establish three benchmarks (ASR/CASR, spontaneous TTS, and slide-script generation) and report that while standard models achieve reasonable performance, there is substantial room for improvement, especially in rare-word handling and integrating external knowledge. The dataset and benchmarks aim to catalyze advances in multimodal understanding and knowledge-driven generation for academic content, with practical implications for accessible education and research tooling.

Abstract

Publishing open-source academic video recordings is an emergent and prevalent approach to sharing knowledge online. Such videos carry rich multimodal information including speech, the facial and body movements of the speakers, as well as the texts and pictures in the slides and possibly even the papers. Although multiple academic video datasets have been constructed and released, few of them support both multimodal content recognition and understanding tasks, which is partially due to the lack of high-quality human annotations. In this paper, we propose a novel multimodal, multigenre, and multipurpose audio-visual academic lecture dataset (M$^3$AV), which has almost 367 hours of videos from five sources covering computer science, mathematics, and medical and biological topics. With high-quality human annotations of the slide text and spoken words, in particular high-value named entities, the dataset can be used for multiple audio-visual recognition and understanding tasks. Evaluations performed on contextual speech recognition, speech synthesis, and slide and script generation tasks demonstrate that the diversity of M$^3$AV makes it a challenging dataset.

Paper Structure

This paper contains 54 sections, 4 equations, 17 figures, and 10 tables.

Figures (17)

  • Figure 1: Overview of the M$^3$AV dataset. The first component is slides annotated with simple and complex blocks, which are merged according to a set of rules. The second component is speech, with special vocabulary, spoken and written forms, and word-level timestamps. The third component is the paper corresponding to the video. The asterisk (*) denotes that only computer science videos have corresponding papers.
  • Figure 2: Statistics of our dataset. (a) shows the duration and number of videos. (b) shows the number of slides. (c) shows the number of words per slide. (d) shows the duration per slide segment. (e) shows the speech word frequency.
  • Figure 3: Diagram of the process of creating speech transcriptions.
  • Figure 4: Diagram of the candidate combination process.
  • Figure 5: Diagram of the slide annotation process. Shades of the same colour represent the amount of slide content in the same segment. For example, the dark green page adds some content to the light green page. The check mark (✔) indicates a page that is kept, while the cross (✕) indicates a page that is discarded.
  • ...and 12 more figures