MaViLS, a Benchmark Dataset for Video-to-Slide Alignment, Assessing Baseline Accuracy with a Multimodal Alignment Algorithm Leveraging Speech, OCR, and Visual Features

Katharina Anderer, Andreas Reich, Matthias Wölfel

TL;DR

A novel multimodal algorithm that combines speech, text, and image features with dynamic programming is robust to some of the challenges posed by video quality and lecture style, underscoring the potential of the approach.

Abstract

This paper presents a benchmark dataset for aligning lecture videos with their corresponding slides and introduces a novel multimodal algorithm that leverages features from speech, text, and images. The algorithm achieves an average accuracy of 0.82, compared with 0.56 for SIFT, while being approximately 11 times faster. It uses dynamic programming to determine the optimal slide sequence, and the results show that penalizing slide transitions increases accuracy. Features obtained via optical character recognition (OCR) contribute the most to a high matching accuracy, followed by image features. Audio transcripts alone also provide valuable information for alignment and are beneficial when OCR data is lacking. Variations in matching accuracy across lectures highlight the challenges associated with video quality and lecture style. The multimodal algorithm is robust to some of these challenges, underscoring the potential of the approach.
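
The dynamic-programming step with a transition penalty can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a precomputed frame-by-slide similarity matrix and a constant penalty subtracted whenever the assigned slide changes between consecutive frames. The names `align_frames_to_slides`, `similarity`, and `jump_penalty` are hypothetical, and the example scores are made up for illustration.

```python
import numpy as np


def align_frames_to_slides(similarity, jump_penalty=0.1):
    """Assign one slide index to each frame.

    Maximizes the summed similarity along the path minus `jump_penalty`
    for every change of slide between consecutive frames (assumed form
    of the penalty, not taken from the paper).
    """
    n_frames, n_slides = similarity.shape
    score = np.empty((n_frames, n_slides))
    back = np.zeros((n_frames, n_slides), dtype=int)
    score[0] = similarity[0]

    for t in range(1, n_frames):
        best_prev = int(np.argmax(score[t - 1]))
        stay = score[t - 1]                            # keep the same slide
        jump = score[t - 1, best_prev] - jump_penalty  # switch from best slide
        use_jump = jump > stay
        score[t] = similarity[t] + np.where(use_jump, jump, stay)
        back[t] = np.where(use_jump, best_prev, np.arange(n_slides))

    # Trace the best path backwards from the highest final score.
    path = np.empty(n_frames, dtype=int)
    path[-1] = int(np.argmax(score[-1]))
    for t in range(n_frames - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path


# Made-up scores for 6 frames and 2 slides; frame 2 is a noisy outlier.
sim = np.array([
    [0.90, 0.10],
    [0.80, 0.20],
    [0.45, 0.55],
    [0.90, 0.10],
    [0.20, 0.80],
    [0.10, 0.90],
])
no_penalty = align_frames_to_slides(sim, jump_penalty=0.0)
with_penalty = align_frames_to_slides(sim, jump_penalty=0.2)
```

With no penalty, the path follows the per-frame best match and flips to slide 1 on the noisy frame. With a penalty of 0.2, that brief flip no longer pays off, while the genuine transition at frame 4 is kept.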

Paper Structure (13 sections, 5 equations, 2 figures, 2 tables)

Figures (2)

  • Figure 1: Correlation between F1 score and video quality (7-point Likert). Image quality is shown in blue (solid line, triangle symbols), audio quality in orange (dashed line, cross symbols). Regression lines and 95% confidence intervals (CI) are shown.
  • Figure 2: Left: Correlation between F1 and volatility scores. Right: Correlation between F1 and 'no slide / slide ratio' scores. Regression lines and 95% CI are plotted. Blue (cross, solid) relates to $\lambda^{\text{jump}}=0$, orange (triangle, dashed) to $\lambda^{\text{jump}}=0.1$, and green (point, dotted) to $\lambda^{\text{jump}}=0.2$.