What Is That Talk About? A Video-to-Text Summarization Dataset for Scientific Presentations

Dongqi Liu, Chenxi Whitehouse, Xi Yu, Louis Mahon, Rohit Saxena, Zheng Zhao, Yifu Qiu, Mirella Lapata, Vera Demberg

TL;DR

The paper presents VISTA, a large-scale dataset for video-to-text summarization in scientific contexts, pairing 18,599 conference presentations with their abstracts. It shows that video-grounded large multimodal models outperform text- or audio-based baselines, especially when fine-tuned in-domain, but still lag behind human performance. To improve structure and factual grounding, the authors propose a plan-based framework that first generates a latent plan (a sequence of guiding questions) and then produces the final summary, achieving consistent improvements over state-of-the-art baselines. Through extensive experiments, ablations, and human evaluation, the study demonstrates the value of explicit planning for discourse-aware summarization while highlighting remaining challenges, such as hallucinations and alignment gaps. The work establishes VISTA as a benchmark for multimodal scientific summarization and advocates planning as a generalizable scaffold for future research.

Abstract

Transforming recorded videos into concise and accurate textual summaries is a growing challenge in multimodal learning. This paper introduces VISTA, a dataset specifically designed for video-to-text summarization in scientific domains. VISTA contains 18,599 recorded AI conference presentations paired with their corresponding paper abstracts. We benchmark the performance of state-of-the-art large models and apply a plan-based framework to better capture the structured nature of abstracts. Both human and automated evaluations confirm that explicit planning enhances summary quality and factual consistency. However, a considerable gap remains between models and human performance, highlighting the challenges of our dataset. This study aims to pave the way for future research on scientific video-to-text summarization.
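
The plan-based idea described above (questions first, summary second) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `PlanStep`, `build_plan`, `plan_then_summarize`, and the callables passed to them are hypothetical names, and any real system would replace the callables with model calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class PlanStep:
    """One plan entry: a guiding question and the summary sentence that answers it."""
    question: str
    answer: str


def build_plan(summary_sentences: List[str],
               write_question: Callable[[str], str]) -> List[PlanStep]:
    """Pair each reference summary sentence with one question (one question per sentence).

    `write_question` is a hypothetical stand-in for whatever question
    generator is used; it is an assumption of this sketch.
    """
    return [PlanStep(write_question(sentence), sentence)
            for sentence in summary_sentences]


def plan_then_summarize(source: str,
                        planner: Callable[[str], List[str]],
                        summarizer: Callable[[str, List[str]], str]
                        ) -> Tuple[List[str], str]:
    """Two-stage generation: produce a question plan, then a summary conditioned on it.

    Both `planner` and `summarizer` are hypothetical callables standing in
    for model calls.
    """
    plan = planner(source)              # stage 1: sequence of guiding questions
    summary = summarizer(source, plan)  # stage 2: summary guided by the plan
    return plan, summary
```

Keeping the plan as an explicit intermediate value makes it possible to inspect or perturb the questions independently of the summarizer.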

Paper Structure

This paper contains 54 sections, 21 figures, 14 tables.

Figures (21)

  • Figure 1: An example from VISTA: a conference presentation video (top) paired with the abstract of the corresponding paper (bottom). This data sample (Mallen et al., 2023) was presented at ACL 2023 and received the Best Video Recordings award.
  • Figure 2: Venue distribution of the VISTA dataset.
  • Figure 3: Distribution of summary sentences, summary tokens, video durations, and video shots in VISTA.
  • Figure 4: GPT-o1 generates plans based on reference summaries. Each question $q_i$ corresponds to a summary sentence $t_i$, which we assume constitutes its answer. Index $i$ ranges from $1$ to the number of summary sentences.
  • Figure 5: Noise in plan generation impacts summarization performance. FRR is a shorthand for Full Random Replacement, and RR for Random Replacement. RAST is a SOTA question generation method.
  • ...and 16 more figures