What Is That Talk About? A Video-to-Text Summarization Dataset for Scientific Presentations

Dongqi Liu; Chenxi Whitehouse; Xi Yu; Louis Mahon; Rohit Saxena; Zheng Zhao; Yifu Qiu; Mirella Lapata; Vera Demberg

TL;DR

The paper presents VISTA, a large-scale dataset for video-to-text summarization in scientific contexts, pairing 18,599 conference presentations with their abstracts. It shows that video-grounded large multimodal models outperform text- or audio-based baselines, especially when fine-tuned in-domain, but still lag behind human performance. To address structure and factual grounding, the authors propose a plan-based framework that first generates a latent plan (a sequence of guiding questions) and then produces the final summary, achieving consistent improvements over state-of-the-art baselines. Through extensive experiments, ablations, and human evaluation, the study demonstrates the value of explicit planning for discourse-aware summarization while highlighting remaining challenges, such as hallucinations and alignment gaps. The work establishes VISTA as a robust benchmark and advocates planning as a generalizable scaffold for multimodal scientific summarization in future research.
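
The two-stage idea described above (plan first, then summarize) can be illustrated with a minimal sketch. This is not the authors' implementation: the function `plan_then_summarize`, the two model callables, and the toy stand-ins are all hypothetical names introduced here, and the sketch assumes only that the plan is a list of guiding questions that conditions the summary generator.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not from the paper):
#   plan_model(video)               -> list of guiding questions
#   summary_model(video, questions) -> summary text conditioned on the plan
PlanModel = Callable[[str], List[str]]
SummaryModel = Callable[[str, List[str]], str]


def plan_then_summarize(
    video: str, plan_model: PlanModel, summary_model: SummaryModel
) -> Tuple[List[str], str]:
    """Generate a plan of guiding questions, then a summary conditioned on it."""
    questions = list(plan_model(video))         # stage 1: produce the plan
    summary = summary_model(video, questions)   # stage 2: plan-conditioned summary
    return questions, summary


# Toy stand-ins so the sketch runs end to end; a real system would call
# a multimodal model in both stages.
def toy_plan_model(video: str) -> List[str]:
    return [
        "What problem is addressed?",
        "What method is proposed?",
        "What are the main findings?",
    ]


def toy_summary_model(video: str, questions: List[str]) -> str:
    # One placeholder sentence per guiding question, in plan order.
    return " ".join(f"[answer to: {q}]" for q in questions)


plan, summary = plan_then_summarize("talk.mp4", toy_plan_model, toy_summary_model)
```

Keeping the plan as an explicit intermediate value makes it possible to inspect or edit the guiding questions before the summary is generated, which is one plausible reading of why planning helps with structure.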

Abstract

Transforming recorded videos into concise and accurate textual summaries is a growing challenge in multimodal learning. This paper introduces VISTA, a dataset specifically designed for video-to-text summarization in scientific domains. VISTA contains 18,599 recorded AI conference presentations paired with their corresponding paper abstracts. We benchmark the performance of state-of-the-art large models and apply a plan-based framework to better capture the structured nature of abstracts. Both human and automated evaluations confirm that explicit planning enhances summary quality and factual consistency. However, a considerable gap remains between models and human performance, highlighting the challenges of our dataset. This study aims to pave the way for future research on scientific video-to-text summarization.
