Paper2Video: Automatic Video Generation from Scientific Papers

Zeyu Zhu, Kevin Qinghong Lin, Mike Zheng Shou

TL;DR

Paper2Video addresses the bottleneck of producing academic presentation videos by introducing a benchmark of 101 papers paired with author-made presentation videos and slides, plus four tailored evaluation metrics. It then proposes PaperTalker, a multi-agent system that generates slide layouts in LaTeX Beamer, aligns subtitles and cursor trajectories, and renders personalized talking-head videos, with slides processed in parallel. Experiments show that PaperTalker outperforms strong baselines in fidelity and informativeness and achieves quality comparable to human presentations while delivering a more than sixfold efficiency gain. The work provides a practical path toward automated, ready-to-use academic presentation videos and releases data, code, and models to the research community.

Abstract

Academic presentation videos have become an essential medium for research communication, yet producing them remains highly labor-intensive, often requiring hours of slide design, recording, and editing for a short 2- to 10-minute video. Unlike natural video, presentation video generation involves distinctive challenges: inputs from research papers, dense multi-modal information (text, figures, tables), and the need to coordinate multiple aligned channels such as slides, subtitles, speech, and a human talker. To address these challenges, we introduce Paper2Video, the first benchmark of 101 research papers paired with author-created presentation videos, slides, and speaker metadata. We further design four tailored evaluation metrics--Meta Similarity, PresentArena, PresentQuiz, and IP Memory--to measure how videos convey the paper's information to the audience. Building on this foundation, we propose PaperTalker, the first multi-agent framework for academic presentation video generation. It integrates slide generation with layout refinement via a novel tree search visual choice, cursor grounding, subtitling, speech synthesis, and talking-head rendering, while parallelizing slide-wise generation for efficiency. Experiments on Paper2Video demonstrate that the presentation videos produced by our approach are more faithful and informative than existing baselines, establishing a practical step toward automated and ready-to-use academic video generation. Our dataset, agent, and code are available at https://github.com/showlab/Paper2Video.

Paper Structure

This paper contains 25 sections, 8 figures, 5 tables.

Figures (8)

  • Figure 1: This work solves two core problems for academic presentations. Left: how to create a presentation video from a paper? PaperTalker -- an agent that integrates slide generation, subtitling, cursor grounding, speech synthesis, and talking-head video rendering. Right: how to evaluate a presentation video? Paper2Video -- a benchmark with well-designed metrics to evaluate presentation quality.
  • Figure 2: Statistics of Paper2Video benchmark. It spans diverse topics, with presentations comprising 4--28 slides and lasting 2--14 min, providing a valuable benchmark for the automatic generation and evaluation of academic presentation videos.
  • Figure 3: Overview of evaluation metrics. We propose three metrics that systematically evaluate academic presentation video generation from the perspective of the relationship between the generated video and (i) the original paper and (ii) the human-made video.
  • Figure 4: Overview of PaperTalker. Our pipeline comprises three key modules: (i) tree search visual choice for fine-grained slide layout optimization; (ii) a GUI-grounded model paired with WhisperX for spatiotemporally aligned cursor grounding; and (iii) slide-wise parallel generation for efficiency.
  • Figure 5: Tree Search Visual Choice. It combines a rule-based proposal mechanism with VLM-based scoring to select the optimal candidate.
  • ...and 3 more figures
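
The Figure 5 caption describes Tree Search Visual Choice as a rule-based proposal step followed by VLM-based scoring. The Python sketch below illustrates that propose-then-score pattern in a simplified, single-level form; it is not the authors' implementation. The layout parameters (`font_scale`, `figure_scale`), the candidate grid, and `toy_score` are all assumptions made for illustration. In particular, `toy_score` is a hand-written stand-in for the VLM judge, which in the described system would presumably rate a rendered slide image.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class LayoutCandidate:
    """One slide layout variant (hypothetical parameters, not from the paper)."""
    font_scale: float
    figure_scale: float


def propose_candidates(
    font_scales: Sequence[float], figure_scales: Sequence[float]
) -> List[LayoutCandidate]:
    """Rule-based proposal: enumerate every combination of the given scales."""
    return [LayoutCandidate(f, g) for f, g in product(font_scales, figure_scales)]


def select_best(
    candidates: Sequence[LayoutCandidate],
    score: Callable[[LayoutCandidate], float],
) -> LayoutCandidate:
    """Scoring step: return the candidate with the highest score."""
    return max(candidates, key=score)


def toy_score(c: LayoutCandidate) -> float:
    """Stand-in for a VLM judge (assumption): reward filling the slide,
    penalize overflowing it. The weights 0.6 and 0.5 are arbitrary."""
    fill = c.font_scale * 0.6 + c.figure_scale * 0.5
    overflow = max(0.0, fill - 1.0)
    return min(fill, 1.0) - 5.0 * overflow


candidates = propose_candidates([1.0, 0.8, 0.6], [1.0, 0.75, 0.5])
best = select_best(candidates, toy_score)
print(best)
```

Passing the scorer in as a callable keeps the proposal rules independent of the judge, so a real VLM call could replace `toy_score` without changing the selection logic.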