Prompts to Summaries: Zero-Shot Language-Guided Video Summarization with Large Language and Video Models

Mario Barbara, Alaa Maalouf

Abstract

The explosive growth of video data has intensified the need for flexible, user-controllable summarization tools that operate without training data. Existing methods either rely on domain-specific datasets, which limits generalization, or cannot incorporate user intent expressed in natural language. We introduce Prompts-to-Summaries: the first zero-shot, text-queryable video summarizer. It converts captions from off-the-shelf video-language models (VidLMs) into user-guided skims by using large language models (LLMs) as judges, without any training data, beating unsupervised methods and matching supervised ones. Our pipeline (i) segments video into scenes, (ii) produces scene descriptions with a memory-efficient batch prompting scheme that scales to hours of video on a single GPU, (iii) scores scene importance with an LLM via tailored prompts, and (iv) propagates scores to frames using two new metrics, consistency (temporal coherence) and uniqueness (novelty), for fine-grained frame importance. On SumMe and TVSum, our approach surpasses all prior data-hungry unsupervised methods, and it performs competitively on the Query-Focused Video Summarization (QFVS) benchmark, where the competing methods require supervised frame-level importance. We also release VidSum-Reason, a query-driven dataset featuring long-tailed concepts and multi-step reasoning, for which our framework serves as the first challenging baseline. Overall, we demonstrate that pretrained multi-modal models, when orchestrated with principled prompting and score propagation, provide a powerful foundation for universal, text-queryable video summarization.
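
As a rough illustration of step (iv), the sketch below spreads scene-level scores to frames, min-max normalizes them, and applies a moving average. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `propagate_scene_scores`, the end-exclusive `(start, end)` scene layout, and the `window` parameter are all hypothetical, and the paper's consistency and uniqueness metrics are not defined in this excerpt, so they are left out.

```python
from typing import List, Tuple


def propagate_scene_scores(
    scenes: List[Tuple[int, int]],
    scene_scores: List[float],
    num_frames: int,
    window: int = 5,
) -> List[float]:
    """Spread scene-level scores to frames, min-max normalize, then smooth.

    scenes: (start, end) frame indices per scene, end exclusive (assumed layout).
    window: moving-average width in frames (hypothetical parameter).
    """
    if len(scenes) != len(scene_scores):
        raise ValueError("one score per scene is required")
    if num_frames <= 0:
        return []

    # Every frame inherits the score of the scene that contains it.
    frame_scores = [0.0] * num_frames
    for (start, end), score in zip(scenes, scene_scores):
        for i in range(max(start, 0), min(end, num_frames)):
            frame_scores[i] = score

    # Min-max normalization to [0, 1]; a constant signal maps to all zeros.
    lo, hi = min(frame_scores), max(frame_scores)
    span = hi - lo
    normalized = [(s - lo) / span if span > 0 else 0.0 for s in frame_scores]

    # Centered moving average; the window is truncated at the video edges.
    half = window // 2
    smoothed = []
    for i in range(num_frames):
        left, right = max(0, i - half), min(num_frames, i + half + 1)
        smoothed.append(sum(normalized[left:right]) / (right - left))
    return smoothed


# Two 4-frame scenes with made-up LLM scores of 2.0 and 8.0.
frame_importance = propagate_scene_scores([(0, 4), (4, 8)], [2.0, 8.0], 8, window=3)
```

Smoothing softens the jump at the scene boundary, so frames near a high-scoring scene pick up some importance. A real system would presumably replace the plain moving average with the paper's own propagation rule.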

Paper Structure

This paper contains 95 sections, 7 equations, 17 figures, 8 tables, 4 algorithms.

Figures (17)

  • Figure 1: Prompts-to-Summaries overview. Our zero-shot pipeline turns any video and text query into a tailored highlight reel. We (a) detect scenes and snap boundaries with visual embeddings, (b) caption each scene via a video-language model, and (c) let an LLM rank each caption by its relevance to the query and its importance within the whole video. (d) Smoothed relevance scores weight every frame, yielding a query-aligned summary. The pipeline needs no extra training, is ready for personal highlights, generalizes across domains, and handles long-tailed, reasoning-heavy queries.
  • Figure 2: Illustrative example of our video summarization pipeline. The figure presents the step-by-step transformation for two sample videos from the SumMe dataset (Video_5, Video_9) and two from the TVSum dataset (Video_26, Video_41). (a) Initial scene boundaries are detected from raw frames. (b) Scene boundaries are refined based on embedding similarity. (c) Scene-level scores are computed by a language model conditioned on a user query, using descriptions of each scene generated by a VidLM. (d) Scores are normalized and temporally smoothed to produce frame-level importance scores. (e) The predicted frame-level scores are compared with averaged user annotations, highlighting alignment between the model’s predictions and user intent.
  • Figure 3: Visualization of selections made by our method, user summaries, and KTS segments. Blue bars: human summaries; yellow bars: final summary fragments; green bars: our model’s selected frames; red bars: frames not selected. Video frames from each selected shot are included for better visualization.
  • Figure 4: Result of our method on the QFVS dataset. The middle columns framed by black outlines showcase frames sampled from Video 3 (3 hours long) in the QFVS dataset. Summary 1 shows sampled frames for the query "Focus on scenes containing Chair and/or Tree." Summary 2 shows sampled frames for the query "Focus on scenes containing Food and/or Hands." Green boxes highlight the entities specified in each query.
  • Figure 5: Visualization of query-guided video summarization. The top plot shows ground-truth importance scores (blue) with user summary in red boxes; the bottom plot shows predicted scores with summaries in green boxes. Sample frames illustrate alignment with the query "Highlight crossover vehicles." Frames common to both summaries are highlighted in green; key events in black.
  • ...and 12 more figures