V2Xum-LLM: Cross-Modal Video Summarization with Temporal Prompt Instruction Tuning

Hang Hua, Yolo Yunlong Tang, Chenliang Xu, Jiebo Luo

TL;DR

The paper tackles the scarcity of large-scale cross-modal video summarization data by introducing Instruct-V2Xum, a dataset of 30,000 videos whose text summaries reference frame indexes. Its model, V2Xum-LLaMA, unifies V2V, V2T, and V2VT summarization in a single LLM decoder using interleaved frames and temporal prompts. The authors also propose the F_CLIP and Cross-F_CLIP evaluation metrics to better reflect semantic similarity and cross-modal alignment, and report that V2Xum-LLaMA outperforms strong baselines on multiple summarization tasks.

Abstract

Video summarization aims to create short, accurate, and cohesive summaries of longer videos. Various video summarization datasets exist, but they contain too few source videos to train advanced large vision-language models (VLMs) effectively. Additionally, most existing datasets are created for video-to-video summarization, overlooking the contemporary need for multimodal video content summarization. Recent efforts have expanded from unimodal to multimodal video summarization, categorizing the task into three sub-tasks based on the summary's modality: video-to-video (V2V), video-to-text (V2T), and combined video and text summarization (V2VT). However, the textual summaries in previous multimodal datasets are inadequate. To address these issues, we introduce Instruct-V2Xum, a cross-modal video summarization dataset featuring 30,000 diverse videos sourced from YouTube, with lengths ranging from 40 to 940 seconds and an average summarization ratio of 16.39%. Each video summary in Instruct-V2Xum is paired with a textual summary that references specific frame indexes, facilitating the generation of aligned video and textual summaries. In addition, we propose a new video summarization framework named V2Xum-LLM. V2Xum-LLM, specifically V2Xum-LLaMA in this study, is the first framework that unifies different video summarization tasks into the text decoder of a single large language model (LLM) and achieves task-controllable video summarization with temporal prompts and task instructions. Experiments show that V2Xum-LLaMA outperforms strong baseline models on multiple video summarization tasks. Furthermore, we propose an enhanced evaluation metric for V2V and V2VT summarization tasks.
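
To make the idea of a frame-indexed text summary concrete, here is a minimal Python sketch. It is not the authors' code. It assumes a hypothetical annotation format in which frame indexes appear in square brackets inside the text, and it assumes the summarization ratio is the number of summary frames divided by the number of source frames. The helper names `referenced_frames`, `strip_frame_refs`, and `summarization_ratio` are illustrative only.

```python
import re

# Hypothetical annotation format (an assumption, not taken from the paper):
# frame indexes are written in square brackets inside the text summary.
FRAME_REF = re.compile(r"\[(\d+)\]")


def referenced_frames(text_summary: str) -> list[int]:
    """Return the sorted, de-duplicated frame indexes cited in the text."""
    return sorted({int(m) for m in FRAME_REF.findall(text_summary)})


def strip_frame_refs(text_summary: str) -> str:
    """Return the plain text summary with the frame references removed."""
    without_refs = FRAME_REF.sub("", text_summary)
    collapsed = re.sub(r"\s+", " ", without_refs)
    # Remove the space left behind before punctuation.
    return collapsed.replace(" .", ".").replace(" ,", ",").strip()


def summarization_ratio(num_summary_frames: int, num_source_frames: int) -> float:
    """Summary length divided by source length (assumed definition)."""
    if num_source_frames <= 0:
        raise ValueError("num_source_frames must be positive")
    return num_summary_frames / num_source_frames


# Made-up example: a 20-frame video whose text summary cites three frames.
example = (
    "A chef slices vegetables [3] and adds them to a pan [7]. "
    "The dish is plated [7] [15]."
)
frames = referenced_frames(example)      # video summary (V2V)
text = strip_frame_refs(example)         # text summary (V2T)
ratio = summarization_ratio(len(frames), 20)
```

Under these assumptions, a single annotated string yields both the keyframe set and the plain text, which is what lets one annotation serve the V2V, V2T, and V2VT tasks.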

Paper Structure

This paper contains 31 sections, 9 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Illustration of cross-modal video summarization.
  • Figure 2: The architecture of the proposed V2Xum-LLaMA.
  • Figure 3: Comparison of the Vera scores for the dataset before (left) and after (right) refinement, with higher scores indicating better results.
  • Figure 4: Comparison of the Grammar scores for the dataset before (left) and after (right) refinement, with higher scores indicating better results.
  • Figure 5: Some videos with video summaries and text summaries annotations from our Instruct-V2Xum dataset.
  • ...and 2 more figures