Language-Guided Graph Representation Learning for Video Summarization

Wenrui Li, Wei Han, Hengyu Man, Wangmeng Zuo, Xiaopeng Fan, Yonghong Tian

TL;DR

This work tackles personalized video summarization by enabling language-guided, query-focused control while preserving global temporal and semantic dependencies. It introduces LGRLN, a lightweight graph-based framework that transforms video frames into forward, backward, and undirected graphs, applies a dual-threshold relational reasoning module, and fuses textual queries through a language-guided cross-modal embedding, trained with a mixture-of-Bernoulli EM objective. Empirical results on SumMe, TVSum, VideoXum, and QFVS demonstrate state-of-the-art or competitive performance with a fraction of the parameters of large-language-model-based approaches, highlighting efficiency and scalability in resource-constrained settings. The approach is practical for personalized, multimodal video summarization and lays a foundation for on-device, cross-modal video understanding with controlled fusion of language inputs.

Abstract

With the rapid growth of video content on social media, video summarization has become a crucial task in multimedia processing. However, existing methods face challenges in capturing global dependencies in video content and accommodating multimodal user customization. Moreover, temporal proximity between video frames does not always correspond to semantic proximity. To tackle these challenges, we propose a novel Language-guided Graph Representation Learning Network (LGRLN) for video summarization. Specifically, we introduce a video graph generator that converts video frames into a structured graph. By constructing forward, backward, and undirected graphs, the generator preserves the temporal order and contextual relationships of the video content. We design an intra-graph relational reasoning module with a dual-threshold graph convolution mechanism, which separates semantically relevant frames from irrelevant ones among the graph nodes. Additionally, our proposed language-guided cross-modal embedding module generates video summaries guided by specific textual descriptions. We model the summary generation output as a mixture of Bernoulli distributions and solve it with the EM algorithm. Experimental results show that our method outperforms existing approaches across multiple benchmarks. Moreover, the proposed LGRLN reduces inference time and model parameters by 87.8% and 91.7%, respectively. Our code and pre-trained models are available at https://github.com/liwrui/LGRLN.
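The idea of forward, backward, and undirected graphs over frames can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `build_video_graphs`, the use of cosine similarity, and the single `threshold` parameter are assumptions made for illustration only, and the dual-threshold mechanism named in the abstract is not reproduced here because the abstract does not specify it.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def build_video_graphs(frames, threshold=0.5):
    """Hypothetical helper: return (forward, backward, undirected) adjacency matrices.

    Assumption: two frames are connected when their cosine similarity reaches
    `threshold`, regardless of how far apart they are in time.
    """
    n = len(frames)
    forward = [[0] * n for _ in range(n)]
    backward = [[0] * n for _ in range(n)]
    undirected = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if cosine(frames[i], frames[j]) >= threshold:
                forward[i][j] = 1       # edge from earlier frame i to later frame j
                backward[j][i] = 1      # edge from later frame j to earlier frame i
                undirected[i][j] = 1    # symmetric edge, no temporal direction
                undirected[j][i] = 1
    return forward, backward, undirected


# Toy features: frames 0, 1, and 3 look alike, frame 2 is different.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, 0.05]]
fwd, bwd, und = build_video_graphs(frames, threshold=0.5)
print(fwd[0])  # frame 0 links to frames 1 and 3, but not to the temporally closer frame 2
```

In this toy input, frame 0 is connected to frame 3 while skipping frame 2, which mirrors the abstract's point that temporal proximity does not always correspond to semantic proximity.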

Paper Structure

This paper contains 34 sections, 14 equations, 8 figures, 8 tables, 1 algorithm.

Figures (8)

  • Figure 1: The motivation of this paper. First, compared to RNN or attention-based methods, graph neural networks (GNNs) more naturally integrate multimodal information using heterogeneous graphs, while also reducing computational complexity as edge sparsity increases. Second, GNNs represent videos as graphs and conduct video summarization through node classification, providing strong interpretability. Finally, given that a video in the dataset may have multiple labels, directly averaging these labels could result in information loss, making it crucial to employ appropriate methods for processing distinct labels. To address label discrepancies in the TVSum and SumMe datasets, where cosine similarity and Hamming distance may vary between annotators, our method applies weighted averaging within an EM algorithm framework, improving label consistency and relevance in the generated summaries.
  • Figure 2: The overall architecture of the proposed method. Input features from the video and text modalities are fused by the language-guided cross-modal embedding module. The fused input is then transformed into three graphs by the graph generator, and these graphs are fed into the three branches of the intra-graph relational reasoning module. The outputs of all branches are summed to produce the model's prediction. Finally, BCE loss is computed between the prediction and the ground truth.
  • Figure 3: Learning convergence curves on mainstream datasets.
  • Figure 4: Learning convergence curves on QFVS dataset.
  • Figure 5: The results of manually annotated summary labels for several videos in the TVSum and SumMe datasets after ISOMAP dimensionality reduction and PCA dimensionality reduction.
  • ...and 3 more figures
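
The Figure 1 caption refers to weighted averaging of annotator labels within an EM framework, and the abstract models the output as a mixture of Bernoulli distributions. The sketch below shows the textbook EM updates for a Bernoulli mixture on binary label vectors. It is an illustration under stated assumptions, not the paper's training procedure: the function name `em_bernoulli_mixture`, the initial values, and the toy annotations are all hypothetical.

```python
import math


def em_bernoulli_mixture(X, mu, pi, n_iter=50, eps=1e-6):
    """Textbook EM for a mixture of Bernoulli distributions (hypothetical helper).

    X  : N binary vectors of length D (e.g. one per annotator, one entry per frame)
    mu : K initial mean vectors of length D, entries in (0, 1)
    pi : K initial mixing weights
    Returns (mu, pi, resp), where resp[n][k] is the responsibility of component k for X[n].
    """
    N, D, K = len(X), len(X[0]), len(mu)
    mu = [[min(max(m, eps), 1 - eps) for m in row] for row in mu]
    pi = list(pi)
    resp = []
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component, computed in log space.
        resp = []
        for x in X:
            logp = [
                math.log(pi[k])
                + sum(math.log(mu[k][d]) if x[d] else math.log(1 - mu[k][d])
                      for d in range(D))
                for k in range(K)
            ]
            top = max(logp)
            w = [math.exp(v - top) for v in logp]
            total = sum(w)
            resp.append([v / total for v in w])
        # M-step: each mean is a responsibility-weighted average of the labels.
        for k in range(K):
            nk = sum(r[k] for r in resp)
            pi[k] = max(nk / N, eps)  # clipped so the next log(pi[k]) stays finite
            for d in range(D):
                avg = sum(resp[n][k] * X[n][d] for n in range(N)) / max(nk, eps)
                mu[k][d] = min(max(avg, eps), 1 - eps)
    return mu, pi, resp


# Toy annotations: two annotators pick the first half, two pick the second half.
X = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
mu0 = [[0.6, 0.6, 0.4, 0.4], [0.4, 0.4, 0.6, 0.6]]
mu, pi, resp = em_bernoulli_mixture(X, mu0, [0.5, 0.5])
print([round(m, 3) for m in mu[0]], [round(p, 3) for p in pi])
```

Averaging the four toy annotations directly would give 0.5 for every frame, which carries no preference at all. The mixture instead keeps the two annotation styles as separate components, which is the kind of information loss the Figure 1 caption warns about.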