VideoSAGE: Video Summarization with Graph Representation Learning

Jose M. Rojas Chaves, Subarna Tripathi

TL;DR

VideoSAGE addresses efficient video summarization by constructing a sparse graph where each node is a frame and edges connect temporally close frames; the summarization task is cast as binary node classification on this graph. The model uses three parallel GNNs (forward, backward, undirected) with shared second-layer weights to capture short- and long-range temporal dependencies, with initial edges formed by a maximum time distance $T$. Experiments on SumMe and TVSum show that VideoSAGE achieves comparable or better correlation metrics (Kendall’s τ and Spearman’s ρ) than state-of-the-art methods while offering an order-of-magnitude improvement in inference time and memory footprint. The approach demonstrates practical scalability for large video collections and invites further exploration of sparse graph representations for video understanding.

Abstract

We propose a graph-based representation learning framework for video summarization. First, we convert an input video to a graph whose nodes correspond to the video frames. Then, we impose sparsity on the graph by connecting only those pairs of nodes that are within a specified temporal distance. We then formulate video summarization as a binary node classification problem: classifying each video frame according to whether it should belong to the output summary video. A graph constructed this way aims to capture long-range interactions among video frames, and the sparsity ensures that the model trains without hitting memory and compute bottlenecks. Experiments on two datasets (SumMe and TVSum) demonstrate the effectiveness of the proposed nimble model compared to existing state-of-the-art summarization approaches, while being one order of magnitude more efficient in compute time and memory.
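The graph construction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `build_temporal_edges`, the parameter `max_dist` (standing in for the maximum temporal distance), and the convention that a "forward" edge points from an earlier frame to a later one are all assumptions made for illustration. Frames are identified only by their index, and feature extraction is omitted.

```python
from typing import List, Tuple

Edge = Tuple[int, int]  # (source frame index, target frame index)


def build_temporal_edges(num_frames: int, max_dist: int):
    """Return forward, backward, and undirected edge lists for a frame graph.

    Hypothetical helper: frames i and j are linked when 0 < |i - j| <= max_dist.
    """
    forward: List[Edge] = []
    backward: List[Edge] = []
    for i in range(num_frames):
        last = min(i + max_dist, num_frames - 1)
        for j in range(i + 1, last + 1):
            forward.append((i, j))   # assumed convention: earlier -> later
            backward.append((j, i))  # later -> earlier
    # The undirected graph is stored as both directions of every pair.
    undirected = forward + backward
    return forward, backward, undirected


# Toy example: 6 frames, neighbours at most 2 frames apart.
fwd, bwd, und = build_temporal_edges(num_frames=6, max_dist=2)
dense_pairs = 6 * 5 // 2  # frame pairs in a fully connected graph
print(len(fwd), "sparse pairs vs", dense_pairs, "dense pairs")
```

With a fixed maximum distance, the number of connected pairs grows linearly with the number of frames, whereas a fully connected graph grows quadratically. This is the sparsity that the abstract credits for the reduced memory and compute cost.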

Paper Structure (14 sections, 6 figures, 4 tables)

Figures (6)

  • Figure 1: VideoSAGE constructs a graph from the input video, where each node corresponds to a video frame. Only pairs of nodes that lie within a specified temporal distance of each other are connected. Video summarization is thus formulated as a binary node classification problem on that graph. Our constructed graph has forward, backward, and bidirectional edges; for visual clarity, only the bidirectional edges are shown in this figure. From top row to bottom row, the figure shows how a regular input video is converted to a sparse graph, followed by binary node classification that yields the summarized output video.
  • Figure 2: An illustration of the Bi-directional (a.k.a. Bi-dir) GNN model used for video summarization. It has three separate GNN modules for the forward, backward, and undirected graphs, respectively. Each module has three layers, and the weights of the second layer are shared by all three modules. The second layer is placed inside a solid-lined box to indicate the weight sharing, while the first and third layers are drawn with dotted lines. E-CONV and S-CONV are shorthand for EDGECONV and SAGE-CONV, respectively.
  • Figure 3: TVSum/Video_16: Snapshots comparing shots from the summary predicted by VideoSAGE (ours, top) with a GT summary built from ground-truth scores (bottom).
  • Figure 4: SumMe/Video_1: Comparison of importance scores and selected summary segments for VideoSAGE (ours), a ground truth, and PGL-SUM [pglsum2021].
  • Figure 5: TVSum/Video_16: Comparison of importance scores and selected summary segments for VideoSAGE (ours), a ground truth, and PGL-SUM [pglsum2021].
  • ...and 1 more figures
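
The weight-sharing layout described in the Figure 2 caption (three modules of three layers each, with the second layer shared) can be sketched in plain Python. This is a rough illustration under several assumptions, not the paper's model: EDGECONV and SAGE-CONV are replaced by a simplified mean-aggregation layer, the weights are random and untrained, and summing the three branch outputs before a sigmoid is a guess at how the branches are combined. All function names here are hypothetical.

```python
import math
import random
from typing import List, Tuple

Edge = Tuple[int, int]
Vector = List[float]
Matrix = List[List[float]]


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def layer(feats: List[Vector], edges: List[Edge], weight: Matrix,
          relu: bool = True) -> List[Vector]:
    """Simplified message passing (a stand-in for E-CONV / S-CONV):
    average each node with its in-neighbours, then apply a linear map."""
    groups = [[v] for v in range(len(feats))]  # every node includes itself
    for src, dst in edges:
        groups[dst].append(src)
    out = []
    for members in groups:
        mean = [sum(col) / len(members)
                for col in zip(*(feats[u] for u in members))]
        hidden = [sum(w * x for w, x in zip(row, mean)) for row in weight]
        out.append([max(0.0, h) for h in hidden] if relu else hidden)
    return out


def bidir_scores(feats, fwd, bwd, und, in_dim, hid_dim, seed=0):
    """Per-frame scores from three branches that share their second layer."""
    rng = random.Random(seed)
    shared_w2 = random_matrix(hid_dim, hid_dim, rng)  # reused by all branches
    logits = [0.0] * len(feats)
    for edges in (fwd, bwd, und):
        w1 = random_matrix(hid_dim, in_dim, rng)  # branch-specific first layer
        w3 = random_matrix(1, hid_dim, rng)       # branch-specific third layer
        h = layer(feats, edges, w1)
        h = layer(h, edges, shared_w2)
        h = layer(h, edges, w3, relu=False)
        # Assumption: branch outputs are combined by summation.
        logits = [acc + row[0] for acc, row in zip(logits, h)]
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


# Toy input: 4 frames with 2-dimensional features, neighbours 1 frame apart.
fwd = [(i, i + 1) for i in range(3)]
bwd = [(j, i) for i, j in fwd]
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
scores = bidir_scores(feats, fwd, bwd, fwd + bwd, in_dim=2, hid_dim=3)
keep = [s >= 0.5 for s in scores]  # binary node classification per frame
```

Because `shared_w2` is a single object passed to every branch, any update to it during training would affect all three modules at once, which is the effect the solid-lined box in the figure is meant to convey.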