CSTA: CNN-based Spatiotemporal Attention for Video Summarization

Jaewon Son, Jaehun Park, Kwangsu Kim

TL;DR

Video summarization requires capturing essential moments while reducing duration. This work introduces CNN-based SpatioTemporal Attention (CSTA), which treats stacked frame features as image-like inputs and uses a 2D CNN to learn inter- and intra-frame relations along with absolute positional cues, augmented by a CLS token and adaptive pooling. CSTA achieves state-of-the-art performance on SumMe and competitive results on TVSum with significantly fewer MACs, demonstrating efficiency gains from a single CNN-driven attention mechanism rather than additional spatial modules. The approach is validated through extensive ablations, verification that CNNs can generate attention maps from frame features, and a comprehensive MACs comparison, underscoring the practical impact of CNN-based attention in video summarization.
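
The MACs comparison mentioned above can be made concrete with a back-of-the-envelope count. The sketch below is illustrative only and is not taken from the paper: the function names, the single-channel 3x3 kernel, and the choice to ignore projection layers are all assumptions. It shows that a stride-1 sliding-window convolution over a T x D grid grows linearly in the number of frames T, while dot-product attention scores grow quadratically.

```python
def conv2d_macs(height, width, c_in, c_out, kernel):
    """MACs of a stride-1, same-padded 2D convolution over a height x width grid.

    Each output position in each output channel needs
    c_in * kernel * kernel multiply-accumulates.
    """
    return height * width * c_in * c_out * kernel * kernel


def attention_macs(num_frames, dim):
    """MACs of dot-product attention, ignoring the Q/K/V projections.

    Q K^T costs num_frames * num_frames * dim MACs, and the weighted
    sum over V costs the same amount again.
    """
    return 2 * num_frames * num_frames * dim


if __name__ == "__main__":
    # Hypothetical sizes, chosen only to show the scaling behaviour.
    for t in (100, 200, 400):
        print(t, conv2d_macs(t, 1024, 1, 1, 3), attention_macs(t, 1024))
```

Doubling T doubles the convolution count and quadruples the attention count. The figures reported in the paper depend on the actual backbone and are not reproduced by this toy calculation.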

Abstract

Video summarization aims to generate a concise representation of a video, capturing its essential content and key moments while reducing its overall length. Although several methods employ attention mechanisms to handle long-term dependencies, they often fail to capture the visual significance inherent in frames. To address this limitation, we propose a CNN-based SpatioTemporal Attention (CSTA) method that stacks the frame features of a single video into an image-like representation and applies a 2D CNN to it. Our method relies on the CNN to capture inter- and intra-frame relations and to find crucial attributes in videos by exploiting its ability to learn absolute positions within images. In contrast to previous work that compromises efficiency by adding modules to focus on spatial importance, CSTA incurs minimal computational overhead because it uses the CNN as a sliding window. Extensive experiments on two benchmark datasets (SumMe and TVSum) demonstrate that our approach achieves state-of-the-art performance with fewer MACs than previous methods. Code is available at https://github.com/thswodnjs3/CSTA.
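
The core idea in the abstract (stack per-frame features into a T x D grid, slide a 2D kernel over it, and read the result as an attention map) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the single hand-written kernel stands in for a full 2D CNN, the softmax axis (across frames) is an assumption, and every function name here is hypothetical. The CLS token and adaptive pooling are omitted.

```python
import math


def stack_frames(frame_features):
    """Stack T per-frame feature vectors (each of length D) into a T x D grid."""
    dim = len(frame_features[0])
    assert all(len(f) == dim for f in frame_features), "all frames need the same D"
    return [list(f) for f in frame_features]


def conv2d_same(grid, kernel):
    """Slide an odd-sized k x k kernel over the grid with zero padding, stride 1."""
    t, d = len(grid), len(grid[0])
    k = len(kernel)
    pad = k // 2
    out = [[0.0] * d for _ in range(t)]
    for i in range(t):
        for j in range(d):
            acc = 0.0
            for a in range(k):
                for b in range(k):
                    r, c = i + a - pad, j + b - pad
                    if 0 <= r < t and 0 <= c < d:  # outside the grid counts as zero
                        acc += kernel[a][b] * grid[r][c]
            out[i][j] = acc
    return out


def softmax_over_frames(scores):
    """Normalize each feature column across the T frames so it sums to 1."""
    t, d = len(scores), len(scores[0])
    out = [[0.0] * d for _ in range(t)]
    for j in range(d):
        col = [scores[i][j] for i in range(t)]
        peak = max(col)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in col]
        total = sum(exps)
        for i in range(t):
            out[i][j] = exps[i] / total
    return out


def attend(frame_features, kernel):
    """Return the T x D attention map and the element-wise weighted features."""
    grid = stack_frames(frame_features)
    attn = softmax_over_frames(conv2d_same(grid, kernel))
    weighted = [[a * x for a, x in zip(a_row, x_row)]
                for a_row, x_row in zip(attn, grid)]
    return attn, weighted


if __name__ == "__main__":
    # Toy input: T = 4 frames, D = 3 feature dimensions, 3x3 averaging kernel.
    features = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0], [2.0, 2.0, 2.0], [1.0, 0.0, 1.0]]
    kernel = [[1.0 / 9.0] * 3 for _ in range(3)]
    attn, weighted = attend(features, kernel)
    for row in attn:
        print([round(v, 3) for v in row])
```

Because the kernel spans neighbouring rows and columns at once, each attention value mixes information across adjacent frames (inter-frame) and adjacent feature dimensions (intra-frame), which is the property the abstract attributes to the 2D CNN.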

Paper Structure

This paper contains 37 sections, 11 equations, 6 figures, 13 tables, 1 algorithm.

Figures (6)

  • Figure 1: Approaches for calculating attention. Each row is the feature vector of a frame. T is the number of frames, and D is the dimension of the feature.
  • Figure 2: Workflow of CSTA
  • Figure 3: Architecture of CSTA
  • Figure 4: Comparison of summarizing performance between CNN and video summarization models. The x-axis shows performance, and the y-axis shows model names. CNN models are shown above the dashed line, and video summarization models are shown below it.
  • Figure 5: Visualization and comparison of summary videos generated by different models. The images at the top are the frames selected by CSTA as parts of the summary video. The graphs below show which frames each model picks as keyframes, with one row per model. The x-axis is the frame order, and the black boxes mark the ground-truth frames. Colored regions are the frames a model selects, and white regions are the frames it leaves out.
  • ...and 1 more figures