EDSNet: Efficient-DSNet for Video Summarization

Ashish Prasad; Pranav Jeevan; Amit Sethi

TL;DR

This work enhances the Detect-to-Summarize Network (DSNet) with more resource-efficient token-mixing mechanisms, and shows that replacing traditional attention with alternatives such as the Fourier transform, the wavelet transform, and Nyströmformer improves efficiency and performance.

Abstract

Current video summarization methods largely rely on transformer-based architectures, which require substantial computational resources because attention scales quadratically with sequence length. In this work, we address these inefficiencies by enhancing the Detect-to-Summarize Network (DSNet) with more resource-efficient token-mixing mechanisms. We show that replacing traditional attention with alternatives such as the Fourier transform, the wavelet transform, and Nyströmformer improves efficiency and performance. Furthermore, we explore various pooling strategies within the Region Proposal Network, including ROI pooling, Fast Fourier Transform pooling, and flat pooling. Our experimental results on the TVSum and SumMe datasets demonstrate that these modifications significantly reduce computational costs while maintaining competitive summarization performance. Thus, our work offers a more scalable solution for video summarization tasks.
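To make the idea of attention-free token mixing concrete, here is a minimal sketch of Fourier-based mixing over a sequence of frame features. It assumes an FNet-style formulation (a 2-D discrete Fourier transform with only the real part kept); the function name `fourier_token_mix` and the toy sizes are illustrative, and the exact layer used in EDSNet may differ.

```python
import numpy as np


def fourier_token_mix(x: np.ndarray) -> np.ndarray:
    """Mix information across frames and features with a 2-D DFT.

    x has shape (N, F): N frames, F features per frame.
    Returns a real-valued array of the same shape.
    """
    # FFT along the feature axis, then along the frame (token) axis.
    # Keeping only the real part gives a parameter-free mixing step
    # (assumption: FNet-style mixing, not necessarily the paper's layer).
    return np.fft.fft(np.fft.fft(x, axis=-1), axis=0).real


rng = np.random.default_rng(0)
frames = rng.standard_normal((8, 4))  # 8 frames, 4-dim features (toy sizes)
mixed = fourier_token_mix(frames)
```

Unlike self-attention, this step has no learned weights and costs O(N log N) in the number of frames when computed with the FFT, which is the kind of saving the abstract refers to.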

Paper Structure (15 sections, 5 figures, 2 tables)

Introduction
Related Work
Approach
Feature Extraction
Region Proposal Network
Feature Extraction for Segment Proposals
Classification and Localization
Experiments
Datasets
Implementation Details
Results and Discussions
Ablation Studies
Segment Length
Fully connected layer depth analysis
Conclusion

Figures (5)

Figure 1: Plot comparing model accuracy (F1 %) versus number of parameters for the TVSum and SumMe datasets, showing that EDSNet models outperform others while remaining parameter-efficient. EDSNet models are circled.
Figure 2: The model architecture of EDSNet illustrates the video summarization process, starting with a CNN feature extractor and a token-mixer for feature extraction. The outputs are refined using a fully connected layer, followed by region proposal generation and segment feature extraction. Finally, classification and localization are performed through fully connected layers to provide classification scores and segment boundary offsets, enabling accurate summarization and temporal localization of important video segments.
Figure 3: The DWT token-mixer module uses the 1-D DWT for token mixing in video frame feature extraction, decomposing inputs into approximation and detail coefficients. It employs normalization and 1-D transposed convolutions to stabilize training and refine temporal resolution. $N$ is the number of frames and $F$ is the feature dimension.
Figure 4: The segment feature extractor applies pooling operations along the temporal dimension of each segment. The pooled output is flattened and averaged to obtain coarse features, which are then passed through a fully connected layer with ReLU activation to extract fine-grained features.
Figure 5: Comparison of Accuracy for Different token-mixing Methods at Varying FC Depths for SumMe and TVSum Datasets.
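
The decomposition into approximation and detail coefficients mentioned in the Figure 3 caption can be illustrated with a one-level 1-D DWT along the frame axis. This is a sketch under stated assumptions: it uses the Haar wavelet on a single feature channel, and the helper names `haar_dwt_1d` and `haar_idwt_1d` are hypothetical; the wavelet family and the surrounding layers in EDSNet may differ.

```python
import math


def haar_dwt_1d(signal):
    """One level of the 1-D Haar DWT on an even-length sequence."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    s = 1 / math.sqrt(2)
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal), 2)]
    approx = [(a + b) * s for a, b in pairs]  # low-pass: local averages
    detail = [(a - b) * s for a, b in pairs]  # high-pass: local differences
    return approx, detail


def haar_idwt_1d(approx, detail):
    """Invert haar_dwt_1d, restoring the original temporal resolution."""
    s = 1 / math.sqrt(2)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) * s, (a - d) * s])
    return out


# One feature channel observed over N = 8 frames (toy values).
channel = [4.0, 2.0, 5.0, 5.0, 1.0, 3.0, 0.0, 2.0]
approx, detail = haar_dwt_1d(channel)
restored = haar_idwt_1d(approx, detail)
```

Each level halves the temporal length, which is why an upsampling step such as a transposed convolution is needed to return to $N$ frames; here the exact inverse transform stands in for that learned step.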
