Your Interest, Your Summaries: Query-Focused Long Video Summarization
Nirav Patel, Payal Prajapati, Maitrik Shah
TL;DR
This paper tackles query-focused video summarization for long videos by introducing FCSNA-QFVS, a Fully Convolutional Sequence Network augmented with Local Self-Attention, Query-Guided Segment-Level Attention, and Global Attention. The architecture comprises a feature learning module built from eight 1D temporal convolution blocks, a fusion stage that combines the three attention mechanisms, a deconvolution layer that restores the original temporal length, and a shot scoring module that predicts each shot's relevance to the user query. The top 2% of shots by score form the final summary. Evaluated on a benchmark egocentric QFVS dataset, the method achieves higher F1 scores than prior approaches, and qualitative analyses illustrate query relevance and shot selection. Overall, the approach enables efficient, parallelizable processing of long videos with improved alignment to user queries, making automated summaries more useful in real-world applications.
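The final selection step described above (keeping the top 2% of shots by predicted relevance) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `select_top_shots`, the rounding rule, the one-shot minimum, and the return of indices in temporal order are all assumptions made for the example.

```python
def select_top_shots(scores, fraction=0.02):
    """Return indices of the highest-scoring shots, sorted in temporal order.

    `scores` holds one query-relevance score per shot (hypothetical input;
    in the paper these come from the shot scoring module).
    """
    if not scores:
        return []
    # Assumption: round to the nearest shot count and keep at least one shot.
    k = max(1, round(fraction * len(scores)))
    # Rank shot indices by score, highest first (ties keep earlier shots first).
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore temporal order so the summary plays back chronologically.
    return sorted(ranked[:k])


if __name__ == "__main__":
    # 100 shots with two clearly relevant ones: 2% of 100 keeps exactly two.
    demo_scores = [0.1] * 100
    demo_scores[7] = 0.9
    demo_scores[42] = 0.8
    print(select_top_shots(demo_scores))
```

Sorting the selected indices back into temporal order is a design choice for the sketch: it keeps the summary chronological regardless of how the shots ranked against each other.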
Abstract
Generating a concise and informative summary of a long video is important, yet subjective, because the importance of a scene varies from one viewer to another. Letting users specify what matters to them through text queries makes the resulting summaries more relevant. This paper addresses query-focused video summarization, which aims to align video summaries closely with user queries. To this end, we propose the Fully Convolutional Sequence Network with Attention (FCSNA-QFVS), a novel model designed for this task. Leveraging temporal convolution and attention mechanisms, our model extracts and highlights the content most relevant to a user-specified query. Experimental validation on a benchmark dataset for query-focused video summarization demonstrates the effectiveness of our approach.
