Cluster-based Video Summarization with Temporal Context Awareness

Hai-Dang Huynh-Lam, Ngoc-Phuong Ho-Thi, Minh-Triet Tran, Trung-Nghia Le

TL;DR

The paper targets unsupervised video summarization and the lack of temporal coherence in traditional cluster-based methods. It introduces TAC-SUM, a training-free pipeline that embeds temporal context into clustering by converting global contextual embeddings into temporally aware semantic partitions, followed by simple, rule-based keyframe selection and frame importance scoring. On SumMe, TAC-SUM outperforms existing unsupervised approaches and remains competitive with several supervised methods, highlighting a practical, interpretable alternative that avoids labeled data. By leveraging contextual embeddings from pre-trained models, hierarchical clustering, and partition-based scoring, TAC-SUM achieves efficient, scalable summarization with transparent decision processes.

Abstract

In this paper, we present TAC-SUM, a novel and efficient training-free approach for video summarization that addresses the limitations of existing cluster-based models by incorporating temporal context. Our method partitions the input video into temporally consecutive segments with clustering information, enabling the injection of temporal awareness into the clustering process, setting it apart from prior cluster-based summarization methods. The resulting temporal-aware clusters are then utilized to compute the final summary, using simple rules for keyframe selection and frame importance scoring. Experimental results on the SumMe dataset demonstrate the effectiveness of our proposed approach, outperforming existing unsupervised methods and achieving comparable performance to state-of-the-art supervised summarization techniques. Our source code is available for reference at https://github.com/hcmus-thesis-gulu/TAC-SUM.
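
To make the idea in the abstract more concrete, here is a minimal Python sketch of turning per-frame cluster labels into temporally consecutive segments and then applying simple rules for keyframe selection and frame scoring. It is an illustration only, not the authors' implementation: the function names, the choice of the middle frame as keyframe, and the length-proportional flat score are all assumptions made here, and the per-frame labels are assumed to come from some earlier clustering of frame embeddings.

```python
from itertools import groupby


def temporal_segments(labels):
    """Split per-frame cluster labels into runs of consecutive equal labels.

    Returns a list of (start, end_exclusive, label) tuples.
    """
    segments = []
    start = 0
    for label, run in groupby(labels):
        length = sum(1 for _ in run)
        segments.append((start, start + length, label))
        start += length
    return segments


def select_keyframes(segments):
    # Assumed rule: take the middle frame of every segment as its keyframe.
    return [(start + end - 1) // 2 for start, end, _ in segments]


def flat_scores(segments, num_frames):
    # Assumed rule: every frame in a segment shares one score,
    # here the segment's length as a fraction of the whole video.
    scores = [0.0] * num_frames
    for start, end, _ in segments:
        value = (end - start) / num_frames
        for i in range(start, end):
            scores[i] = value
    return scores


# Hypothetical cluster labels for a 10-frame video.
labels = [0, 0, 0, 1, 1, 0, 0, 2, 2, 2]
segments = temporal_segments(labels)
keyframes = select_keyframes(segments)
scores = flat_scores(segments, len(labels))
```

Note that cluster 0 appears twice in the label sequence but yields two separate segments. This is the sense in which the partitions are temporally aware: frames that look alike but occur at different times are not merged into one unit.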

Paper Structure (21 sections, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Pipeline of the proposed approach showcasing four modules and information flow across main stages.
  • Figure 2: Visual illustration of contextual information.
  • Figure 3: Overall pipeline for the Contextual Clustering step.
  • Figure 4: Comparison between cosine-interpolated scores and flat scores, demonstrated for two examples.
  • Figure 5: Comparison of importance scores between user-annotated scores and scores generated by the proposed method under the unbiased flat rule as well as the biased cosine rule.
  • ...and 1 more figures