Explicit Temporal-Semantic Modeling for Dense Video Captioning via Context-Aware Cross-Modal Interaction

Mingda Jia, Weiliang Meng, Zenghuang Fu, Yiheng Li, Qi Zeng, Yifan Zhang, Ju Xin, Rongtao Xu, Jiguang Zhang, Xiaopeng Zhang

TL;DR

This work addresses dense video captioning by replacing implicit, fragmentary temporal modeling with explicit temporal-semantic modeling through CACMI. The method has two modules: Cross-modal Frame Aggregation, which groups frames into temporally coherent pseudo-events and retrieves matching semantics from a sentence bank (Event Semantic Retrieval), and Context-aware Feature Enhancement, which bridges the visual and textual modalities. A deformable transformer-based predictor jointly outputs event boundaries, captions, and an event count, and is trained with a four-component loss and Hungarian matching. Empirically, CACMI achieves state-of-the-art performance on ActivityNet Captions and strong results on YouCook2, with better event localization and more coherent multi-event narration in untrimmed videos.
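To make the Hungarian matching step concrete, here is a minimal sketch of one-to-one matching between predicted and ground-truth event segments. It rests on assumptions not stated in this summary: the matching score is plain temporal IoU (the paper's actual cost may combine several terms), and the names `temporal_iou` and `match_events` are hypothetical. The optimal assignment is found by brute force, which gives the same result as the Hungarian algorithm on small inputs; in practice a routine such as SciPy's `scipy.optimize.linear_sum_assignment` would usually be used instead.

```python
from itertools import permutations


def temporal_iou(a, b):
    """Intersection-over-union of two (start, end) segments, in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def match_events(predicted, ground_truth):
    """Return (pred_idx, gt_idx) pairs that maximize total IoU.

    Each ground-truth event gets exactly one prediction. This is a
    brute-force search over assignments (illustrative only) and assumes
    len(predicted) >= len(ground_truth).
    """
    best_pairs, best_score = [], -1.0
    for perm in permutations(range(len(predicted)), len(ground_truth)):
        score = sum(
            temporal_iou(predicted[p], ground_truth[g])
            for g, p in enumerate(perm)
        )
        if score > best_score:
            best_score = score
            best_pairs = [(p, g) for g, p in enumerate(perm)]
    return best_pairs


# Hypothetical segments: three predictions, two annotated events.
predicted = [(0, 10), (30, 50), (12, 28)]
ground_truth = [(11, 29), (0, 9)]
pairs = match_events(predicted, ground_truth)
```

Here prediction 2 is paired with the first annotated event and prediction 0 with the second; prediction 1 stays unmatched and would typically be treated as background during training.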

Abstract

Dense video captioning jointly localizes and captions salient events in untrimmed videos. Recent methods primarily focus on leveraging additional prior knowledge and advanced multi-task architectures to achieve competitive performance. However, these pipelines rely on implicit modeling that uses frame-level or fragmented video features, failing to capture the temporal coherence across event sequences and the comprehensive semantics within visual contexts. To address this, we propose an explicit temporal-semantic modeling framework called Context-Aware Cross-Modal Interaction (CACMI), which leverages both latent temporal characteristics within videos and linguistic semantics from a text corpus. Specifically, our model consists of two core components: Cross-modal Frame Aggregation, which aggregates relevant frames to extract temporally coherent, event-aligned textual features through cross-modal retrieval; and Context-aware Feature Enhancement, which uses query-guided attention to integrate visual dynamics with pseudo-event semantics. Extensive experiments on the ActivityNet Captions and YouCook2 datasets demonstrate that CACMI achieves state-of-the-art performance on the dense video captioning task.
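The cross-modal retrieval step can be illustrated with a small sketch: given one event-level feature, rank the entries of a sentence bank by cosine similarity and keep the top matches. Everything specific here is an assumption made for illustration: the function names `cosine` and `retrieve`, the `top_k` parameter, the toy two-dimensional embeddings, and the example sentences. The actual model would use encoder features in a shared embedding space.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve(event_feature, sentence_bank, top_k=2):
    """Return the top_k sentences whose embeddings best match event_feature.

    sentence_bank is a list of (text, embedding) pairs.
    """
    scored = [(cosine(event_feature, emb), text) for text, emb in sentence_bank]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [text for _, text in scored[:top_k]]


# Toy sentence bank with made-up 2-D embeddings.
bank = [
    ("a dog runs across the yard", [0.0, 1.0]),
    ("a man chops onions", [1.0, 0.0]),
    ("she stirs the pot", [1.0, 1.0]),
]
matches = retrieve([1.0, 0.0], bank, top_k=2)
```

With these toy vectors, the query aligns exactly with "a man chops onions", partially with "she stirs the pot", and not at all with the third sentence, so the first two are returned in that order.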

Paper Structure

This paper contains 30 sections, 9 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: (a) CM$^{2}$ introduces a cross-modal memory-based model in which an external sentence bank is specifically designed to select relevant implicit semantics. (b) Our CACMI harnesses explicit temporal-semantic information through context-aware cross-modal interaction to enhance event localization and captioning performance.
  • Figure 2: Overview of our CACMI framework. We employ a retrieval-augmented generation paradigm for the DVC task. The pipeline begins with a pretrained CLIP image encoder that extracts frame-level features. (a) Cross-modal Frame Aggregation (CFA). This module comprises two synergistic components: Event Context Clustering aggregates temporally and semantically consistent frame features to generate clustered event representations, and Event Semantic Retrieval matches relevant semantic information from a sentence bank via cosine similarity to produce retrieval-enhanced semantic features. (b) Context-aware Feature Enhancement (CFE). This module facilitates cross-modal interaction between retrieved textual features and visual representations, bridging the modality gap to generate enhanced frame features. Finally, a deformable transformer equipped with multi-task heads generates the joint outputs of event localization and captioning.
  • Figure 3: Visualization of event features. The t-SNE projection illustrates a two-dimensional embedding space, where grouped points within the same cluster indicate temporal correlation and semantic similarity. This demonstrates that the frame aggregation module effectively constructs discriminative event representations while preserving meaningful temporal information.
  • Figure 4: Visualizations of dense event captioning prediction on ActivityNet Captions. We present the results of the ground truth, the baseline CM$^{2}$ and our method.
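
The Event Context Clustering idea from Figure 2, grouping frames that are both temporally adjacent and semantically similar, can be sketched as follows. This is not the paper's clustering procedure, which this summary does not specify. It is a simple greedy stand-in under stated assumptions: a new segment starts whenever the next frame's cosine similarity to the running segment mean falls below a hypothetical `threshold`. The names `group_frames` and `_cosine` and the toy features are also illustrative.

```python
import math


def _cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def _mean(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def group_frames(frames, threshold=0.8):
    """Split frames into contiguous pseudo-event segments.

    Returns a list of (start_idx, end_idx, mean_feature) tuples, where
    end_idx is inclusive. Greedy and order-dependent: illustrative only.
    """
    if not frames:
        return []
    segments = []
    start, members = 0, [frames[0]]
    for i in range(1, len(frames)):
        if _cosine(frames[i], _mean(members)) >= threshold:
            members.append(frames[i])  # frame stays in the current segment
        else:
            segments.append((start, i - 1, _mean(members)))
            start, members = i, [frames[i]]  # open a new segment
    segments.append((start, len(frames) - 1, _mean(members)))
    return segments


# Toy frame features: two frames of one scene, then two of another.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
segments = group_frames(frames, threshold=0.8)
spans = [(s, e) for s, e, _ in segments]
```

On this toy input the frames split into two segments covering indices 0 to 1 and 2 to 3. Each segment's mean feature could then serve as the query for sentence-bank retrieval.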