Explicit Temporal-Semantic Modeling for Dense Video Captioning via Context-Aware Cross-Modal Interaction
Mingda Jia, Weiliang Meng, Zenghuang Fu, Yiheng Li, Qi Zeng, Yifan Zhang, Ju Xin, Rongtao Xu, Jiguang Zhang, Xiaopeng Zhang
TL;DR
This work addresses dense video captioning by replacing implicit, fragmentary temporal modeling with explicit temporal-semantic modeling through CACMI. The method pairs Cross-modal Frame Aggregation, which forms temporally coherent pseudo-events, with Event Semantic Retrieval, which draws on a sentence bank, and adds Context-aware Feature Enhancement to bridge the visual and textual modalities. A deformable transformer-based predictor jointly outputs event boundaries, captions, and an event count, and is trained with a four-component loss and Hungarian matching. Empirically, CACMI achieves state-of-the-art performance on ActivityNet Captions and strong results on YouCook2, with improved event localization and more coherent multi-event narration in untrimmed videos.
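To make the Hungarian matching step concrete, the following is a minimal sketch of one-to-one assignment between predicted and ground-truth event segments. It is not the authors' implementation: the cost used here (1 minus temporal IoU) is an assumption, since the summary does not state the matching cost, and the names `temporal_iou` and `match_events` are hypothetical. For brevity it finds the minimum-cost assignment by brute force over permutations, which gives the same result as the Hungarian algorithm on small inputs but does not scale.

```python
from itertools import permutations


def temporal_iou(a, b):
    """Intersection over union of two (start, end) segments in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def match_events(predicted, ground_truth):
    """Return (pred_idx, gt_idx) pairs that minimise the total (1 - IoU) cost.

    Brute-force stand-in for the Hungarian algorithm.
    Assumes len(predicted) >= len(ground_truth).
    """
    best_cost, best_pairs = float("inf"), []
    for perm in permutations(range(len(predicted)), len(ground_truth)):
        # perm[g] is the prediction assigned to ground-truth event g.
        cost = sum(
            1.0 - temporal_iou(predicted[p], ground_truth[g])
            for g, p in enumerate(perm)
        )
        if cost < best_cost:
            best_cost = cost
            best_pairs = [(p, g) for g, p in enumerate(perm)]
    return best_pairs


# Assumed toy data: three predicted segments, two annotated events.
predicted = [(0.0, 12.0), (30.0, 55.0), (10.0, 28.0)]
ground_truth = [(11.0, 29.0), (0.0, 10.0)]
pairs = match_events(predicted, ground_truth)
print(pairs)
```

In this toy case the third prediction is matched to the first annotated event and the first prediction to the second, leaving the middle prediction unmatched. In set-prediction training, only matched predictions typically contribute to the localization and captioning losses.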
Abstract
Dense video captioning jointly localizes and captions salient events in untrimmed videos. Recent methods primarily focus on leveraging additional prior knowledge and advanced multi-task architectures to achieve competitive performance. However, these pipelines rely on implicit modeling that uses frame-level or fragmented video features, and so fail to capture the temporal coherence across event sequences and the comprehensive semantics within visual contexts. To address this, we propose an explicit temporal-semantic modeling framework called Context-Aware Cross-Modal Interaction (CACMI), which leverages both the latent temporal characteristics within videos and the linguistic semantics from a text corpus. Specifically, our model consists of two core components: Cross-modal Frame Aggregation, which aggregates relevant frames to extract temporally coherent, event-aligned textual features through cross-modal retrieval; and Context-aware Feature Enhancement, which uses query-guided attention to integrate visual dynamics with pseudo-event semantics. Extensive experiments on the ActivityNet Captions and YouCook2 datasets demonstrate that CACMI achieves state-of-the-art performance on the dense video captioning task.
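As an illustration of the cross-modal retrieval idea described above, the sketch below pools a group of frame features into a single pseudo-event feature and retrieves the closest sentences from a small sentence bank by cosine similarity. This is a simplified sketch under stated assumptions, not the CACMI implementation: mean pooling, cosine similarity, the toy three-dimensional embeddings, and the names `mean_pool` and `retrieve` are all assumptions, and the way frames are grouped into pseudo-events is taken as given.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def mean_pool(frames):
    """Average a list of frame feature vectors into one pseudo-event feature."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def retrieve(event_feature, sentence_bank, k=2):
    """Return the k sentences whose embeddings are most similar to the feature."""
    ranked = sorted(
        sentence_bank,
        key=lambda s: cosine(event_feature, sentence_bank[s]),
        reverse=True,
    )
    return ranked[:k]


# Assumed toy data: frames already grouped into one pseudo-event, and a
# sentence bank whose embeddings share the same space as the frame features.
frames = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]
sentence_bank = {
    "a person chops vegetables": [0.9, 0.1, 0.0],
    "a dog runs on the beach": [0.0, 0.0, 1.0],
    "someone stirs a pot": [0.3, 0.9, 0.1],
}
top = retrieve(mean_pool(frames), sentence_bank, k=2)
print(top)
```

The retrieved sentences stand in for the event-aligned textual features that would then be fused with the visual features, for example through attention.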
