Exploring Temporal Event Cues for Dense Video Captioning in Cyclic Co-learning
Zhuyang Xie, Yan Yang, Yankai Yu, Jie Wang, Yongquan Jiang, Xiao Wu
TL;DR
MCCL introduces a dense video captioning framework that fuses video-to-text retrieval, frame-level multi-concept detection, and a cyclic co-learning loop between a deformable transformer generator and a localizer to jointly enhance semantic perception and precise event localization. It employs weakly supervised MIL for frame concepts, a concept contrastive loss, and a cyclic loss that combines semantic matching, location matching, and semantic guidance to mutually reinforce captioning and localization. The method achieves state-of-the-art results on ActivityNet Captions and YouCook2 without extra pretraining, with notable gains in CIDEr and related metrics and enhanced interpretability through explicit concept cues. These contributions demonstrate the value of integrating cross-modal retrieval, temporal concept cues, and cyclic optimization for robust dense video understanding.
Abstract
Dense video captioning aims to detect and describe all events in untrimmed videos. This paper presents a dense video captioning network called Multi-Concept Cyclic Learning (MCCL), which (1) detects multiple concepts at the frame level and uses them to enhance video features and provide temporal event cues, and (2) applies cyclic co-learning between the generator and the localizer within the captioning network to promote semantic perception and event localization. Specifically, we perform weakly supervised concept detection for each frame, and the detected concept embeddings are integrated into the video features to provide event cues. Additionally, video-level concept contrastive learning is introduced to obtain more discriminative concept embeddings. In the captioning network, we establish a cyclic co-learning strategy in which the generator guides the localizer for event localization through semantic matching, while the localizer enhances the generator's event semantic perception through location matching, so that semantic perception and event localization are mutually beneficial. MCCL achieves state-of-the-art performance on the ActivityNet Captions and YouCook2 datasets. Extensive experiments demonstrate its effectiveness and interpretability.
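As a rough illustration of the weakly supervised frame-level concept detection described above, the following minimal sketch trains per-frame concept scores from video-level labels only, in the multiple-instance learning (MIL) style. It is not the authors' implementation. The max-pooling aggregation over frames, the additive feature enhancement, and all names (`mil_concept_loss`, `enhance_features`, `concept_emb`) are assumptions made for illustration; the abstract does not specify these details.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def mil_concept_loss(frame_logits, video_labels, eps=1e-7):
    """MIL-style loss: supervise frame-level scores with video-level labels.

    frame_logits: (T, C) concept scores for T frames and C concepts.
    video_labels: (C,) binary vector, 1 if the concept occurs in the video.
    Returns the mean binary cross-entropy and the (T, C) frame probabilities.
    """
    frame_probs = sigmoid(frame_logits)
    # Assumption: a concept is present in the video if it is present in at
    # least one frame, so frame scores are aggregated by max over time.
    video_probs = np.clip(frame_probs.max(axis=0), eps, 1.0 - eps)
    bce = -(video_labels * np.log(video_probs)
            + (1.0 - video_labels) * np.log(1.0 - video_probs))
    return bce.mean(), frame_probs


def enhance_features(frame_feats, frame_probs, concept_emb):
    """Add a probability-weighted sum of concept embeddings to each frame.

    frame_feats: (T, D), frame_probs: (T, C), concept_emb: (C, D).
    Assumption: additive fusion; other fusion schemes are equally plausible.
    """
    return frame_feats + frame_probs @ concept_emb
```

Because only video-level labels enter the loss, the per-frame probabilities are learned without frame annotations, and their peaks over time can serve as cues for where an event occurs.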
