Exploring Temporal Event Cues for Dense Video Captioning in Cyclic Co-learning

Zhuyang Xie, Yan Yang, Yankai Yu, Jie Wang, Yongquan Jiang, Xiao Wu

TL;DR

MCCL introduces a dense video captioning framework that fuses video-to-text retrieval, frame-level multi-concept detection, and a cyclic co-learning loop between a deformable transformer generator and a localizer to jointly enhance semantic perception and precise event localization. It employs weakly supervised MIL for frame concepts, a concept contrastive loss, and a cyclic loss that combines semantic matching, location matching, and semantic guidance to mutually reinforce captioning and localization. The method achieves state-of-the-art results on ActivityNet Captions and YouCook2 without extra pretraining, with notable gains in CIDEr and related metrics and enhanced interpretability through explicit concept cues. These contributions demonstrate the value of integrating cross-modal retrieval, temporal concept cues, and cyclic optimization for robust dense video understanding.

Abstract

Dense video captioning aims to detect and describe all events in untrimmed videos. This paper presents a dense video captioning network called Multi-Concept Cyclic Learning (MCCL), which aims to: (1) detect multiple concepts at the frame level, using these concepts to enhance video features and provide temporal event cues; and (2) design cyclic co-learning between the generator and the localizer within the captioning network to promote semantic perception and event localization. Specifically, we perform weakly supervised concept detection for each frame, and the detected concept embeddings are integrated into the video features to provide event cues. Additionally, video-level concept contrastive learning is introduced to obtain more discriminative concept embeddings. In the captioning network, we establish a cyclic co-learning strategy where the generator guides the localizer for event localization through semantic matching, while the localizer enhances the generator's event semantic perception through location matching, making semantic perception and event localization mutually beneficial. MCCL achieves state-of-the-art performance on the ActivityNet Captions and YouCook2 datasets. Extensive experiments demonstrate its effectiveness and interpretability.
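The weakly supervised frame-level concept detection described above can be illustrated with a small multiple instance learning (MIL) sketch. This is not the authors' implementation. It assumes that only video-level concept labels are available (for example, derived from caption words), that frame scores are aggregated into a video score by max pooling, and that the loss is binary cross-entropy. The function names `video_level_probs`, `mil_bce_loss`, and `frame_concepts` are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def video_level_probs(frame_logits):
    """Aggregate per-frame concept logits (T frames x C concepts) into
    video-level probabilities. Max pooling over frames is an assumption:
    a concept counts as present in the video if any frame scores high."""
    num_concepts = len(frame_logits[0])
    return [
        max(sigmoid(frame[c]) for frame in frame_logits)
        for c in range(num_concepts)
    ]


def mil_bce_loss(frame_logits, video_labels, eps=1e-7):
    """Binary cross-entropy between pooled video-level probabilities and
    video-level 0/1 labels. No frame-level labels are used anywhere."""
    probs = video_level_probs(frame_logits)
    total = 0.0
    for p, y in zip(probs, video_labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp for numerical safety
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(video_labels)


def frame_concepts(frame_logits, threshold=0.5):
    """After training, read concepts off each frame by thresholding."""
    return [
        [c for c, z in enumerate(frame) if sigmoid(z) >= threshold]
        for frame in frame_logits
    ]


# Toy example: 3 frames, 2 concepts (hypothetical logits).
logits = [[2.0, -3.0], [-1.0, -2.0], [0.5, 4.0]]
loss = mil_bce_loss(logits, video_labels=[1, 1])
per_frame = frame_concepts(logits)
```

In this toy example the first concept fires on the first and third frames and the second concept only on the third, which is the kind of temporal cue the abstract refers to: the frames where a concept is detected hint at where an event occurs.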

Paper Structure

This paper contains 37 sections, 16 equations, 8 figures, 12 tables.

Figures (8)

  • Figure 1: (a) The two-stage approach predicts multiple event candidates and eliminates redundancy using non-maximum suppression. (b) Learnable event queries are embedded in the generator (visual transformer), with two prediction heads for localization and captioning. (c) MCCL uses cyclic co-learning, where the generator, localizer, and descriptor enhance performance collaboratively by leveraging the mutual benefits of captioning and localization.
  • Figure 2: Overview of the framework. (a) A pretrained image encoder extracts video features and performs cross-modal retrieval to obtain sentence features. (b) Video-level and frame-level concepts are detected via multiple instance learning. (c) The features are fed into the generator to update event queries. The localizer predicts locations for each query and selects the optimal ones, while the descriptor produces captions based on these optimal queries. The generator and localizer co-learn in a cycle.
  • Figure 3: Cyclic co-learning.
  • Figure 4: Concept guidance for video captioning.
  • Figure 5: Qualitative captioning results. Two examples from ActivityNet Captions. The ground truth and generated captions for each video are shown separately.
  • ...and 3 more figures
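
The redundancy elimination mentioned in the Figure 1(a) caption, non-maximum suppression over event candidates, can be sketched for temporal segments as follows. This is a generic greedy temporal NMS, not code from the paper. The `(start, end)` segment format, the 0.5 IoU threshold, and the function names are assumptions made for illustration.

```python
def temporal_iou(a, b):
    """Intersection over union of two (start, end) segments in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def temporal_nms(segments, scores, iou_threshold=0.5):
    """Greedy NMS: visit candidates from highest to lowest score and keep
    one only if it does not overlap an already kept segment too much.
    Returns the indices of the kept candidates."""
    order = sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(temporal_iou(segments[i], segments[j]) <= iou_threshold
               for j in keep):
            keep.append(i)
    return keep


# Hypothetical candidates: the first two overlap heavily.
candidates = [(0.0, 10.0), (1.0, 11.0), (20.0, 30.0)]
confidences = [0.9, 0.8, 0.7]
kept = temporal_nms(candidates, confidences)  # indices of surviving events
```

Here the second candidate is suppressed because its IoU with the higher-scoring first candidate is 9/11, which exceeds the threshold. Query-based designs such as the one in Figure 1(b) avoid this post-processing step by predicting a fixed set of events directly.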