Implicit Location-Caption Alignment via Complementary Masking for Weakly-Supervised Dense Video Captioning

Shiping Ge, Qiang Chen, Zhiwei Jiang, Yafeng Yin, Liu Qin, Ziyao Chen, Qing Gu

TL;DR

The paper tackles weakly-supervised dense video captioning by replacing complex event proposal procedures with an implicit alignment between event locations and captions, achieved through complementary masking. It introduces a dual-mode dense video captioning model and a mask generation module that produces differentiable Gaussian masks, so that the captions generated from the positively and negatively masked video complement each other and together form a complete video description. The approach achieves state-of-the-art results among weakly-supervised methods on ActivityNet and is competitive with fully-supervised methods, especially when paired with strong backbones like CLIP. Because no event-boundary annotations are needed, the method reduces supervision requirements while still localizing and captioning events, which makes it relevant to scalable video understanding and search.

Abstract

Weakly-Supervised Dense Video Captioning (WSDVC) aims to localize and describe all events of interest in a video without requiring annotations of event boundaries. This setting makes it very challenging to determine the temporal location of each event accurately, as the relevant supervision is unavailable. Existing methods rely on explicit alignment constraints between event locations and captions, which involve complex event proposal procedures during both training and inference. To tackle this problem, we propose a novel implicit location-caption alignment paradigm based on complementary masking, which simplifies the complex event proposal and localization process while maintaining effectiveness. Specifically, our model comprises two components: a dual-mode video captioning module and a mask generation module. The dual-mode video captioning module captures global event information and generates descriptive captions, while the mask generation module generates differentiable positive and negative masks for localizing the events. These masks enable the implicit alignment of event locations and captions by ensuring that captions generated from positively and negatively masked videos are complementary, thereby forming a complete video description. In this way, even under weak supervision, the event location and event caption can be aligned implicitly. Extensive experiments on public datasets demonstrate that our method outperforms existing weakly-supervised methods and achieves competitive results compared to fully-supervised methods.
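
To make the idea of differentiable positive and negative masks more concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes an event is described by a normalized center and width, that the positive mask is a Gaussian over frame positions, and that the negative mask is simply one minus the positive mask. The function names (`gaussian_mask`, `complementary_masks`, `apply_mask`) and the way `tau` rescales the width are hypothetical choices made for illustration only.

```python
import math


def gaussian_mask(num_frames, center, width, tau=2.0):
    """Soft positive mask over frames; center and width are fractions of video length."""
    # Assumption: tau divides the predicted width to give the standard deviation,
    # so a larger tau yields a sharper mask. The paper's exact form may differ.
    sigma = max(width / tau, 1e-6)
    mask = []
    for t in range(num_frames):
        position = (t + 0.5) / num_frames  # frame midpoint in [0, 1]
        mask.append(math.exp(-((position - center) ** 2) / (2.0 * sigma ** 2)))
    return mask


def complementary_masks(num_frames, center, width, tau=2.0):
    """Return (positive, negative) masks that sum to 1 at every frame."""
    positive = gaussian_mask(num_frames, center, width, tau)
    negative = [1.0 - m for m in positive]
    return positive, negative


def apply_mask(frame_features, mask):
    """Scale each frame's feature vector by its mask weight."""
    return [[m * x for x in frame] for frame, m in zip(frame_features, mask)]


# Hypothetical event centered mid-video, covering about a quarter of it.
positive, negative = complementary_masks(num_frames=8, center=0.5, width=0.25)
```

Because the mask is a smooth function of `center` and `width`, an automatic differentiation framework could pass gradients from a captioning loss back to the predicted location, which is what lets the captions supervise localization without boundary labels. The positively masked features would feed the captioner for the event itself, and the negatively masked features would feed it for the rest of the video.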

Paper Structure

This paper contains 47 sections, 22 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Comparison of our complementary masking paradigm with previous paradigms for event localization.
  • Figure 2: Illustration of our proposed framework, which consists of two main components: a Dense Video Captioning model for event captioning and a Complementary Mask Generation module for event localization.
  • Figure 3: Detailed illustration of Event Location Prediction and Differentiable Mask Construction.
  • Figure 4: Impact of $\tau$ in the Gaussian mask construction (a-b) and impact of $\gamma$ in the diversity loss (c-d).
  • Figure 5: A qualitative example from ActivityNet Captions.
  • ...and 2 more figures