SAIL: Similarity-Aware Guidance and Inter-Caption Augmentation-based Learning for Weakly-Supervised Dense Video Captioning

Ye-Chan Kim; SeungJu Cha; Si-Woo Kim; Minju Jeon; Hyungee Kim; Dong-Jin Kim

TL;DR

SAIL constructs semantically-aware masks through cross-modal alignment and introduces an LLM-based augmentation strategy that generates synthetic captions, providing additional alignment signals to guide more accurate mask generation under sparse annotation settings.

Abstract

Weakly-Supervised Dense Video Captioning aims to localize and describe events in videos while training only on caption annotations, without temporal boundaries. Prior work introduced an implicit supervision paradigm based on Gaussian masking and complementary captioning. However, the existing method focuses merely on generating non-overlapping masks without considering their semantic relationship to the corresponding events, resulting in simplistic, uniformly distributed masks that fail to capture semantically meaningful regions. Moreover, relying solely on ground-truth captions leads to sub-optimal performance due to the inherent sparsity of existing datasets. In this work, we propose SAIL, which constructs semantically-aware masks through cross-modal alignment. Our similarity-aware training objective guides masks to emphasize video regions with high similarity to their corresponding event captions. Furthermore, to guide more accurate mask generation under sparse annotation settings, we introduce an LLM-based augmentation strategy that generates synthetic captions to provide additional alignment signals. These synthetic captions are incorporated through an inter-mask mechanism, providing auxiliary guidance for precise temporal localization without degrading the main objective. Experiments on ActivityNet Captions and YouCook2 demonstrate state-of-the-art performance on both captioning and localization metrics.
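To make the idea of similarity-aware guidance concrete, the sketch below scores a Gaussian temporal mask by how much of its weight falls on frames that resemble the event caption. It is only an illustration under stated assumptions, not the paper's training objective: the function names, the mask-weighted mean of cosine similarities, and the toy 2-D features are all hypothetical choices made for this example.

```python
import math

def gaussian_mask(num_frames, center, width):
    """Soft temporal mask; center and width are in normalized [0, 1] time."""
    positions = [(t + 0.5) / num_frames for t in range(num_frames)]
    return [math.exp(-((p - center) ** 2) / (2.0 * width ** 2)) for p in positions]

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def similarity_guidance_loss(frame_feats, caption_feat, mask):
    """Hypothetical loss: negative mask-weighted mean of frame-caption similarity."""
    sims = [cosine(f, caption_feat) for f in frame_feats]
    weighted = sum(m * s for m, s in zip(mask, sims))
    return -weighted / sum(mask)

# Toy data (assumed): frames 0-3 resemble the caption, frames 4-7 do not.
caption = [1.0, 0.0]
frames = [[1.0, 0.1]] * 4 + [[0.1, 1.0]] * 4

loss_aligned = similarity_guidance_loss(frames, caption, gaussian_mask(8, 0.25, 0.1))
loss_misaligned = similarity_guidance_loss(frames, caption, gaussian_mask(8, 0.75, 0.1))
```

In this toy setup the mask centered on the caption-relevant frames receives the lower loss, which is the direction of pressure the abstract describes: masks are pushed toward regions with high similarity to their event captions.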

Paper Structure (20 sections, 4 equations, 11 figures, 9 tables)

Introduction
Related Work
Dense Video Captioning
Weakly-Supervised Dense Video Captioning
Method
Preliminaries
Similarity-Aware Mask Guide
LLM-Based Caption Augmentation
Experiments
Experimental Settings
Comparison with State-of-the-Art
Ablation Studies
Conclusion
Implementation Details
Inference Details
...and 5 more sections

Figures (11)

Figure 1: (a) Previous work generates masks that simply cover different temporal regions without considering semantic alignment with corresponding events. (b) Our proposed method leverages cross-modal similarity to guide masks toward event-relevant regions and addresses caption sparsity through LLM augmentation.
Figure 2: (a) The fixed mask baseline applies Gaussian masks of equal width, uniformly distributed according to the number of events. It performs comparably to the existing method [ge2025implicit]. (b) An example of annotation sparsity in the dataset, where potential events are missed. (c) In the majority of video samples, events span the entire duration, yet only a small number of events are annotated (red box).
Figure 3: SAIL pipeline. Our method exploits cross-modal similarity to guide mask optimization toward increased alignment with event captions and further enriches supervision through LLM-generated synthetic captions.
Figure 4: Impact of caption density on model performance. Performance decreases consistently as annotation density reduces from 100% to 25% (left), highlighting annotation sparsity as a critical challenge. Densifying supervision through LLM-generated synthetic captions (right) improves performance.
Figure 5: Qualitative results for synthetic captions. Our synthetic captions effectively capture potential intermediate events occurring between consecutive ground-truth annotations.
...and 6 more figures
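
The fixed mask baseline described in the Figure 2(a) caption can be sketched as follows. This is a minimal illustration under stated assumptions: the caption only says that the masks have equal width and are uniformly distributed according to the number of events, so the helper name `fixed_gaussian_masks` and the width rule (`width_scale / num_events`) are hypothetical choices, not values from the paper.

```python
import math

def fixed_gaussian_masks(num_events, num_frames, width_scale=0.5):
    """One equal-width Gaussian per event, with centers evenly spaced over the video."""
    # Assumed width rule: the standard deviation shrinks as the event count grows.
    width = width_scale / num_events
    positions = [(t + 0.5) / num_frames for t in range(num_frames)]
    masks = []
    for k in range(num_events):
        center = (k + 0.5) / num_events  # evenly spaced centers in normalized time
        masks.append([math.exp(-((p - center) ** 2) / (2.0 * width ** 2))
                      for p in positions])
    return masks

masks = fixed_gaussian_masks(num_events=3, num_frames=9)
# Index of the most strongly weighted frame for each event mask.
peaks = [max(range(9), key=lambda t: m[t]) for m in masks]
```

Because these masks depend only on the number of events, they ignore the video content entirely, which is the limitation the caption points to when it notes that the baseline performs comparably to the existing method.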
