Follow the Saliency: Supervised Saliency for Retrieval-augmented Dense Video Captioning

Seung hee Choi, MinJu Jeon, Hyunwoo Oh, Jihwan Lee, Dong-Jin Kim

Abstract

Existing retrieval-augmented approaches for Dense Video Captioning (DVC) often fail to produce temporal segments aligned with true event boundaries, because they rely on heuristic segmentation strategies that ignore the ground-truth boundaries. The proposed framework, STaRC, overcomes this limitation by supervising frame-level saliency through a highlight detection module. The module is trained on binary labels derived directly from the DVC ground-truth annotations, so no additional annotation is needed. We further use the saliency scores as a unified temporal signal that drives retrieval via saliency-guided segmentation and informs caption generation through explicit Saliency Prompts injected into the decoder. By enforcing saliency-constrained segmentation, our method produces temporally coherent segments that align closely with actual event transitions, leading to more accurate retrieval and contextually grounded caption generation. We conduct comprehensive evaluations on the YouCook2 and ViTT benchmarks, where STaRC achieves state-of-the-art performance on most metrics. Our code is available at https://github.com/ermitaju1/STaRC
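The abstract says the highlight labels are binary and come directly from the DVC event annotations. A minimal sketch of one way such labels could be built is given below. It is not the authors' implementation: the function name `binary_highlight_labels`, the uniform sampling rate `fps`, and the half-open interval convention `[start, end)` are all assumptions made for illustration.

```python
from typing import List, Tuple


def binary_highlight_labels(
    events: List[Tuple[float, float]],
    duration: float,
    fps: float = 1.0,
) -> List[int]:
    """Label each sampled frame 1 if it falls inside any annotated event.

    events   -- (start, end) times in seconds from the DVC annotations
    duration -- video length in seconds
    fps      -- assumed uniform frame sampling rate (hypothetical default)
    """
    num_frames = int(duration * fps)
    labels = []
    for i in range(num_frames):
        t = i / fps  # timestamp of the i-th sampled frame
        # Assumption: an event covers the half-open interval [start, end).
        inside = any(start <= t < end for start, end in events)
        labels.append(1 if inside else 0)
    return labels


# Two annotated events in an 8-second video sampled at 1 frame per second.
labels = binary_highlight_labels([(2.0, 4.0), (6.0, 7.0)], duration=8.0)
print(labels)  # [0, 0, 1, 1, 0, 0, 1, 0]
```

Under these assumptions, frames inside any event are positives and all other frames are negatives, so the labels require nothing beyond the existing segment timestamps.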

Paper Structure (49 sections, 17 equations, 15 figures, 8 tables)

Figures (15)

  • Figure 1: An example of caption retrieval and generation by existing methods (e.g., Sali4Vid [jeon2025sali4vid]) and ours. The misaligned video segments in existing methods lead to less relevant retrieval results. Our supervised saliency scores are well aligned with the ground truth, enabling more accurate retrieval. This alignment also provides the decoder with more precise and contextually appropriate information.
  • Figure 2: Correlation analysis between segment-quality metrics and captioning performance. The x-axis shows three indicators of temporal localization quality (Recall@0.5, Mean IoU, and Matched Segments), which respectively measure the proportion of correctly detected events (IoU $\ge 0.5$), the average overlap with ground truth segments, and the number of predictions aligned with the reference events. As these segment-quality metrics improve from Sali4Vid [jeon2025sali4vid] to ours, downstream DVC performance (e.g., CIDEr, METEOR) also rises consistently, demonstrating a strong positive correlation between accurate event segmentation and caption generation performance.
  • Figure 3: Overview of our STaRC framework for DVC. A SWSA module refines video features, and a highlight detection module is supervised using binary highlight labels derived from existing DVC annotations to predict frame-level saliency scores. The SGSR module then performs OT-based clustering guided by these saliency scores to form coherent retrieval segments. In addition, the SaliP integrates saliency into the decoder’s attention for saliency-aware caption generation. STaRC unifies retrieval and caption generation using saliency signals learned in a supervised manner, ensuring consistent alignment between visual context and semantic description.
  • Figure 4: (a) Effect of anchor count and top-$k$ segment selection. CIDEr scores across different numbers of anchors ($K$) and top-$k$ segment selections, where only the $k$ highest-scoring segments are used for retrieval. Each line represents a different anchor configuration, and $K$ corresponds to the number of segments produced by SGSR. (b-d) Impact of saliency-prompt corruption. These plots show the effect of replacing SaliP with Zero and Gaussian noise prompts. Both corrupted prompts still outperform the baseline (STaRC without SaliP), but Gaussian noise causes a larger drop, emphasizing the importance of accurate saliency cues.
  • Figure 5: A qualitative result from YouCook2 validation set. The arrows indicate the duration of each localized event, and the text below each arrow represents its corresponding caption.
  • ...and 10 more figures
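
The segment-quality indicators named in the Figure 2 caption can be illustrated with a short sketch. The matching rule used here (each ground-truth event is scored by its best-overlapping prediction) is one common convention and an assumption; the paper's exact protocol is not given in this listing, and the helper names `temporal_iou` and `segment_quality` are hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection over union of two 1-D time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def segment_quality(
    preds: List[Segment], gts: List[Segment], threshold: float = 0.5
) -> Tuple[float, float]:
    """Return (recall at the IoU threshold, mean best IoU) over ground-truth events.

    Assumption: each ground-truth event is matched to its best-overlapping
    prediction; other matching schemes would give different numbers.
    """
    if not gts:
        return 0.0, 0.0
    best = [max((temporal_iou(p, g) for p in preds), default=0.0) for g in gts]
    recall = sum(b >= threshold for b in best) / len(gts)
    mean_iou = sum(best) / len(gts)
    return recall, mean_iou


preds = [(0.0, 10.0), (12.0, 20.0)]
gts = [(0.0, 8.0), (15.0, 30.0)]
recall, mean_iou = segment_quality(preds, gts)
print(recall, round(mean_iou, 3))
```

In this toy case the first event is recovered with IoU 0.8 and the second only with IoU 5/18, so half of the events count as detected at the 0.5 threshold.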