Stay in your Lane: Role Specific Queries with Overlap Suppression Loss for Dense Video Captioning

Seung Hyup Baek, Jimin Lee, Hyeongkeun Lee, Jae Won Cho

Abstract

Dense Video Captioning (DVC) is a challenging multimodal task that involves temporally localizing multiple events within a video and describing them in natural language. While query-based frameworks enable simultaneous, end-to-end localization and captioning, their reliance on shared queries often leads to significant interference between the two tasks, as well as temporal redundancy in localization. In this paper, we propose role-specific queries that separate localization and captioning into independent components, allowing each to learn only its own role. We then employ contrastive alignment to enforce semantic consistency between the corresponding outputs, ensuring coherent behavior across the separated queries. Furthermore, we design a novel suppression mechanism that penalizes mutual temporal overlaps across queries to tackle temporal redundancy, supervising the model to learn distinct, non-overlapping event regions for more precise localization. Additionally, we introduce a lightweight module that captures core event concepts to further enhance the semantic richness of captions through concept-level representations. We demonstrate the effectiveness of our method through extensive experiments on two major DVC benchmarks, YouCook2 and ActivityNet Captions.
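To make the contrastive alignment step more concrete, the following is a minimal sketch of one common form of contrastive loss (an InfoNCE-style objective over cosine similarities). It is an illustration under assumptions, not the paper's formulation: the function names, the one-directional loss, the `temperature` value, and the pairing of the i-th localization query with the i-th caption query are all hypothetical choices made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def contrastive_alignment_loss(loc_feats, cap_feats, temperature=0.1):
    """InfoNCE-style loss (assumed form): localization query i should be
    most similar to caption query i and dissimilar to all other j."""
    n = len(loc_feats)
    if n == 0:
        return 0.0
    loss = 0.0
    for i in range(n):
        logits = [cosine(loc_feats[i], cap_feats[j]) / temperature
                  for j in range(n)]
        # Numerically stable log-sum-exp over all candidate caption queries.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-probability of the matching (i, i) pair.
        loss += log_denominator - logits[i]
    return loss / n
```

In this sketch the loss is small when each localization feature points in the same direction as its paired caption feature, and large when the pairing is shuffled, which is the behavior a consistency constraint between separated queries would need.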

Paper Structure (17 sections, 8 equations, 4 figures, 7 tables)

Figures (4)

  • Figure 1: Comparison of our method with previous methods. (a) Captioning results. The baseline model captures overlapping events, causing redundant captions. In contrast, our model generates distinct, non-overlapping regions for a set caption. (b) Sampled attention weights of decoder queries. The baseline's single query attends indiscriminately. Previous "decomposition" queries show similar attention distributions. In contrast, our method employs two specialized queries: localization queries attend broadly to find boundaries, and caption queries attend densely to key frames.
  • Figure 2: An overview of our proposed ROS-DVC framework. The input video is first fed into the pretrained encoder, and a transformer encoder processes it to generate frame-level features. In the decoding stage, two types of queries are independently initialized and retrieve their role-specific information from the frame-level features. The output localization queries are trained with the Overlap Suppression Loss to minimize mutual overlap and are matched with ground truths via the Hungarian algorithm. Subsequently, the CTCA loss is employed to semantically align the caption queries with their corresponding localization queries. Finally, these processed queries are fed into their respective heads to obtain predictions for the event number, localization timestamps, event captions, and event-level concepts.
  • Figure 3: Mechanism of the Overlap Suppression Loss. This loss discourages overlap among queries. The suppression strength is inversely modulated by the query's IoU with the ground truth (GT). Queries with high GT IoU receive weak suppression, while queries with low GT IoU receive strong suppression.
  • Figure 4: Qualitative results of dense video captioning on YouCook2. We compare the localization and captioning results of the ground truth, the baseline (PDVC), and our model. Each arrow marks the boundaries of a localized event, and its corresponding caption is shown below. While PDVC captures redundant events with the same captions, our model avoids this phenomenon.
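
The mechanism described in the Figure 3 caption (pairwise overlap between queries is penalized, with weaker suppression for queries that already match the ground truth well) can be sketched as follows. This is a hedged illustration rather than the paper's loss: the function names are hypothetical, and both the weighting `1 - IoU` with the best-matching ground-truth segment and the averaging over ordered pairs are assumptions chosen only to reproduce the inverse relationship the caption describes.

```python
def interval_iou(a, b):
    """Temporal IoU of two (start, end) segments."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def overlap_suppression_loss(preds, gts):
    """Mean pairwise overlap penalty between predicted segments.

    Assumption: the suppression weight of a query is 1 minus its best IoU
    with any ground-truth segment, so high GT IoU gives weak suppression
    and low GT IoU gives strong suppression.
    """
    weights = [
        1.0 - max((interval_iou(p, g) for g in gts), default=0.0)
        for p in preds
    ]
    total, pairs = 0.0, 0
    for i in range(len(preds)):
        for j in range(len(preds)):
            if i != j:
                # Query i is penalized for overlapping query j.
                total += weights[i] * interval_iou(preds[i], preds[j])
                pairs += 1
    return total / pairs if pairs else 0.0
```

Under these assumptions, disjoint predictions incur no penalty, a prediction that coincides with a ground-truth segment is left untouched, and a poorly localized prediction that overlaps another query carries the penalty.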