Narrating the Video: Boosting Text-Video Retrieval via Comprehensive Utilization of Frame-Level Captions

Chan Hur, Jeong-hun Hong, Dong-hun Lee, Dabin Kang, Semin Myeong, Sang-hyo Park, Hyeyoung Park

TL;DR

NarVid tackles the cross-modality and temporal challenges of text-to-video retrieval by exploiting captions generated for each video frame, referred to as the narration. It introduces four integrated components: (1) cross-modal feature enhancement via video–narration co-attention and temporal modeling, (2) query-aware adaptive filtering to keep query-relevant content, (3) dual matching signals (query–video and query–narration) with multi-granularity scoring, and (4) a cross-view hard negative loss to improve discriminability. The training objective combines $L_{NCE}$ with the cross-view hard negative loss to learn from both inter- and intra-modality signals, while inference fuses standardized similarity matrices to balance video and narration contributions. Experiments on MSR-VTT, MSVD, VATEX, and DiDeMo show state-of-the-art results, with ablations confirming the effectiveness of each module and the benefit of frame-level narration over video-level captions. The work demonstrates that rich, temporally aligned narration can substantially improve retrieval performance, though it depends on the quality of the caption generator and requires the narration to be pre-computed.
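The inference-time fusion of standardized similarity matrices mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the global z-score (rather than, say, a per-query one), and the `weight` parameter are all assumptions made for illustration. The point it shows is why standardization matters when the two similarity matrices live on different scales.

```python
import statistics


def standardize(matrix):
    """Z-score every entry using the matrix-wide mean and std (an assumed choice)."""
    flat = [x for row in matrix for x in row]
    mean = statistics.fmean(flat)
    std = statistics.pstdev(flat)
    if std == 0:
        # A constant matrix carries no ranking information.
        return [[0.0 for _ in row] for row in matrix]
    return [[(x - mean) / std for x in row] for row in matrix]


def fuse(sim_qv, sim_qn, weight=0.5):
    """Blend query-video and query-narration scores after standardization.

    `weight` is a hypothetical balancing factor, not a value from the paper.
    """
    z_v = standardize(sim_qv)
    z_n = standardize(sim_qn)
    return [
        [weight * a + (1.0 - weight) * b for a, b in zip(row_v, row_n)]
        for row_v, row_n in zip(z_v, z_n)
    ]


def rank_videos(fused_row):
    """Return video indices sorted from best to worst match for one query."""
    return sorted(range(len(fused_row)), key=lambda j: fused_row[j], reverse=True)


# Toy scores: rows are queries, columns are videos. The two matrices are on
# very different scales, which is what standardization compensates for.
sim_qv = [[0.30, 0.28], [0.10, 0.35]]
sim_qn = [[8.0, 2.0], [1.0, 9.0]]
fused = fuse(sim_qv, sim_qn)
```

Without standardization, the narration scores in this toy example would dominate a plain sum simply because their raw range is larger.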

Abstract

In recent text-video retrieval, the use of additional captions from vision-language models has shown promising effects on performance. However, existing models that use additional captions have often struggled to capture the rich semantics, including temporal changes, inherent in the video. In addition, incorrect information produced by generative models can lead to inaccurate retrieval. To address these issues, we propose a new framework, Narrating the Video (NarVid), which strategically leverages the comprehensive information available from frame-level captions, the narration. The proposed NarVid exploits narration in multiple ways: 1) feature enhancement through cross-modal interactions between narration and video, 2) query-aware adaptive filtering to suppress irrelevant or incorrect information, 3) a dual-modal matching score obtained by adding query-video similarity and query-narration similarity, and 4) a hard-negative loss to learn discriminative features from multiple perspectives using the two similarities from different views. Experimental results demonstrate that NarVid achieves state-of-the-art performance on various benchmark datasets.
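As a rough illustration of item 3 and of the contrastive term $L_{NCE}$ named in the TL;DR, the sketch below adds the two similarity matrices and evaluates a standard symmetric InfoNCE loss on a similarity matrix. The function names and the temperature value are assumptions, and the cross-view hard-negative loss is deliberately left out because its exact form is not given in this summary.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def info_nce(sim, temperature=0.05):
    """Symmetric InfoNCE on a square matrix whose diagonal holds matched pairs.

    `temperature` is an assumed hyperparameter, not a value from the paper.
    """
    n = len(sim)
    scaled = [[x / temperature for x in row] for row in sim]
    # Text-to-video direction: each row is a query scored against all videos.
    t2v = -sum(log_softmax(scaled[i])[i] for i in range(n)) / n
    # Video-to-text direction: each column is a video scored against all queries.
    cols = [[scaled[i][j] for i in range(n)] for j in range(n)]
    v2t = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (t2v + v2t)


def dual_score(sim_qv, sim_qn):
    """Element-wise sum of query-video and query-narration similarities."""
    return [[a + b for a, b in zip(rv, rn)] for rv, rn in zip(sim_qv, sim_qn)]


aligned = [[1.0, 0.0], [0.0, 1.0]]   # matched pairs score highest
shuffled = [[0.0, 1.0], [1.0, 0.0]]  # matched pairs score lowest
```

The loss is near zero when matched pairs dominate their rows and columns, and grows when mismatched pairs score higher, which is the behavior a retrieval objective needs.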

Paper Structure

This paper contains 33 sections, 13 equations, 14 figures, 13 tables.

Figures (14)

  • Figure 1: (a) Unlike video-level caption information that summarizes the entire video, (b) the proposed framework utilizes frame-level captions to capture time-varying rich information and efficiently leverage selective information based on specific queries.
  • Figure 2: An overview of the proposed framework, NarVid. The method first generates frame-level captions (narration) for each video. (a) Using the frame-level features of the video and narration, enhanced features are obtained through cross-modal interaction with co-attention and temporal block. (b) These enhanced features are further refined using query-aware adaptive filtering. (c) Then, the query-video and query-narration similarity matrices obtained through the multi-granularity matching are utilized for training and inference. (d) To enhance the discriminative ability of the model, we additionally use a cross-view hard negative loss during training.
  • Figure 3: Effectiveness of various captioners on the MSR-VTT dataset. Only the captioner changes; the rest of the architecture is the same.
  • Figure 4: Text-to-video retrieval results (R@1) on the MSR-VTT test set. For Cap4Video, the video-level caption does not match the words in the query. For NarVid, the narration contains richer information across frames, which leads to the correct retrieval results.
  • Figure 5: Process of coarse-grained and fine-grained matching. In the yellow box, we utilize the results of nucleus filtering for weight values.
  • ...and 9 more figures
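
The Figure 5 caption mentions nucleus filtering as a source of weight values. The sketch below shows one common reading of that idea, namely top-p filtering over per-frame relevance scores; it is an assumption, not the paper's procedure. The function names, the `top_p` threshold, and the renormalization step are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus_weights(scores, top_p=0.9):
    """Keep the smallest set of frames whose probability mass reaches `top_p`.

    Frames outside that set get weight 0; kept frames are renormalized to sum to 1.
    """
    probs = softmax(scores)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    weights = [0.0] * len(probs)
    for i in kept:
        weights[i] = probs[i] / mass
    return weights


# Hypothetical query-to-frame relevance scores for a four-frame video.
weights = nucleus_weights([5.0, 4.0, -3.0, -4.0], top_p=0.9)
```

In this toy case the two low-scoring frames fall outside the nucleus and receive zero weight, so they would not contribute to a weighted matching score.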