EA-VTR: Event-Aware Video-Text Retrieval

Zongyang Ma, Ziqi Zhang, Yuxin Chen, Zhongang Qi, Chunfeng Yuan, Bing Li, Yingmin Luo, Xu Li, Xiaojuan Qi, Ying Shan, Weiming Hu

TL;DR

This work tackles the lack of explicit event content and temporal transitions in web-scale video-text data by introducing Event Content Augmentation (ECA) and Event Temporal Augmentation (ETA) to enrich pre-training corpora. It then presents EA-VTR, a dual-encoder video-text retriever that learns both frame-level event content (ECL) and event temporal transitions (ETL) using a multi-granularity video encoder with Frame [CLS] tokens, trained via Alternating Iteration Training. Empirically, EA-VTR outperforms prior dual-encoder methods on zero-shot and fine-tuned text-to-video retrieval, and demonstrates superior event understanding across Multi-event Video-Text Retrieval, Video Moment Retrieval, and Test of Time, while maintaining high efficiency relative to joint-encoder models. These results highlight the practical impact of explicit event-aware learning for robust video-text alignment and temporal reasoning.

Abstract

Understanding the content of events occurring in a video and their inherent temporal logic is crucial for video-text retrieval. However, web-crawled pre-training datasets often lack sufficient event information, and the widely adopted video-level cross-modal contrastive learning struggles to capture detailed and complex video-text event alignment. To address these challenges, we make improvements from both the data and model perspectives. For the pre-training data, we supplement the missing event content and event temporal transitions with the proposed event augmentation strategies. Based on the event-augmented data, we construct a novel Event-Aware Video-Text Retrieval model, i.e., EA-VTR, which achieves powerful video-text retrieval ability through superior video event awareness. EA-VTR efficiently encodes frame-level and video-level visual representations simultaneously, enabling detailed event content alignment and complex event temporal cross-modal alignment, ultimately enhancing the comprehensive understanding of video events. Our method not only significantly outperforms existing approaches on multiple datasets for Text-to-Video Retrieval and Video Action Recognition tasks, but also demonstrates superior event content perception on Multi-event Video-Text Retrieval and Video Moment Retrieval tasks, as well as outstanding understanding of event temporal logic on the Test of Time task.
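
To make the "video-level cross-modal contrastive learning" mentioned in the abstract concrete, the following is a minimal sketch of a symmetric contrastive (InfoNCE-style) loss over a batch of paired video and text embeddings. It is an illustration only, not the paper's implementation: the function names, the toy embeddings, and the temperature value of 0.05 are all assumptions made for this example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_contrastive_loss(video_embs, text_embs, temperature=0.05):
    """Average of video-to-text and text-to-video cross-entropy.

    The i-th video and i-th text are treated as the positive pair; every
    other pairing in the batch is a negative. The temperature is an
    assumed value, not one taken from the paper.
    """
    v = [l2_normalize(e) for e in video_embs]
    t = [l2_normalize(e) for e in text_embs]
    n = len(v)
    # sim[i][j]: scaled cosine similarity between video i and text j.
    sim = [[dot(v[i], t[j]) / temperature for j in range(n)] for i in range(n)]
    sim_t = [[sim[i][j] for i in range(n)] for j in range(n)]
    v2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    t2v = -sum(log_softmax(sim_t[j])[j] for j in range(n)) / n
    return 0.5 * (v2t + t2v)


# Toy 2-D embeddings (hypothetical): matched pairs vs. swapped pairs.
aligned = symmetric_contrastive_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
shuffled = symmetric_contrastive_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
```

In this toy setting the loss is close to zero when each video matches its own caption and large when the captions are swapped. A loss of this form compares only one vector per video, which is the limitation the abstract points to when it argues for additional frame-level alignment.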

Paper Structure (32 sections, 8 equations, 7 figures, 13 tables, 1 algorithm)

Figures (7)

  • Figure 1: Examples of missing (a) event content and (b) temporal transitions and corresponding augmentation results. First, the web-crawled video caption in (a) does not contain specific event content. Second, in a video-text pair like (b), the video either lacks event temporal transitions or the caption does not reflect these transitions. Therefore, we propose ECA and ETA to supplement the missing information in both aspects.
  • Figure 2: Overview of the proposed Event Content Augmentation (a) and Event Temporal Augmentation (b), which augment the event information in the pre-training dataset, and of the EA-VTR model, which uses Event Content Learning (c) and Event Temporal Learning (d) to learn from the augmented data.
  • Figure 3: Examples of frame-event alignment: Above are the extracted video frames, while the bottom left shows multiple events occurring at different times in the video, distinguished by different colors (with key event information in the text also colored). The bottom right displays the similarity score curves between the text features of these events and the visual features of video frames.
  • Figure 4: Examples of ECA. At the top of each example are video frames extracted from 4 evenly divided clips of the video (consistent with the input of the video encoder); below them are the event captions generated by the image captioner for each video frame.
  • Figure 5: Illustrations of text-to-video retrieval results for the baseline model and our EA-VTR. The top-3 ranked videos are provided for each text query, and 2 representative frames are chosen to represent each video, where the key parts of the text query are highlighted in red and the ground-truth video for each text query is in the red box.
  • ...and 2 more figures
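
The captions for Figures 3 and 4 describe two operations that can be sketched briefly: taking one frame from each of 4 evenly divided clips, and computing a similarity curve between an event's text feature and the per-frame visual features. The sketch below is a hedged illustration; picking the middle frame of each clip, the helper names, and the toy feature vectors are assumptions, and the paper's actual sampling and encoders may differ.

```python
import math


def sample_clip_frames(num_frames, num_clips=4):
    """Return one frame index per clip, using the middle frame of each clip.

    The video is split into num_clips spans of (near-)equal length. Choosing
    the middle frame is an assumption made for this example.
    """
    if num_frames < num_clips:
        raise ValueError("need at least one frame per clip")
    bounds = [i * num_frames // num_clips for i in range(num_clips + 1)]
    return [(bounds[i] + bounds[i + 1] - 1) // 2 for i in range(num_clips)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def similarity_curve(event_emb, frame_embs):
    """Similarity between one event text feature and each frame feature."""
    return [cosine(event_emb, f) for f in frame_embs]


# A 16-frame video split into 4 clips of 4 frames each.
indices = sample_clip_frames(16)

# Hypothetical 2-D features: the event matches the first frame best.
curve = similarity_curve([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
best_frame = max(range(len(curve)), key=curve.__getitem__)
```

Plotting one such curve per event over the sampled frames would give a picture of the kind Figure 3 describes, with each event's curve peaking on the frames where that event occurs.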