An Empirical Study of Excitation and Aggregation Design Adaptions in CLIP4Clip for Video-Text Retrieval

Xiaolun Jing, Genke Yang, Jian Chu

TL;DR

This work addresses the limited expressiveness of mean pooling in video-frame aggregation for video-text retrieval. It introduces excitation-and-aggregation designs that recalibrate frame features and learn discriminative frame-weight distributions, including sequential and tight variants to handle temporal and multi-modal interactions. Empirical results across MSR-VTT, ActivityNet, and DiDeMo show consistent gains, with nuanced behavior depending on temporal modules and data volume. The approach offers a scalable, flexible alternative for frame representation aggregation that can be extended to other CLIP4Clip-based models and cross-modal tasks.

Abstract

The CLIP4Clip model, transferred from CLIP, has become the de facto standard for video clip retrieval from frame-level input, triggering a surge of CLIP4Clip-based models in the video-text retrieval domain. In this work, we rethink the inherent limitation of the widely used mean pooling operation in frame feature aggregation and investigate adaptions of the excitation and aggregation design for generating discriminative video representations. We present a novel excitation-and-aggregation design in which (1) the excitation module captures non-mutually-exclusive relationships among frame features and recalibrates the features frame-wise, and (2) the aggregation module learns the exclusiveness used to aggregate frame representations. Similarly, we employ a cascade of a sequential module and the aggregation design to generate discriminative video representations in the sequential type. In addition, we adopt the excitation design in the tight type to obtain representative frame features for multi-modal interaction. The proposed modules are evaluated on three benchmark datasets, MSR-VTT, ActivityNet and DiDeMo, achieving 43.9 R@1 on MSR-VTT, 44.1 R@1 on ActivityNet and 31.0 R@1 on DiDeMo. They outperform the CLIP4Clip results by +1.2% (+0.5%), +4.5% (+1.9%) and +9.5% (+2.7%) relative (absolute) improvements, respectively, demonstrating the superiority of our proposed excitation and aggregation designs. We hope our work will serve as an alternative for frame representation aggregation and facilitate future research.
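
The relative and absolute improvements quoted in the abstract can be cross-checked with a few lines of arithmetic. The sketch below assumes that each CLIP4Clip baseline equals the reported R@1 minus the reported absolute gain; these baselines are implied by the abstract's figures, not quoted from it, and the names `reported` and `relative_gain` are illustrative.

```python
# R@1 and absolute gain per dataset, as stated in the abstract.
reported = {
    "MSR-VTT": (43.9, 0.5),
    "ActivityNet": (44.1, 1.9),
    "DiDeMo": (31.0, 2.7),
}


def relative_gain(score, abs_gain):
    """Relative improvement in percent over the implied baseline."""
    baseline = score - abs_gain  # assumption: baseline = reported - absolute gain
    return 100.0 * abs_gain / baseline


for name, (score, gain) in reported.items():
    print(f"{name}: implied baseline {score - gain:.1f}, "
          f"relative gain +{relative_gain(score, gain):.1f}%")
```

Under this assumption the computed relative gains round to the +1.2%, +4.5% and +9.5% stated in the abstract.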

Paper Structure (43 sections, 14 equations, 16 figures, 15 tables)

Figures (16)

  • Figure 1: Visualization of the semantic correlation discrepancies among distinct frames for a single caption from the MSR-VTT dataset. Green boxes indicate frames that are semantically relevant to the caption, while red boxes mark semantically irrelevant ones. Since not all frames are semantically relevant to the given caption, aggregating frame features through mean pooling regardless of the frame content can be misleading.
  • Figure 2: The pipeline of our proposed method. The model integrates three core components: a video encoder, a text encoder and an improved similarity calculator. The similarity score is obtained from the output of the improved similarity calculator.
  • Figure 3: Overview of the squeeze excitation-and-aggregation module. The extracted frame features are fed into the squeeze excitation module, which enhances attentive frames and suppresses inattentive ones, followed by the squeeze aggregation module to obtain the video representation.
  • Figure 4: Overview of the expansion excitation-and-aggregation module. The extracted frame features are fed into the expansion excitation module, which enhances attentive frames and suppresses inattentive ones, followed by the expansion aggregation module to obtain the video representation.
  • Figure 5: (a) Diagram of the squeeze aggregation module. (b) Diagram of the expansion aggregation module. Both aggregation modules obtain the frame-wise weights through two fully-connected layers with a nonlinear activation function in between, followed by a weighted summation that aggregates the frame features into the video representation.
  • ...and 11 more figures
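
The excitation and aggregation steps described in the captions for Figures 3 to 5 can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the per-frame squeeze (mean over the feature dimension), the ReLU between the two fully-connected layers, the sigmoid gate for excitation and the softmax normalization for aggregation are all assumed here, and every function and parameter name is hypothetical.

```python
import math


def linear(x, w, b):
    """Fully-connected layer; w is a nested list of shape (len(x), len(b))."""
    return [sum(xi * wi for xi, wi in zip(x, col)) + bj
            for col, bj in zip(zip(*w), b)]


def frame_scores(frames, w1, b1, w2, b2):
    """Squeeze each frame to a scalar, then apply FC -> ReLU -> FC."""
    squeezed = [sum(f) / len(f) for f in frames]  # assumed squeeze: feature mean
    hidden = [max(0.0, h) for h in linear(squeezed, w1, b1)]  # assumed ReLU
    return linear(hidden, w2, b2)  # one raw score per frame


def excite(frames, w1, b1, w2, b2):
    """Rescale each frame by an independent sigmoid gate (assumed gate)."""
    scores = frame_scores(frames, w1, b1, w2, b2)
    gates = [1.0 / (1.0 + math.exp(-s)) for s in scores]
    return [[g * x for x in f] for g, f in zip(gates, frames)]


def aggregate(frames, w1, b1, w2, b2):
    """Weighted sum of frames with softmax-normalized weights (assumed)."""
    scores = frame_scores(frames, w1, b1, w2, b2)
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(frames[0])
    video = [sum(wt * f[d] for wt, f in zip(weights, frames))
             for d in range(dim)]
    return video, weights


# Toy input: 2 frames with 2 features each, hidden width 1.
frames = [[1.0, 2.0], [3.0, 4.0]]
w1, b1 = [[0.5], [0.5]], [0.0]
w2, b2 = [[0.0, 0.0]], [0.0, 0.0]
video, weights = aggregate(frames, w1, b1, w2, b2)
```

With the second layer set to zero, as in the toy input, the weights are uniform and the result coincides with mean pooling, so mean pooling is a special case of this weighted aggregation. Nonzero parameters let some frames dominate the video representation.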