Multi-event Video-Text Retrieval

Gengyuan Zhang; Jisen Ren; Jindong Gu; Volker Tresp

TL;DR

MeVTR tackles the realistic scenario in which videos contain multiple events while user texts describe single events, revealing degraded performance of traditional bijective VTR models. The authors propose Me-Retriever, a CLIP-based architecture that represents each video as a bag of key events via K-Medoids clustering on frame embeddings and optimizes a MeVTR loss $L_{\text{MeVTR}} = L_{\text{v2t}} + \alpha L_{\text{t2v}}$ with a dynamic $\alpha$ to balance V2T and T2V learning; it also uses a disjoint softmax treatment to avoid textual feature collapse. Comprehensive experiments on ActivityNet Captions and Charades-Event show that Me-Retriever, particularly with the avg similarity, achieves strong V2T and competitive T2V performance, establishing a robust baseline for MeVTR. The results highlight the importance of multi-event video representations and adaptive loss balancing for practical cross-modal retrieval.
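
To make the weighted sum $L_{\text{v2t}} + \alpha L_{\text{t2v}}$ concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes softmax-based contrastive terms, and it reads the "disjoint softmax treatment" as contrasting each positive text only against the negatives of its video, which is one plausible interpretation. The function name `mevtr_loss`, its arguments, and the fixed `alpha` are hypothetical; the schedule for the dynamic $\alpha$ is not given in this summary and is left out.

```python
import numpy as np

def mevtr_loss(sim, text_to_video, alpha=1.0):
    """Sketch of L_v2t + alpha * L_t2v (assumed contrastive form).

    sim:            (num_videos, num_texts) similarity logits.
    text_to_video:  (num_texts,) index of the video each text describes.
    alpha:          fixed weight here; the dynamic schedule is not modeled.
    """
    num_videos, num_texts = sim.shape
    pos = np.zeros((num_videos, num_texts), dtype=bool)
    pos[text_to_video, np.arange(num_texts)] = True

    # T2V: every text has exactly one positive video, so an ordinary
    # softmax over the video axis is used.
    m0 = sim.max(axis=0, keepdims=True)
    log_p_t2v = (sim - m0) - np.log(np.exp(sim - m0).sum(axis=0, keepdims=True))
    l_t2v = -log_p_t2v[pos].mean()

    # V2T: a video has several positive texts. Assumption: each positive is
    # normalized against itself plus the video's negatives only, so the
    # positives of one video do not compete with each other.
    m1 = sim.max(axis=1, keepdims=True)
    e = np.exp(sim - m1)
    neg_sum = np.where(pos, 0.0, e).sum(axis=1, keepdims=True)
    log_p_v2t = (sim - m1) - np.log(e + neg_sum)
    l_v2t = -log_p_v2t[pos].mean()

    return l_v2t + alpha * l_t2v, l_v2t, l_t2v

# Usage: two videos, three captions; captions 0 and 1 describe video 0.
sim = np.array([[5.0, 4.0, 0.5],
                [0.2, 0.1, 6.0]])
total, l_v2t, l_t2v = mevtr_loss(sim, np.array([0, 0, 1]), alpha=1.0)
print(total, l_v2t, l_t2v)
```

Excluding sibling positives from the denominator means that two captions of the same video are never pushed apart by the V2T term, which is the intuition behind avoiding textual feature collapse under this reading.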

Abstract

Video-Text Retrieval (VTR) is a crucial multi-modal task in an era of massive video-text data on the Internet. A prominent approach to VTR uses a two-stream vision-language architecture that learns a joint representation of video-text pairs. However, these models assume bijective video-text correspondences and neglect a more practical scenario in which video content usually encompasses multiple events, while texts such as user queries or webpage metadata tend to be specific and correspond to single events. This creates a gap between the training objective and real-world applications, which can degrade the performance of earlier models at inference time. In this study, we introduce the Multi-event Video-Text Retrieval (MeVTR) task, a niche scenario of conventional Video-Text Retrieval in which each video contains multiple distinct events. We present a simple model, Me-Retriever, which incorporates a key event video representation and a new MeVTR loss for the MeVTR task. Comprehensive experiments show that this straightforward framework outperforms other models on the Video-to-Text and Text-to-Video tasks, establishing a robust baseline for the MeVTR task. We believe this work serves as a strong foundation for future studies. Code is available at https://github.com/gengyuanmax/MeVTR.
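
The key event video representation can be illustrated with a small NumPy sketch of K-Medoids over frame embeddings followed by an averaged text-to-key-event similarity. This is an illustrative sketch under stated assumptions rather than the released code: cosine distance, a deterministic farthest-first initialization, and the helper names `select_key_events` and `avg_similarity` are all choices made here for the example, and "avg" is assumed to mean the mean over key events.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def select_key_events(frame_emb, k, n_iter=20):
    """Return sorted indices of k medoid frames (cosine-distance K-Medoids)."""
    x = l2_normalize(frame_emb)
    dist = 1.0 - x @ x.T  # pairwise cosine distance between frames

    # Farthest-first initialization (an assumption, chosen for determinism).
    medoids = [0]
    while len(medoids) < k:
        nearest = dist[:, medoids].min(axis=1)
        medoids.append(int(np.argmax(nearest)))
    medoids = np.array(medoids)

    for _ in range(n_iter):
        # Assign each frame to its closest medoid.
        assign = np.argmin(dist[:, medoids], axis=1)
        updated = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(assign == c)
            if members.size:  # an empty cluster keeps its old medoid
                within = dist[np.ix_(members, members)].sum(axis=1)
                updated[c] = members[np.argmin(within)]
        if np.array_equal(updated, medoids):
            break
        medoids = updated
    return np.sort(medoids)

def avg_similarity(text_emb, key_event_emb):
    """Mean cosine similarity between one text and a bag of key events."""
    return float((l2_normalize(key_event_emb) @ l2_normalize(text_emb)).mean())

# Usage: nine toy "frames" forming three visually distinct events.
frames = np.array([[1, 0, 0], [1, 0.1, 0], [1, 0, 0.1],
                   [0, 1, 0], [0.1, 1, 0], [0, 1, 0.1],
                   [0, 0, 1], [0.1, 0, 1], [0, 0.1, 1]], dtype=float)
key_idx = select_key_events(frames, k=3)
score = avg_similarity(np.array([0.0, 1.0, 0.0]), frames[key_idx])
print(key_idx, score)
```

Because medoids are actual frames rather than cluster means, each key event remains a real frame embedding from the visual encoder, which is one reason K-Medoids is a natural fit for this kind of selection.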

Paper Structure (15 sections, 6 equations, 6 figures, 7 tables)

Introduction
Related Work
Problem
Problem Formulation
Evaluation Metrics
Textual Feature Collapse
Method
Key Event Video Representation
MeVTR Loss
Experiments
Experiment Details
Results
Ablation Studies
Limitations
Conclusion

Figures (6)

Figure 1: An example of a multi-event video from ActivityNet [caba2015activitynet]. The video depicts a sequence of unrelated, discontinuous events: "a girl is sitting on the beach" $\rightarrow$ "a young man is practicing tightrope walking" $\rightarrow$ "a scene of sunset by the beach." Each textual caption corresponds to only a fragment of the video. Such short, specific captions are prevalent in everyday video data and constitute a common video-text retrieval scenario.
Figure 2: The overall framework of Me-Retriever. The model adopts the Visual Encoder (VE) and Text Encoder (TE) of CLIP [radford2021learning]. The [CLASS] tokens from the last hidden layer of the Visual Encoder are taken as frame embeddings. A clustering-based Key Event Selection module aggregates similar frames and extracts key events. Each textual caption is fed into the Text Encoder, and its [EOS] token is used as the text embedding. The Similarity Calculator measures the similarity between the key events of any video $v_i$ and any textual caption $t_j$. Each video has multiple corresponding texts as positive samples.
Figure 3: We compare the average cosine similarity between all text pairs of videos with different numbers of events. Me-Retriever generates more diverse text features than CLIP4Clip and avoids textual feature collapse, as discussed in the main text.
Figure 4: We compare the average performance improvement in percent ($\%$) on the Video-to-Text task for subsets of ActivityNet Captions, showing by how much Me-Retriever (avg) outperforms CLIP4Clip (mean) on each subset. Fig. \ref{sfig:ab}: Video-to-Text results for test-S/M/L/XL; Fig. \ref{sfig:ba}: Video-to-Text results for test-E1/E2/E3.
Figure 5: We compare model performance under different weighting strategies in the MeVTR loss: a dynamic weight $\alpha$ versus fixed weights chosen from $\{0.5, 1.0, 1.5, 2.0, 3.0\}$, evaluated by $\text{Recall}@5$ on the Video-to-Text task for ActivityNet Captions. Fig. \ref{sfig1:a}-\ref{sfig1:c} show the $\text{Recall}@5$-Average/One-Hit/All-Hit results, respectively. Compared with a fixed weighting coefficient, the dynamic weight $\alpha$ yields more stable and consistently good results across metrics.
...and 1 more figure
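
The diagnostic behind Figure 3 can be approximated with a few lines of NumPy. The sketch below computes the mean cosine similarity over all pairs of text embeddings, assuming the texts are the captions of a single video; values near 1 would indicate collapsed text features. The function name `mean_pairwise_cosine` is hypothetical, and the exact grouping and averaging used for the figure are not specified in this summary.

```python
import numpy as np

def mean_pairwise_cosine(text_emb):
    """Mean cosine similarity over all distinct pairs of rows.

    text_emb: (num_texts, dim) array with at least two rows.
    """
    x = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sims = x @ x.T
    upper = np.triu_indices(len(x), k=1)  # each unordered pair once
    return float(sims[upper].mean())

# Usage: collapsed captions point the same way; diverse ones do not.
collapsed = np.array([[1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [3.0, 0.0, 0.0]])
diverse = np.eye(3)
print(mean_pairwise_cosine(collapsed), mean_pairwise_cosine(diverse))
```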
