ICSVR: Investigating Compositional and Syntactic Understanding in Video Retrieval Models

Avinash Madasu, Vasudev Lal

TL;DR

This paper investigates how video retrieval models comprehend the compositional and syntactic structure of text captions. It introduces a perturbation-based evaluation across MSRVTT, MSVD, and DiDeMo to separately probe objects & attributes, actions, and syntax for 12 competitive VR models, including both video-text pre-trained and CLIP-based approaches. The findings show that objects & attributes provide the strongest signal, actions offer moderate disambiguation, and syntax and word order have limited impact, with CLIP-based methods exhibiting superior syntactic and compositional understanding. These insights illuminate current VR model behavior and guide future design toward more robust, compositionally aware video retrieval systems.

Abstract

Video retrieval (VR) involves retrieving the ground truth video from a video database given a text caption, or vice versa. The two important components of compositionality, objects & attributes and actions, are joined using correct syntax to form a proper text query. These components (objects & attributes, actions, and syntax) each play an important role in distinguishing among videos and retrieving the correct ground truth video. However, the effect of these components on video retrieval performance is unclear. We therefore conduct a systematic study to evaluate the compositional and syntactic understanding of video retrieval models on standard benchmarks such as MSRVTT, MSVD, and DiDeMo. The study covers two categories of video retrieval models: (i) models pre-trained on video-text pairs and fine-tuned on downstream video retrieval datasets (e.g., Frozen-in-Time, Violet, MCQ) and (ii) models that adapt pre-trained image-text representations such as CLIP for video retrieval (e.g., CLIP4Clip, XCLIP, CLIP2Video). Our experiments reveal that actions and syntax play a minor role compared to objects & attributes in video understanding. Moreover, video retrieval models that use pre-trained image-text representations (CLIP) have better syntactic and compositional understanding than models pre-trained on video-text data. The code is available at https://github.com/IntelLabs/multimodal_cognitive_ai/tree/main/ICSVR

Paper Structure (20 sections, 3 figures, 4 tables)


Figures (3)

  • Figure 1: Ablation studies on the role of objects & attributes in video retrieval. The video retrieval models are evaluated on three tasks: Object shift ($Q_{objshift}$), Object replacement ($Q_{objrep}$), and Object partial ($Q_{objpartial}$). Swapping objects has the smallest effect on performance, followed by masking 50% of the objects; the largest drop occurs when objects are randomly replaced. These ablation studies are performed on the MSRVTT (Xu et al., 2016) and MSVD (Chen and Dolan, 2011) datasets.
  • Figure 2: Performance comparison (R@1 score) of video retrieval models on action ablation studies. The VR models are evaluated on captions with negated actions and replaced actions from the MSRVTT (Xu et al., 2016) and MSVD (Chen and Dolan, 2011) datasets respectively. These studies illustrate that VR models have incomplete knowledge of negation and are also immune to action replacement in the captions.
  • Figure 3: Video retrieval performance (R@1) on the word order task. We test the models on original (unchanged) captions, captions with shuffled word order, and captions with reversed word order for the MSRVTT and DiDeMo datasets. We demonstrate that VR models act like bag-of-words models and do not require substantial word order information.
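
The word order perturbations and the R@1 metric named in the captions above can be illustrated with a minimal Python sketch. It is not taken from the authors' repository: the function names, the whitespace tokenization, the example caption, and the toy similarity matrix are all assumptions made for illustration, and the paper's actual perturbation pipeline may differ.

```python
import random


def shuffle_words(caption: str, seed: int = 0) -> str:
    """Return the caption with its words in random order.

    Assumption: words are whitespace-separated tokens. The multiset of
    words is preserved, so a pure bag-of-words model would be unaffected.
    """
    words = caption.split()
    rng = random.Random(seed)  # local RNG so the result is reproducible
    rng.shuffle(words)
    return " ".join(words)


def reverse_words(caption: str) -> str:
    """Return the caption with its word order reversed."""
    return " ".join(reversed(caption.split()))


def recall_at_1(sim: list[list[float]]) -> float:
    """Text-to-video R@1 for a square similarity matrix.

    sim[i][j] is the similarity between caption i and video j, and the
    ground truth video for caption i is assumed to be video i.
    """
    hits = 0
    for i, row in enumerate(sim):
        best = max(range(len(row)), key=row.__getitem__)  # argmax over videos
        if best == i:
            hits += 1
    return hits / len(sim)


if __name__ == "__main__":
    caption = "a man is playing a guitar"  # hypothetical caption
    print(shuffle_words(caption))
    print(reverse_words(caption))

    # Toy similarity scores (made up): caption 0 ranks its own video first,
    # caption 1 does not.
    toy_sim = [[0.9, 0.1], [0.8, 0.2]]
    print(recall_at_1(toy_sim))
```

In a study of this kind, each perturbed caption set would be scored against the same video embeddings, and the R@1 obtained on shuffled or reversed captions would be compared with the R@1 on the original captions. A small gap between them is what the bag-of-words observation in Figure 3 refers to.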