Table of Contents
Fetching ...

WAVER: Writing-style Agnostic Text-Video Retrieval via Distilling Vision-Language Models Through Open-Vocabulary Knowledge

Huy Le, Tung Kieu, Anh Nguyen, Ngan Le

TL;DR

WAVER is introduced, a cross-domain knowledge distillation framework via vision-language models through open-vocabulary knowledge designed to tackle the challenge of handling different writing styles in video descriptions.

Abstract

Text-video retrieval, a prominent sub-field within the domain of multimodal information retrieval, has witnessed remarkable growth in recent years. However, existing methods assume video scenes are consistent with unbiased descriptions. These limitations fail to align with real-world scenarios since descriptions can be influenced by annotator biases, diverse writing styles, and varying textual perspectives. To overcome the aforementioned problems, we introduce $\texttt{WAVER}$, a cross-domain knowledge distillation framework via vision-language models through open-vocabulary knowledge designed to tackle the challenge of handling different writing styles in video descriptions. $\texttt{WAVER}$ capitalizes on the open-vocabulary properties that lie in pre-trained vision-language models and employs an implicit knowledge distillation approach to transfer text-based knowledge from a teacher model to a vision-based student. Empirical studies conducted across four standard benchmark datasets, encompassing various settings, provide compelling evidence that $\texttt{WAVER}$ can achieve state-of-the-art performance in text-video retrieval task while handling writing-style variations. The code is available at: https://github.com/Fsoft-AIC/WAVER

WAVER: Writing-style Agnostic Text-Video Retrieval via Distilling Vision-Language Models Through Open-Vocabulary Knowledge

TL;DR

WAVER is introduced, a cross-domain knowledge distillation framework via vision-language models through open-vocabulary knowledge designed to tackle the challenge of handling different writing styles in video descriptions.

Abstract

Text-video retrieval, a prominent sub-field within the domain of multimodal information retrieval, has witnessed remarkable growth in recent years. However, existing methods assume video scenes are consistent with unbiased descriptions. These limitations fail to align with real-world scenarios since descriptions can be influenced by annotator biases, diverse writing styles, and varying textual perspectives. To overcome the aforementioned problems, we introduce , a cross-domain knowledge distillation framework via vision-language models through open-vocabulary knowledge designed to tackle the challenge of handling different writing styles in video descriptions. capitalizes on the open-vocabulary properties that lie in pre-trained vision-language models and employs an implicit knowledge distillation approach to transfer text-based knowledge from a teacher model to a vision-based student. Empirical studies conducted across four standard benchmark datasets, encompassing various settings, provide compelling evidence that can achieve state-of-the-art performance in text-video retrieval task while handling writing-style variations. The code is available at: https://github.com/Fsoft-AIC/WAVER
Paper Structure (12 sections, 6 equations, 2 figures, 8 tables)

This paper contains 12 sections, 6 equations, 2 figures, 8 tables.

Figures (2)

  • Figure 1: Overview framework of WAVER. We propose a cross-domain KD, in which open-vocab knowledge from pre-trained VLM implicitly acts as the teacher. By distilling text-based knowledge from the large-capacity teacher into a vision-based student model (described in Section \ref{['sec:med']}) to handle writing-style variations.
  • Figure 2: The details of Video Content Dictionary Generator. As described in Section \ref{['sec:VCD']}, we first form a list of all activities extracted from each caption. Then we propose a matching scheme to form a set of video-related vocabularies and select the top-$\kappa$ most relevant vocabularies to create a VCD.