Vision-Language Models Learn Super Images for Efficient Partially Relevant Video Retrieval

Taichi Nishimura; Shota Nakada; Masayoshi Kondo

TL;DR

This work tackles partially relevant video retrieval (PRVR) by replacing dense frame encodings with grid-based super images, reducing the number of visual encodings to $1/N^2$ of the original and enabling large vision-language models (VLMs) to be deployed efficiently. It introduces query-attentive super image retrieval (QASIR) in zero-shot, fine-tuned, and hybrid forms, with adapters and a temporal encoder to align VLMs to PRVR tasks. Zero-shot QASIR demonstrates that VLMs generalize to super images and reveals trade-offs among grid size, resolution, and model scale; fine-tuned and hybrid QASIR further improve performance while controlling computation costs. The approach yields strong recall gains over traditional sampling methods, shows potential for transfer to text-to-video retrieval (T2VR) and captioning, and offers practical, scalable PRVR for long-form video data.

Abstract

In this paper, we propose an efficient and high-performance method for partially relevant video retrieval, which aims to retrieve long videos that contain at least one moment relevant to the input text query. The challenge lies in encoding dense frames using visual backbones. This requires models to handle the increased frames, resulting in significant computation costs for long videos. To mitigate the costs, previous studies use lightweight visual backbones, yielding sub-optimal retrieval performance due to their limited capabilities. However, it is undesirable to simply replace the backbones with high-performance large vision-and-language models (VLMs) due to their low efficiency. To address this dilemma, instead of dense frames, we focus on super images, which are created by rearranging the video frames in an $N \times N$ grid layout. This reduces the number of visual encodings to $\frac{1}{N^2}$ and mitigates the low efficiency of large VLMs. Based on this idea, we make two contributions. First, we explore whether VLMs generalize to super images in a zero-shot setting. To this end, we propose a method called query-attentive super image retrieval (QASIR), which attends to partial moments relevant to the input query. The zero-shot QASIR yields two discoveries: (1) it enables VLMs to generalize to super images and (2) the grid size $N$, image resolution, and VLM size are key trade-off parameters between performance and computation costs. Second, we introduce fine-tuning and hybrid QASIR that combines high- and low-efficiency models to strike a balance between performance and computation costs. This reveals two findings: (1) the fine-tuning QASIR enhances VLMs to learn super images effectively, and (2) the hybrid QASIR minimizes the performance drop of large VLMs while reducing the computation costs.
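The grid construction described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `make_super_images`, the nested-list frame representation, the zero padding of the last grid, and the column-major fill order (one reading of the Figure 2 caption) are all assumptions made for this example.

```python
def make_super_images(frames, n):
    """Tile equally sized frames into n x n grids (hypothetical helper).

    Each frame is an H x W nested list of pixel values. Frames fill each grid
    top-to-bottom first, then left-to-right (assumed order). The last grid is
    padded with blank (zero) frames if the frame count is not a multiple of n*n.
    """
    if not frames:
        return []
    h, w = len(frames[0]), len(frames[0][0])
    per_grid = n * n
    blank = [[0] * w for _ in range(h)]
    super_images = []
    for start in range(0, len(frames), per_grid):
        chunk = frames[start:start + per_grid]
        chunk = chunk + [blank] * (per_grid - len(chunk))
        grid = [[0] * (n * w) for _ in range(n * h)]
        for idx, frame in enumerate(chunk):
            col, row = divmod(idx, n)  # column-major placement
            for y in range(h):
                grid[row * h + y][col * w:(col + 1) * w] = frame[y]
        super_images.append(grid)
    return super_images


# 36 single-pixel frames with a 3 x 3 grid need 36 / 3**2 = 4 encodings.
frames = [[[i]] for i in range(36)]
print(len(make_super_images(frames, 3)))  # 4
```

With $N = 3$, a video sampled at 36 frames is passed through the visual backbone 4 times instead of 36, which is the $\frac{1}{N^2}$ reduction stated above. The cost per encoding depends on the super image resolution, which is one of the trade-off parameters the abstract mentions.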

Paper Structure (24 sections, 4 equations, 10 figures, 8 tables)

Introduction
Related work
Approach
Preliminary: super images
Zero-shot QASIR
Fine-tuning QASIR
Hybrid QASIR of high- and low-efficiency models
Experiments
Experimental settings
Zero-shot evaluation
Fine-tuning evaluation
Comparison to other sampling approaches
Moment-to-video performance
Qualitative results
Are super images effective for other video-language tasks?
...and 9 more sections

Figures (10)

Figure 1: (a) Overview of the PRVR task. (b) Key concept of our approach. Super images are created by rearranging frames in an $N \times N$ grid layout, enabling the model to reduce the number of encodings through visual backbones.
Figure 2: Comparison of original video frames and created super images in different layouts: $2\times2$, $3\times3$, $4\times4$, $5\times5$, and $6\times6$. Note that frames are placed onto the grid in top-to-bottom and left-to-right order.
Figure 3: (a) Overview of zero-shot QASIR. (b) Fine-tuning QASIR. Given a textual query $\hat{\boldsymbol{z}}_t$, the model computes $\hat{\boldsymbol{z}}_t$-weighted super image vectors for positive $\hat{\boldsymbol{z}}_v^+$ and negative $\hat{\boldsymbol{z}}_v^-$ pairs. Then, their cosine similarities $\cos(\hat{\boldsymbol{z}}_v^+,\hat{\boldsymbol{z}}_t)$ and $\cos(\hat{\boldsymbol{z}}_v^-,\hat{\boldsymbol{z}}_t)$ are used for loss calculation.
Figure 4: Recall@$K$ of the high-efficiency model and change in sumR when varying $K$ and $R$. (a) and (b) present the combination of $3\times3$/$2\times2$ and $4\times4$/$2\times2$ models, respectively.
Figure 5: Moment-to-video performance on benchmark datasets. Note that we cannot evaluate the M/V performance of MS-SL (original) on Charades-STA because the original feature files and model weights are currently unavailable.
...and 5 more figures
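
The query-weighted scoring described in the Figure 3 caption can be sketched roughly as below. The caption only states that super image vectors are weighted by the text query and that cosine similarity is then computed; the softmax weighting, the `temperature` parameter, and the function names are assumptions introduced for illustration and may differ from the paper's formulation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def query_attentive_score(text_vec, super_image_vecs, temperature=0.1):
    """Score one video against a query (hypothetical sketch).

    Weights each super image vector by a softmax over its similarity to the
    query (assumed weighting), pools the vectors with those weights, and
    returns the cosine similarity of the pooled vector to the query.
    """
    sims = [cosine(v, text_vec) for v in super_image_vecs]
    peak = max(sims)  # subtract the max for numerical stability
    exps = [math.exp((s - peak) / temperature) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [
        sum(w * v[d] for w, v in zip(weights, super_image_vecs))
        for d in range(len(text_vec))
    ]
    return cosine(pooled, text_vec), weights


query = [1.0, 0.0]
relevant_video = [[1.0, 0.0], [0.0, 1.0]]    # one super image matches the query
irrelevant_video = [[0.0, 1.0], [0.0, 1.0]]  # no super image matches
pos_score, pos_weights = query_attentive_score(query, relevant_video)
neg_score, _ = query_attentive_score(query, irrelevant_video)
print(pos_score > neg_score)  # True
```

Because the weights concentrate on the super images most similar to the query, a video with a single relevant moment can still score highly, which matches the partially relevant setting. In the fine-tuning case, the positive and negative scores would feed a ranking loss; that loss is not reproduced here.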
