
CLIPRerank: An Extremely Simple Method for Improving Ad-hoc Video Search

Aozhu Chen, Fangming Zhou, Ziyuan Wang, Xirong Li

TL;DR

In Ad-hoc Video Search (AVS), the content most relevant to a textual query is often only a short segment of a video, so a holistic video-level similarity is suboptimal. CLIPRerank is a lightweight, model-agnostic reranking method built on frame-level CLIP similarities. For a query $q$ and each frame $f_i$ of a video $v$, it computes a frame score $S(q,f_i) = \cos(TE(q), IE(f_i))$, where $TE$ and $IE$ denote CLIP's text and image encoders; aggregates the frame scores by max pooling, $S(q,v) = \max_i S(q,f_i)$; and fuses the result with the initial score $M(q,v)$ of the underlying retrieval model via $S_{re}(q,v) = \alpha M(q,v) + (1-\alpha) S(q,v)$ with $\alpha = 0.4$. Experiments on the TRECVID AVS benchmarks TV16–TV21 show consistent improvements over a wide range of baselines and state-of-the-art models, indicating that fine-grained reranking with a large vision-language model (LVLM) is a valuable plug-in for AVS. The approach is efficient (≈22 ms per query) and also works with alternatives to CLIP such as BLIP-2.

Abstract

Ad-hoc Video Search (AVS) enables users to search for unlabeled video content using on-the-fly textual queries. Current deep learning-based models for AVS are trained to optimize holistic similarity between short videos and their associated descriptions. However, due to the diversity of ad-hoc queries, even within a short video, the part that is truly relevant to a given query can be of shorter duration. In such a scenario, the holistic similarity becomes suboptimal. To remedy the issue, we propose in this paper CLIPRerank, a fine-grained re-scoring method. We compute cross-modal similarities between the query and video frames using a pre-trained CLIP model, and aggregate the multi-frame scores by max pooling. The fine-grained score is then combined with the initial score as a weighted sum for search result reranking. As such, CLIPRerank is agnostic to the underlying video retrieval models and extremely simple, making it a handy plug-in for boosting AVS. Experiments on the challenging TRECVID AVS benchmarks (from 2016 to 2021) demonstrate the effectiveness of the proposed strategy. CLIPRerank consistently improves the TRECVID top performers and multiple existing models, including SEA, W2VV++, Dual Encoding, Dual Task, LAFF, CLIP2Video, TS2-Net and X-CLIP. Our method also works when substituting BLIP-2 for CLIP.
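
The re-scoring rule stated in the TL;DR (frame-level cosine similarity, max pooling over frames, then a weighted sum with the initial score) can be illustrated with a minimal Python sketch. This is not the authors' code. It assumes the query and frame embeddings have already been extracted with CLIP's text and image encoders, and that the initial scores $M(q,v)$ are on a scale comparable to cosine similarity (the summary above does not say whether any normalization is applied). The function names and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def frame_level_score(query_emb, frame_embs):
    """S(q, v): max over frames of cos(TE(q), IE(f_i))."""
    return max(cosine(query_emb, f) for f in frame_embs)


def rerank(initial_scores, query_emb, frames_per_video, alpha=0.4):
    """Fuse initial scores M(q, v) with frame-level scores S(q, v).

    S_re(q, v) = alpha * M(q, v) + (1 - alpha) * S(q, v)
    Returns (video_id, fused_score) pairs, best first.
    """
    fused = {}
    for video_id, m_score in initial_scores.items():
        s_score = frame_level_score(query_emb, frames_per_video[video_id])
        fused[video_id] = alpha * m_score + (1 - alpha) * s_score
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)


# Toy data (hypothetical): 2-D stand-ins for precomputed CLIP embeddings.
query_emb = [1.0, 0.0]
frames_per_video = {
    "A": [[0.0, 1.0], [1.0, 0.0]],  # one frame matches the query exactly
    "B": [[0.0, 1.0], [0.0, 1.0]],  # no frame matches the query
}
initial_scores = {"A": 0.2, "B": 0.5}  # the initial model ranks B above A

ranked = rerank(initial_scores, query_emb, frames_per_video)
print(ranked)
```

In this toy case the initial model prefers video B, but a single well-matching frame in video A lifts it to the top after fusion. That is the situation max pooling is meant to capture: one relevant frame is enough to raise the video's score.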


Paper Structure

This paper contains 6 sections, 2 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Assessing CLIPRerank in the TRECVID AVS task.
  • Figure 2: Per-query analysis on TV22. Using the same experimental setup as LAFF$^*$, we test on the latest V3C2 test set with the TV22 queries. BLIP-2 is used for re-scoring.
  • Figure 3: Top-10 video search results by LAFF$^*$ and LAFF$^*$ + CLIPRerank, respectively. Queries are selected from TV22.