CLIPRerank: An Extremely Simple Method for Improving Ad-hoc Video Search
Aozhu Chen, Fangming Zhou, Ziyuan Wang, Xirong Li
TL;DR
Ad-hoc Video Search (AVS) often suffers when the content most relevant to a textual query is a short segment within a longer video. CLIPRerank is a lightweight, model-agnostic reranking method built on frame-level CLIP similarities. It computes frame scores $S(q,f_i) = \cos(TE(q), IE(f_i))$, where $TE$ and $IE$ denote the CLIP text and image encoders, aggregates them by max pooling into a video-level score $S(q,v) = \max_i S(q,f_i)$, and combines the result with the initial model score $M(q,v)$ via $S_{re}(q,v) = \alpha M(q,v) + (1-\alpha) S(q,v)$ with $\alpha = 0.4$. Experiments on the TRECVID AVS benchmarks TV16–TV21 show consistent improvements across a wide range of baselines and state-of-the-art models, indicating that fine-grained reranking with a large vision-language model (LVLM) is a valuable plug-in for AVS. The approach is efficient (≈22 ms per query) and also works with alternatives to CLIP such as BLIP-2.
Abstract
Ad-hoc Video Search (AVS) enables users to search for unlabeled video content using on-the-fly textual queries. Current deep learning-based models for AVS are trained to optimize holistic similarity between short videos and their associated descriptions. However, due to the diversity of ad-hoc queries, the part of a video that is truly relevant to a given query may cover only a fraction of its duration, even when the video is short. In such a scenario, the holistic similarity becomes suboptimal. To remedy this issue, we propose CLIPRerank, a fine-grained re-scoring method. We compute cross-modal similarities between the query and individual video frames using a pre-trained CLIP model, and aggregate the multi-frame scores by max pooling. The fine-grained score is then combined with the initial score by a weighted sum, and the search results are reranked accordingly. As such, CLIPRerank is agnostic to the underlying video retrieval model and extremely simple, making it a handy plug-in for boosting AVS. Experiments on the challenging TRECVID AVS benchmarks (from 2016 to 2021) demonstrate the effectiveness of the proposed strategy. CLIPRerank consistently improves the TRECVID top performers and multiple existing models, including SEA, W2VV++, Dual Encoding, Dual Task, LAFF, CLIP2Video, TS2-Net and X-CLIP. Our method also works when BLIP-2 is substituted for CLIP.
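The re-scoring described above can be illustrated with a minimal NumPy sketch. It assumes the query and frame embeddings have already been extracted with a CLIP text and image encoder, and that the initial scores $M(q,v)$ are on a scale comparable to cosine similarity; the summary does not say how the two scores are normalized before mixing. The function names (`frame_level_score`, `rerank`) are hypothetical and chosen for illustration only.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so that a dot product equals cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def frame_level_score(query_emb, frame_embs):
    """S(q, v): max over frames of cos(TE(q), IE(f_i)).

    query_emb:  (d,) text embedding of the query (assumed precomputed).
    frame_embs: (n_frames, d) image embeddings of one video's frames.
    """
    q = l2_normalize(np.asarray(query_emb, dtype=float))
    f = l2_normalize(np.asarray(frame_embs, dtype=float))
    return float(np.max(f @ q))  # max pooling over per-frame cosine scores


def rerank(initial_scores, query_emb, frames_per_video, alpha=0.4):
    """S_re(q, v) = alpha * M(q, v) + (1 - alpha) * S(q, v).

    initial_scores:   M(q, v) from any base retrieval model, one per video.
    frames_per_video: list of (n_frames, d) arrays, aligned with initial_scores.
    Returns video indices sorted by descending combined score, and the scores.
    Assumption: initial_scores need no extra normalization before mixing.
    """
    fine = np.array([frame_level_score(query_emb, f) for f in frames_per_video])
    combined = alpha * np.asarray(initial_scores, dtype=float) + (1.0 - alpha) * fine
    order = np.argsort(-combined)
    return order, combined


# Toy example with 2-D embeddings: video 0 has one frame matching the query
# exactly, while video 1 has a higher initial score but no matching frame.
query = np.array([1.0, 0.0])
videos = [
    np.array([[0.0, 1.0], [1.0, 0.0]]),
    np.array([[0.0, 1.0], [0.0, 2.0]]),
]
order, scores = rerank([0.2, 0.9], query, videos)
```

In this toy case the single matching frame lifts video 0 above video 1 despite its lower initial score, which is the short-relevant-segment situation the method targets.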
