Improving Interpretable Embeddings for Ad-hoc Video Search with Generative Captions and Multi-word Concept Bank

Jiaxin Wu; Chong-Wah Ngo; Wing-Kwong Chan

TL;DR

The paper tackles ad-hoc video search (AVS) by addressing two bottlenecks: limited video-text datasets and out-of-vocabulary queries. It introduces three components—a large-scale synthetic pre-training corpus (WebVid-genCap7M), a syntax-driven multi-word concept bank to capture phrase-level relationships, and a systematic study of advanced textual/visual features within an interpretable embedding framework. The integrated approach yields state-of-the-art results on MSRVTT and TRECVid AVS benchmarks, doubling $R@1$ on MSRVTT and delivering xinfAP improvements ranging from $2\%$ to $77\%$, averaging around $20\%$, across eight TRECVid years. These findings demonstrate that combining scalable pre-training, richer concept representations, and modern features significantly enhances both the interpretability and effectiveness of AVS systems.

Abstract

Two mainstream approaches to ad-hoc video search (AVS) are aligning a user query with video clips in a cross-modal latent space and aligning them through semantic concepts. However, the effectiveness of existing approaches is bottlenecked by the small size of available video-text datasets and the low quality of concept banks, which leads to failures on unseen queries and to the out-of-vocabulary problem. This paper addresses these two problems by constructing a new dataset and developing a multi-word concept bank. Specifically, capitalizing on a generative model, we construct a new dataset consisting of 7 million pairs of generated text and video for pre-training. To tackle the out-of-vocabulary problem, we develop a multi-word concept bank based on syntax analysis to enhance the capability of a state-of-the-art interpretable AVS method in modeling relationships between query words. We also study the impact of current advanced features on the method. Experimental results show that integrating these elements doubles the R@1 performance of the AVS method on the MSRVTT dataset and improves the xinfAP on the TRECVid AVS query sets for 2016-2023 (eight years) by margins of 2% to 77%, with an average of about 20%.
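As a reading aid for the R@1 figure quoted above, the following is a minimal sketch of how Recall@K is commonly computed for text-to-video retrieval. It is not the authors' evaluation code: the function name `recall_at_k`, the toy score matrix, and the convention that video i is the single correct match for query i are all assumptions made for illustration.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int = 1) -> float:
    """Fraction of queries whose ground-truth video ranks in the top k.

    Assumption (illustrative only): similarity[i][j] scores query i against
    video j, and video i is the single correct match for query i.
    """
    if not similarity:
        raise ValueError("similarity matrix is empty")
    hits = 0
    for i, row in enumerate(similarity):
        target = row[i]
        # Rank = 1 + number of videos scoring strictly higher than the target.
        # Ties are therefore resolved in the target's favour.
        rank = 1 + sum(1 for score in row if score > target)
        if rank <= k:
            hits += 1
    return hits / len(similarity)


# Hypothetical scores for 3 queries against 3 videos.
toy_scores = [
    [0.9, 0.1, 0.3],  # query 0: correct video ranked 1st
    [0.2, 0.4, 0.8],  # query 1: correct video ranked 2nd
    [0.5, 0.3, 0.7],  # query 2: correct video ranked 1st
]
r1 = recall_at_k(toy_scores, k=1)
r2 = recall_at_k(toy_scores, k=2)
```

Under this definition, "doubling R@1" means that twice as many queries retrieve their correct video at the top rank.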

Paper Structure (21 sections, 4 figures, 5 tables)

Introduction
Related work
Ad-hoc Video Search
Large-scale Video-Text Datasets
Concept Bank Construction
Improved Interpretable Embeddings
Multi-word Concept Bank
Large Video-GenText Dataset for Pre-training
Features Enhancement
Experiments
Experiment Setting
Datasets
Evaluation Metric
Implementation Details
Ablation Studies
...and 6 more sections

Figures (4)

Figure 1: Example video-GenCaption pairs from the WebVid-genCap7M dataset along with the original captions in WebVid2M (Bain et al., 2021).
Figure 2: Cloud figure of phrases in the multi-word concept bank.
Figure 3: Comparison with the state-of-the-art approaches LAFF and RIVRL on query-554: "Find shots of a person holding or operating a tv or movie camera."