Towards Universal Video Retrieval: Generalizing Video Embedding via Synthesized Multimodal Pyramid Curriculum

Zhuoning Guo; Mingxin Li; Yanzhao Zhang; Dingkun Long; Pengjun Xie; Xiaowen Chu

TL;DR

This work targets universal video retrieval by addressing misalignment between benchmarks, data, and modeling. It introduces the Universal Video Retrieval Benchmark (UVRB) with 16 datasets and the Universal Video Retrieval Dataset (UVRD) via the V-SynFlow synthesis pipeline, yielding over 1.55 million high-quality cross-domain training pairs. A Modality Pyramid curriculum trains a General Video Embedder (GVE) to achieve strong zero-shot generalization across diverse tasks and domains, outperforming 14 baselines. Key findings reveal that partially relevant retrieval best reflects universality, spatial-temporal representations are largely disentangled, model architecture strongly shapes capabilities, and scaling has diminishing returns for visual perception. The framework offers a practical path toward robust, multi-task, cross-domain video retrieval.

Abstract

The prevailing video retrieval paradigm is structurally misaligned, as narrow benchmarks incentivize correspondingly limited data and single-task training. Therefore, universal capability is suppressed due to the absence of a diagnostic evaluation that defines and demands multi-dimensional generalization. To break this cycle, we introduce a framework built on the co-design of evaluation, data, and modeling. First, we establish the Universal Video Retrieval Benchmark (UVRB), a suite of 16 datasets designed not only to measure performance but also to diagnose critical capability gaps across tasks and domains. Second, guided by UVRB's diagnostics, we introduce a scalable synthesis workflow that generates 1.55 million high-quality pairs to populate the semantic space required for universality. Finally, we devise the Modality Pyramid, a curriculum that trains our General Video Embedder (GVE) by explicitly leveraging the latent interconnections within our diverse data. Extensive experiments show GVE achieves state-of-the-art zero-shot generalization on UVRB. In particular, our analysis reveals that popular benchmarks are poor predictors of general ability and that partially relevant retrieval is a dominant but overlooked scenario. Overall, our co-designed framework provides a practical path to escape the limited scope and advance toward truly universal video retrieval.
