Towards Universal Video Retrieval: Generalizing Video Embedding via Synthesized Multimodal Pyramid Curriculum

Zhuoning Guo, Mingxin Li, Yanzhao Zhang, Dingkun Long, Pengjun Xie, Xiaowen Chu

TL;DR

This work targets universal video retrieval by addressing misalignment between benchmarks, data, and modeling. It introduces the Universal Video Retrieval Benchmark (UVRB) with 16 datasets and the Universal Video Retrieval Dataset (UVRD) via the V-SynFlow synthesis pipeline, yielding over 1.55 million high-quality cross-domain training pairs. A Modality Pyramid curriculum trains a General Video Embedder (GVE) to achieve strong zero-shot generalization across diverse tasks and domains, outperforming 14 baselines. Key findings reveal that partially relevant retrieval best reflects universality, spatial-temporal representations are largely disentangled, model architecture strongly shapes capabilities, and scaling has diminishing returns for visual perception. The framework offers a practical path toward robust, multi-task, cross-domain video retrieval.

Abstract

The prevailing video retrieval paradigm is structurally misaligned, as narrow benchmarks incentivize correspondingly limited data and single-task training. Therefore, universal capability is suppressed due to the absence of a diagnostic evaluation that defines and demands multi-dimensional generalization. To break this cycle, we introduce a framework built on the co-design of evaluation, data, and modeling. First, we establish the Universal Video Retrieval Benchmark (UVRB), a suite of 16 datasets designed not only to measure performance but also to diagnose critical capability gaps across tasks and domains. Second, guided by UVRB's diagnostics, we introduce a scalable synthesis workflow that generates 1.55 million high-quality pairs to populate the semantic space required for universality. Finally, we devise the Modality Pyramid, a curriculum that trains our General Video Embedder (GVE) by explicitly leveraging the latent interconnections within our diverse data. Extensive experiments show GVE achieves state-of-the-art zero-shot generalization on UVRB. In particular, our analysis reveals that popular benchmarks are poor predictors of general ability and that partially relevant retrieval is a dominant but overlooked scenario. Overall, our co-designed framework provides a practical path to escape the limited scope and advance toward truly universal video retrieval.

Paper Structure

This paper contains 59 sections, 2 equations, 13 figures, 11 tables.

Figures (13)

  • Figure 1: We propose Universal Video Retrieval (UVR) that retrieves videos with multi-task, cross-domain queries, which can be achieved via benchmark-data-model co-design in this work.
  • Figure 2: Model performance on UVRB for 16 datasets and 9 abilities (3 main tasks and 6 (sub-)domains).
  • Figure 3: V-SynFlow: a multi-stage synthesis workflow for diverse video retrieval data.
  • Figure 4: The architecture of GVE, an MLLM-based embedding model. We only fine-tune the LLM part. GVE takes compositional multimodal elements as input and outputs a high-dimensional vector as an embedding.
  • Figure 5: Modality Pyramid: simpler tasks lay the foundation for specific ones.
  • ...and 8 more figures

Theorems & Definitions (1)

  • Definition 1: Universal Video Retrieval (UVR)