VidVec: Unlocking Video MLLM Embeddings for Video-Text Retrieval

Issar Tzachor, Dvir Samuel, Rami Ben-Ari

TL;DR

VidVec tackles video–text retrieval by extracting embeddings from off-the-shelf multimodal LLMs (MLLMs). It shows that intermediate-layer representations already encode strong cross-modal signals, uses a calibrated MLLM head for zero-shot reranking, and adds a lightweight in-context optimization that maps dense video captions to concise summaries using only text data. The approach combines three components (layer-wise readouts, a calibrated Yes/No scorer, and text-driven alignment via Dual-Softmax) and reports state-of-the-art results on MSR-VTT, MSVD, VATEX, and DiDeMo without any visual fine-tuning. The results point to video–text alignment that is already latent in MLLMs, and suggest future directions in text-centered alignment and efficient prompting for video understanding.

Abstract

Recent studies have adapted generative Multimodal Large Language Models (MLLMs) into embedding extractors for vision tasks, typically through fine-tuning to produce universal representations. However, their performance on video remains inferior to Video Foundation Models (VFMs). In this paper, we focus on leveraging MLLMs for video-text embedding and retrieval. We first conduct a systematic layer-wise analysis, showing that intermediate (pre-trained) MLLM layers already encode substantial task-relevant information. Leveraging this insight, we demonstrate that combining intermediate-layer embeddings with a calibrated MLLM head yields strong zero-shot retrieval performance without any training. Building on these findings, we introduce a lightweight text-based alignment strategy which maps dense video captions to short summaries and enables task-related video-text embedding learning without visual supervision. Remarkably, without any fine-tuning beyond text, our method outperforms current methods, often by a substantial margin, achieving state-of-the-art results across common video retrieval benchmarks.
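
The zero-shot pipeline described above (rank by embedding similarity, then rerank the top candidates with a calibrated Yes/No score) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `pair_logits` callback, and the choice to calibrate by renormalizing the "Yes" and "No" logits against each other are all assumptions made for the example. In practice the embeddings would come from an intermediate MLLM layer and the logits from the MLLM head.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def yes_no_score(yes_logit, no_logit):
    """Probability of "Yes" renormalized over the two answer tokens.

    Assumed form of calibration for this sketch; the paper's exact
    calibration is not specified in this summary.
    """
    m = max(yes_logit, no_logit)  # subtract the max for numerical stability
    e_yes, e_no = np.exp(yes_logit - m), np.exp(no_logit - m)
    return e_yes / (e_yes + e_no)

def retrieve_then_rerank(text_emb, video_embs, pair_logits, top_k=10):
    """Two-stage text-to-video retrieval.

    text_emb:    (d,) query embedding.
    video_embs:  (n, d) gallery embeddings.
    pair_logits: hypothetical callable, video index -> (yes_logit, no_logit)
                 for the (query, video) pair.
    Returns the top_k video indices, reranked by the Yes/No score.
    """
    # Stage 1: cosine similarity between the query and every video.
    sims = l2_normalize(video_embs) @ l2_normalize(text_emb)
    candidates = np.argsort(-sims)[:top_k]
    # Stage 2: pairwise scoring of the shortlisted candidates only.
    scores = np.array([yes_no_score(*pair_logits(int(i))) for i in candidates])
    order = np.argsort(-scores, kind="stable")
    return candidates[order].tolist()
```

Restricting the pairwise scorer to the top-$K$ shortlist keeps the number of MLLM forward passes per query at $K$ rather than the gallery size.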


Paper Structure

This paper contains 24 sections, 4 figures, 11 tables.

Figures (4)

  • Figure 1: An overview of VidVec. (a) Zero-shot retrieval: extract video and text embeddings from an intermediate MLLM layer for initial ranking. (b) Zero-shot reranking: leverage the calibrated MLLM head for pairwise scoring to rerank the top-$K$ candidates. (c) In-context optimization: lightweight model alignment for embedding extraction, using only $\sim$60K text-only pairs, via a text-to-text mapping from dense video captions to short summaries, designed to mirror the video–text inference setup.
  • Figure 2: MSR-VTT Text-to-Video Retrieval Performance (Recall@1): MLLM Embedders vs. Off-the-Shelf Video MLLM
  • Figure 3: Layer-wise Recall@1 on MSR-VTT for zero-shot video embedding extraction. We evaluate embeddings obtained from different layers across several MLLM backbones. While deeper layers generally yield stronger retrieval performance, the optimal results are not achieved at the final layer. Among all evaluated models, VideoLLaMA3-7B attains the best overall performance.
  • Figure 4: Different Optimization Approaches.
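
The layer-wise analysis in Figure 3 amounts to computing Recall@1 separately for the embeddings read out at each layer and comparing the results. A minimal sketch, assuming the embeddings have already been extracted and stacked into arrays of shape `(layers, n, d)` in which text `i` is paired with video `i`; the function names and array layout are illustrative and not taken from the paper:

```python
import numpy as np

def recall_at_1(text_embs, video_embs):
    """Fraction of text queries whose most similar video is the paired one.

    text_embs, video_embs: (n, d) arrays; row i of each forms a matching pair.
    """
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    sims = t @ v.T  # (n, n) cosine similarities, text rows vs. video columns
    top1 = np.argmax(sims, axis=1)
    return float(np.mean(top1 == np.arange(len(t))))

def best_layer(text_by_layer, video_by_layer):
    """Score every layer and return (index of best layer, per-layer Recall@1)."""
    scores = [recall_at_1(t, v) for t, v in zip(text_by_layer, video_by_layer)]
    return int(np.argmax(scores)), scores
```

Under this setup, the observation that the best results are not obtained at the final layer corresponds to `best_layer` returning an index smaller than the last one.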