Imagine Before Concentration: Diffusion-Guided Registers Enhance Partially Relevant Video Retrieval

Jun Li, Xuhang Lou, Jinpeng Wang, Yuting Wang, Yaowei Wang, Shu-Tao Xia, Bin Chen

Abstract

Partially Relevant Video Retrieval (PRVR) aims to retrieve untrimmed videos based on text queries that describe only partial events. Existing methods suffer from incomplete global contextual perception, struggling with query ambiguity and local noise induced by spurious responses. To address these issues, we propose DreamPRVR, which adopts a coarse-to-fine representation learning paradigm. The model first generates global contextual semantic registers as coarse-grained highlights spanning the entire video and then concentrates on fine-grained similarity optimization for precise cross-modal matching. Concretely, these registers are initialized from a video-centric distribution produced by a probabilistic variational sampler and then iteratively refined via a text-supervised truncated diffusion model. During this process, textual semantic structure learning constructs a well-formed textual latent space, enhancing the reliability of global perception. The registers are then adaptively fused with video tokens through register-augmented Gaussian attention blocks, enabling context-aware feature learning. Extensive experiments show that DreamPRVR outperforms state-of-the-art methods. Code is released at https://github.com/lijun2005/CVPR26-DreamPRVR.
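
For intuition, the coarse-to-fine register generation described above can be pictured with the minimal PyTorch-style sketch below. All module names, shapes, and the truncation length are illustrative assumptions for exposition, not the authors' released implementation.

```python
# Illustrative sketch (assumptions, not the released code): a variational sampler
# proposes initial registers from global video context, and a truncated
# diffusion-style denoiser refines them into the final global registers.
import torch
import torch.nn as nn

class RegisterPipeline(nn.Module):
    def __init__(self, dim=512, num_registers=4, num_steps=4):
        super().__init__()
        self.mu = nn.Linear(dim, dim)        # probabilistic variational sampler (PVS):
        self.logvar = nn.Linear(dim, dim)    # predicts a video-centric Gaussian
        self.denoiser = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.num_registers = num_registers
        self.num_steps = num_steps           # truncated: only a few denoising steps

    def forward(self, video_tokens):         # video_tokens: (B, N, D) frame features
        ctx = video_tokens.mean(dim=1, keepdim=True)             # global video context
        mu, logvar = self.mu(ctx), self.logvar(ctx)
        noise = torch.randn(video_tokens.size(0), self.num_registers,
                            mu.size(-1), device=mu.device)
        r = mu + noise * (0.5 * logvar).exp()                    # initial registers r_T
        for _ in range(self.num_steps):                          # denoise r_T -> r_0
            r = self.denoiser(torch.cat([r, video_tokens], dim=1))[:, :self.num_registers]
        return r                                                 # global registers r_0
```

In the full model, these coarse registers are then fused with the video tokens through register-augmented Gaussian attention blocks before fine-grained text-video similarity optimization.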

Paper Structure

This paper contains 23 sections, 34 equations, 7 figures, 3 tables, and 1 algorithm.

Figures (7)

  • Figure 1: (a) Limited contextual awareness causes spurious local spikes on a globally irrelevant "people boating" video, which incorrectly outscores the ground-truth "demonstrating the accordion through the camera" video despite the latter's greater overall relevance. (b) MIL considers only the closest pair, leading to sparse supervision in which the other pairs remain undertrained. (c) DreamPRVR first imagines global registers through text-supervised diffusion, then concentrates on fine-grained learning, thereby realizing coarse-to-fine cross-modal alignment and jointly optimizing all video tokens to form a coherent embedding space.
  • Figure 2: Overview of DreamPRVR. (a) The query branch produces embedding $\bm{q}$ and samples $\hat{\bm{q}}$ via TPS to supervise register generation. Video embeddings from a pre-trained model are first processed by a lightweight feature encoder and a probabilistic variational sampler (PVS) to produce the initial register ${\bm{r}}_{T}$, which is iteratively denoised via a truncated diffusion model to yield optimal registers ${\bm{r}}_{0}$. ${\bm{r}}_{0}$ subsequently enhance frame- and clip-level representation learning, yielding frame embeddings $\bm{V}_f$ and clip embeddings $\bm{V}_c$. $\bm{q}$ learns a latent semantic structure through $L_{\text{tssl}}$ and computes similarity scores $S_f$ and $S_c$. (b) Textual Perturbation Sampler (TPS) models query uncertainty via controllable perturbations and samples $\hat{\bm{q}}$ without trainable parameters. (c) Textual Semantic Structure Learning $L_{\text{tssl}}$ employs $L_{\text{div}}$ to diversify queries and $L_{\text{qsp}}$ to align queries from the same video while contrasting across videos. (d) The asymmetric attention mask defines two cross-attention patterns, enabling full interactions for video tokens while constraining registers to video-only attention (see the illustrative sketch after this figure list).
  • Figure 3: The influence of the number of registers and the number of diffusion timesteps, with default settings marked in bold.
  • Figure 4: The t-SNE visualization of the learned textual space. Data points of the same color denote queries from the same video.
  • Figure 5: A qualitative case study of retrieval results. The same videos as in Figure 1 are selected for better comparison.
  • ...and 2 more figures
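
As a complement to the Figure 2(d) caption, the sketch below shows one plausible way to build the asymmetric attention mask, assuming registers and video tokens are concatenated into a single sequence and a boolean mask marks blocked positions; the paper's exact masking convention may differ.

```python
# Hypothetical construction of the asymmetric attention mask in Figure 2(d):
# video tokens interact with everything, while registers attend to video tokens only.
import torch

def asymmetric_mask(num_registers: int, num_video: int) -> torch.Tensor:
    total = num_registers + num_video
    mask = torch.zeros(total, total, dtype=torch.bool)   # False = attention allowed
    # rows are queries, columns are keys; sequence layout is [registers | video tokens]
    mask[:num_registers, :num_registers] = True          # block register -> register
    return mask                                          # video rows stay fully open

# Example: pass the mask to a standard attention layer, e.g.
#   attn = torch.nn.MultiheadAttention(512, 8, batch_first=True)
#   out, _ = attn(x, x, x, attn_mask=asymmetric_mask(R, N))
```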