DiffusionVMR: Diffusion Model for Joint Video Moment Retrieval and Highlight Detection

Henghao Zhao, Kevin Qinghong Lin, Rui Yan, Zechao Li

TL;DR

DiffusionVMR addresses video moment retrieval and highlight detection jointly by recasting both tasks as conditional denoising generation within a diffusion framework. A cross-modal encoder produces query-guided representations, and two cascaded denoising decoders progressively refine candidate spans and saliency scores, starting from Gaussian noise at inference. Training and inference are decoupled, so inference settings can be chosen freely, and iterative refinement yields robust boundary localization. Empirical results across five benchmarks show consistent improvements over state-of-the-art methods, especially at hard IoU thresholds, and ablations validate the contribution of each component.

Abstract

Video moment retrieval and highlight detection have received attention in the current era of video content proliferation. They aim to localize moments and estimate clip relevance based on user-specific queries. Because video content is continuous in time, temporal events in a video often lack clear boundaries. This boundary ambiguity makes it challenging for a model to learn text-video clip correspondences, which leads to subpar performance of existing methods in predicting target segments. To alleviate this problem, we propose to solve the two tasks jointly from the perspective of denoising generation, so that the target boundary can be localized clearly by iterative refinement from coarse to fine. Specifically, a novel framework, DiffusionVMR, is proposed to redefine the two tasks as a unified conditional denoising generation process built on the diffusion model. During training, Gaussian noise is added to corrupt the ground truth, producing noisy candidates as input, and the model is trained to reverse this noise addition process. In the inference phase, DiffusionVMR starts directly from Gaussian noise and progressively refines the proposals from noise into meaningful output. Notably, DiffusionVMR inherits the advantage of diffusion models that results can be iteratively refined during inference, sharpening the boundary transition from coarse to fine. Furthermore, the training and inference of DiffusionVMR are decoupled: an arbitrary setting can be used during inference without having to match the training phase. Extensive experiments on five widely used benchmarks (QVHighlights, Charades-STA, TACoS, YouTubeHighlights, and TVSum) across two tasks (moment retrieval and/or highlight detection) demonstrate the effectiveness and flexibility of the proposed DiffusionVMR.
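The training-time corruption described in the abstract can be illustrated with the standard DDPM forward process, which samples a noisy candidate directly from the ground truth. This is a minimal sketch and not the authors' implementation: the linear beta schedule, the 1000-step horizon, the normalized (center, width) span encoding, and the names `linear_alpha_bar` and `corrupt_span` are all assumptions made for illustration.

```python
import numpy as np

def linear_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def corrupt_span(x0, t, alpha_bar, rng):
    """Sample x_t ~ N(sqrt(a_t) * x0, (1 - a_t) * I), where a_t = alpha_bar[t]."""
    noise = rng.standard_normal(x0.shape)
    a_t = alpha_bar[t]
    x_t = np.sqrt(a_t) * x0 + np.sqrt(1.0 - a_t) * noise
    return x_t, noise

rng = np.random.default_rng(0)
alpha_bar = linear_alpha_bar()

# Hypothetical ground-truth moment as a normalized (center, width) pair.
gt_span = np.array([0.40, 0.20])

# Noisy span that a denoising decoder would receive at diffusion step t = 500.
x_t, eps = corrupt_span(gt_span, t=500, alpha_bar=alpha_bar, rng=rng)
```

Larger `t` pushes `alpha_bar[t]` toward zero, so the noisy span approaches pure Gaussian noise, which is the same distribution that inference starts from.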

Paper Structure (17 sections, 17 equations, 8 figures, 9 tables)

This paper contains 17 sections, 17 equations, 8 figures, 9 tables.

Figures (8)

  • Figure 1: Video moment retrieval and highlight detection can be viewed as analogous to an image denoising generation problem: a temporal span (saliency score vector) is initialized from noise and progressively refined to cover (highlight) the target moment.
  • Figure 2: Left: Overview of the proposed DiffusionVMR for joint video moment retrieval and highlight detection. Right: The training and inference pipeline for video moment retrieval. The moment retrieval branch comprises a cross-modal encoder and a moment denoising decoder. In the training phase, the ground-truth (GT) moment undergoes a $t$-step diffusion process and is then provided to the denoising decoder as the noisy span. Each layer in the denoising decoder takes the video representation and the noisy span as input and outputs a new span updated by the denoised representation. At inference, the noisy spans are sampled directly from Gaussian noise, and the final output of the decoder is used as the result.
  • Figure 3: The pipeline of the highlight detection branch. In each sampling step, the denoising head directly predicts the distribution of $\bm{\hat{x}}_0$ based on $\bm{\hat{x}}_{t}$. For iterative refinement, $\bm{\hat{x}}_{t-1}$ can be derived from $\bm{\hat{x}}_0$ through Eq. \ref{Diff_infence_diffusion} and serves as the input for the next step.
  • Figure 4: Effect of the number of proposal spans on the QVHighlights val split. Because the training and inference of DiffusionVMR are decoupled, the proposal count can be changed at inference; as it increases, model performance steadily improves.
  • Figure 5: Effect of the number of sampling steps on the QVHighlights val split. S@20 indicates that DiffusionVMR is evaluated with 20 proposal spans under different numbers of sampling steps. In all cases, accuracy increases with the number of refinement steps.
  • ...and 3 more figures
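
The sampling loop outlined in the Figure 3 caption (predict $\bm{\hat{x}}_0$ from $\bm{\hat{x}}_{t}$, then derive $\bm{\hat{x}}_{t-1}$ for the next step) might look like the sketch below. The update equation referenced in the caption is not reproduced in this section, so a deterministic DDIM-style step is assumed here in its place. The schedule, the names `ddim_step` and `sample`, and the toy denoiser that always returns a fixed target are hypothetical stand-ins for the trained decoder.

```python
import numpy as np

def ddim_step(x_t, x0_hat, a_t, a_prev):
    """Deterministic DDIM-style update from x_t to x_{t-1} given a predicted x0 (assumed)."""
    eps_hat = (x_t - np.sqrt(a_t) * x0_hat) / np.sqrt(1.0 - a_t)
    return np.sqrt(a_prev) * x0_hat + np.sqrt(1.0 - a_prev) * eps_hat

def sample(denoiser, shape, alpha_bar, timesteps, rng):
    """Start from Gaussian noise and refine over a descending list of timesteps."""
    x = rng.standard_normal(shape)
    for i, t in enumerate(timesteps):
        x0_hat = denoiser(x, t)  # the network would predict the clean signal here
        # After the last step there is no noise left, so alpha_bar is taken as 1.
        a_prev = alpha_bar[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
        x = ddim_step(x, x0_hat, alpha_bar[t], a_prev)
    return x

# Assumed linear beta schedule with 1000 training steps.
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))

# Hypothetical stand-in for a trained decoder: always predicts the same span.
target = np.array([0.40, 0.20])
def toy_denoiser(x, t):
    return np.broadcast_to(target, x.shape).copy()

rng = np.random.default_rng(0)
# 20 proposals and 4 sampling steps, both chosen freely at inference time.
refined = sample(toy_denoiser, (20, 2), alpha_bar, [999, 749, 499, 249], rng)
```

Because the number of proposals and the list of timesteps are plain arguments to `sample`, they can be varied at inference without retraining, which is the kind of decoupling that Figures 4 and 5 evaluate.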