Exploring Iterative Refinement with Diffusion Models for Video Grounding

Xiao Liang, Tao Shi, Yaoyuan Liang, Te Tao, Shao-Lun Huang

TL;DR

DiffusionVG reframes video grounding as a conditional generation task where the target moment span $z_0=(\tau_s,\tau_e)$ is generated from Gaussian noise through a learned reverse diffusion conditioned on video and text features. A video-centered multi-modal encoder and a span refining decoder enable iterative denoising of multiple span hypotheses, with DDIM-based sampling and a voting strategy to select the final span. Training uses a forward diffusion on the ground-truth span, a cosine noise schedule, and a composite loss combining $L_1$ and IoU terms, while auxiliary decoder losses accelerate convergence. Experiments on Charades-STA, ActivityNet Captions, and TACoS demonstrate state-of-the-art performance and reveal the method's robustness to sampling steps and the number of queries, highlighting diffusion models as a viable tool for temporally grounded video understanding.
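The training-side ingredients named above (forward diffusion on the ground-truth span, a cosine noise schedule, and an $L_1$ plus IoU loss) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: spans are assumed to be normalized to $[0, 1]$, the schedule follows the commonly used cosine form with offset `s=0.008`, and the names `cosine_alpha_bar`, `forward_diffuse`, `temporal_iou`, `span_loss` and the weights `w_l1`, `w_iou` are hypothetical.

```python
import numpy as np

def cosine_alpha_bar(num_steps, s=0.008):
    """Cumulative signal level alpha_bar_t for t = 0..num_steps (cosine schedule).
    The offset s is an assumed default, not a value taken from the paper."""
    t = np.arange(num_steps + 1) / num_steps
    f = np.cos((t + s) / (1.0 + s) * np.pi / 2.0) ** 2
    return f / f[0]

def forward_diffuse(z0, t, alpha_bar, rng):
    """Sample z_t ~ N(sqrt(alpha_bar_t) * z0, (1 - alpha_bar_t) * I)."""
    a = alpha_bar[t]
    noise = rng.standard_normal(z0.shape)
    z_t = np.sqrt(a) * z0 + np.sqrt(1.0 - a) * noise
    return z_t, noise

def temporal_iou(pred, gt):
    """1-D intersection over union of two (start, end) spans."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return float(inter / union) if union > 0 else 0.0

def span_loss(pred, gt, w_l1=1.0, w_iou=1.0):
    """Composite loss: weighted L1 distance plus (1 - IoU).
    The weights are hypothetical placeholders."""
    l1 = float(np.abs(pred - gt).sum())
    return w_l1 * l1 + w_iou * (1.0 - temporal_iou(pred, gt))
```

For example, `forward_diffuse(np.array([0.3, 0.7]), t, cosine_alpha_bar(1000), np.random.default_rng(0))` returns a noised span whose signal fraction shrinks as `t` grows; at `t = 0` the span is returned unchanged.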

Abstract

Video grounding aims to localize the target moment in an untrimmed video corresponding to a given sentence query. Existing methods typically select the best prediction from a set of predefined proposals or directly regress the target span in a single-shot manner, and therefore lack a systematic prediction refinement process. In this paper, we propose DiffusionVG, a novel framework with diffusion models that formulates video grounding as a conditional generation task, where the target span is generated from Gaussian noise inputs and iteratively refined in the reverse diffusion process. During training, DiffusionVG progressively adds noise to the target span with a fixed forward diffusion process and learns to recover the target span in the reverse diffusion process. At inference, DiffusionVG generates the target span from Gaussian noise inputs through the learned reverse diffusion process, conditioned on the video-sentence representations. Without bells and whistles, DiffusionVG demonstrates superior performance compared to existing well-crafted models on the mainstream Charades-STA, ActivityNet Captions and TACoS benchmarks.
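The inference procedure described above can be illustrated with a short sketch: several span hypotheses start as Gaussian noise, are refined by a deterministic DDIM update (eta = 0), and a final span is picked by voting. Several pieces here are assumptions rather than details from the paper: the denoiser is assumed to predict the clean span directly, `toy_denoiser` is a stand-in for the conditioned span refining decoder, and the voting rule (each hypothesis scores the summed IoU with all others) is one plausible choice, not necessarily the authors' rule.

```python
import numpy as np

def cosine_alpha_bar(num_steps, s=0.008):
    """Cosine schedule for alpha_bar_t, t = 0..num_steps (assumed form)."""
    t = np.arange(num_steps + 1) / num_steps
    f = np.cos((t + s) / (1.0 + s) * np.pi / 2.0) ** 2
    return f / f[0]

def temporal_iou(a, b):
    """1-D intersection over union of two (start, end) spans."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return float(inter / union) if union > 0 else 0.0

def ddim_sample(denoiser, num_queries, alpha_bar, steps, rng):
    """Deterministic DDIM sampling (eta = 0) over a strided timestep sequence."""
    T = len(alpha_bar) - 1
    ts = np.linspace(T, 0, steps + 1).round().astype(int)
    z = rng.standard_normal((num_queries, 2))      # pure-noise span hypotheses
    for t, t_prev in zip(ts[:-1], ts[1:]):
        z0_hat = np.clip(denoiser(z, t), 0.0, 1.0)  # predicted clean spans
        a_t, a_prev = alpha_bar[t], alpha_bar[t_prev]
        # Noise implied by the current sample and the clean-span prediction.
        eps_hat = (z - np.sqrt(a_t) * z0_hat) / np.sqrt(max(1.0 - a_t, 1e-12))
        z = np.sqrt(a_prev) * z0_hat + np.sqrt(max(1.0 - a_prev, 0.0)) * eps_hat
    return z

def vote(spans):
    """Pick the span with the largest summed IoU against all other spans
    (an assumed voting rule, used here for illustration)."""
    n = len(spans)
    scores = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if i != j:
                scores[i] += temporal_iou(spans[i], spans[j])
    return spans[int(np.argmax(scores))]

def toy_denoiser(z_t, t):
    """Hypothetical stand-in for the decoder: always predicts the span (0.3, 0.7)."""
    return np.tile(np.array([0.3, 0.7]), (len(z_t), 1))
```

A call such as `vote(ddim_sample(toy_denoiser, 5, cosine_alpha_bar(1000), 5, np.random.default_rng(0)))` runs five hypotheses through five sampling steps. Because the last step lands on `t = 0`, where `alpha_bar` equals 1, the returned samples coincide with the denoiser's final clean-span predictions.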

Paper Structure (16 sections, 4 equations, 8 figures, 8 tables, 2 algorithms)

Figures (8)

  • Figure 1: (a) An illustration of the video grounding task, which aims to locate a target moment semantically corresponding to a sentence query in a video. (b) Pipeline of proposal-based VG methods. (c) Pipeline of proposal-free VG methods. (d) Our proposed DiffusionVG, which formulates VG as a conditional generative task with diffusion models.
  • Figure 2: Overview of DiffusionVG. Taking a video and a sentence query as input, DiffusionVG first extracts features from both modalities and facilitates interaction using a video-centered multi-modal encoder. Subsequently, the predictions of the target span are generated in a span refining decoder conditioned on the encoded multi-modal representations and iteratively refined through the reverse diffusion process. The final prediction of the target span is selected via a voting process.
  • Figure 3: Evaluating the impact of sampling steps and number of queries on model performance and inference speed (5 queries). All experiments are conducted on Charades-STA test set. Black markers denote the default setting.
  • Figure 4: A visualized example of the proposed DiffusionVG on ActivityNet Captions dataset (only one query is adopted).
  • Figure 5: Illustration of the span refining decoder. (Span location embedding is not illustrated.)
  • ...and 3 more figures