Generative Video Diffusion for Unseen Novel Semantic Video Moment Retrieval

Dezhao Luo, Shaogang Gong, Jiabo Huang, Hailin Jin, Yang Liu

TL;DR

This work tackles unseen semantic video moment retrieval by removing the need for target-domain videos during training. It introduces Fine-grained Video Editing (FVE), a diffusion-based framework that edits source-domain videos to embody unseen target semantics while preserving subject and background details. The approach combines an instance-preserving diffusion model with a temporal layer and a hybrid data-selection strategy (cross-modal relevance, uni-modal structure, and model-performance disparity) to curate high-quality synthetic training data. Experiments across Charades-STA, QVHighlights, and TaCoS demonstrate improved cross-domain VMR performance and effective action editing, highlighting the method's potential for scalable generalisation to novel concepts in video understanding.

Abstract

Video moment retrieval (VMR) aims to locate the most likely video moment(s) corresponding to a text query in untrimmed videos. Training of existing methods is limited by the lack of diverse and generalisable VMR datasets, which hinders their ability to generalise moment-text associations to queries containing novel semantic concepts (unseen both visually and textually in a training source domain). To generalise to novel semantics, existing methods assume access to paired videos and text sentences from a target domain in addition to the pair-wise training data from the source domain. This is neither practical nor scalable. In this work, we introduce a more generalisable approach that assumes only text sentences describing new semantics are available during model training, without any videos from a target domain having been seen. To that end, we propose a Fine-grained Video Editing framework, termed FVE, that uses generative video diffusion to edit videos of seen source concepts so that they match unseen target sentences containing new concepts. This yields generative hypotheses of unseen video moments corresponding to the novel concepts in the target domain. The fine-grained generative video diffusion retains the original video structure and subject specifics from the source domain while introducing the semantic distinctions of unseen novel vocabulary in the target domain. A critical challenge is how to make this generative fine-grained diffusion process meaningful for optimising VMR, rather than merely synthesising visually pleasing videos. We solve this problem by introducing a hybrid selection mechanism that integrates three quantitative metrics to selectively incorporate synthetic video moments (novel video hypotheses) as enlarged additions to the original source training data, whilst minimising potential ...

Paper Structure

This paper contains 14 sections, 11 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Our instance-preserving action editing model. We first treat the video as a set of images and train an image diffusion model to align a special text token with the instance shared across those frames. We then treat the frames as a sequence, freeze the layers of the image diffusion model, and append a temporal layer to capture video motion.
  • Figure 2: Data generation and hybrid selection. For data generation, we first train the video diffusion model $\phi$ to align moment $m_i$ with a sentence $p_i$, then use an editing prompt $p_e$ to edit the moment into $m_e^i$. The hybrid selection strategy uses a cross-modal relevance score and a uni-modal structure score to select high-quality generations, together with a model performance disparity to select data that is beneficial for VMR training.
  • Figure 3: Qualitative comparisons. The first and last frames of the video are presented.
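
The hybrid selection described in the Figure 2 caption can be illustrated with a minimal sketch. It assumes the three scores have already been computed for each edited moment. The names `EditedMoment` and `hybrid_select`, the threshold values, and the use of simple thresholding are all hypothetical choices for illustration; this section does not specify how the scores are computed or combined.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EditedMoment:
    """One synthetic moment m_e^i with precomputed scores (all hypothetical)."""
    moment_id: str
    relevance: float  # cross-modal relevance: edited moment vs. editing prompt p_e
    structure: float  # uni-modal structure: edited moment vs. source moment m_i
    disparity: float  # model performance disparity measured with a VMR model


def hybrid_select(
    candidates: List[EditedMoment],
    min_relevance: float = 0.5,  # assumed threshold, not from the paper
    min_structure: float = 0.5,  # assumed threshold, not from the paper
    min_disparity: float = 0.0,  # assumed threshold, not from the paper
) -> List[EditedMoment]:
    """Keep high-quality generations, then keep those deemed beneficial for VMR."""
    # Quality filter: the edit must match the prompt and preserve source structure.
    high_quality = [
        c for c in candidates
        if c.relevance >= min_relevance and c.structure >= min_structure
    ]
    # Benefit filter: keep only moments whose disparity exceeds the threshold.
    return [c for c in high_quality if c.disparity > min_disparity]


candidates = [
    EditedMoment("a", relevance=0.8, structure=0.7, disparity=0.2),
    EditedMoment("b", relevance=0.3, structure=0.9, disparity=0.4),   # fails relevance
    EditedMoment("c", relevance=0.9, structure=0.2, disparity=0.5),   # fails structure
    EditedMoment("d", relevance=0.9, structure=0.9, disparity=-0.1),  # fails disparity
]
selected_ids = [c.moment_id for c in hybrid_select(candidates)]
print(selected_ids)  # prints ['a']
```

Applying the two quality filters before the disparity filter mirrors the caption's split between selecting high-quality generations and selecting data that helps VMR training; a weighted combination of the three scores would be an equally plausible reading.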