Multi-scale 2D Temporal Map Diffusion Models for Natural Language Video Localization

Chongzhi Zhang, Mingyuan Zhang, Zhiyang Teng, Jiayi Li, Xizhou Zhu, Lewei Lu, Ziwei Liu, Aixin Sun

TL;DR

The paper tackles Natural Language Video Localization by reframing it as generating a global 2D temporal map conditioned on video and language inputs. It introduces a multi-scale diffusion framework based on DDIM to iteratively denoise a 2D score map, using a specialized, condition-injected decoder and a multimodal feature encoder to fuse video and text cues. Key contributions include (i) a 2D temporal map representation with multi-scale maps, (ii) a diffusion-based generation objective trained with MSE on full 2D maps, (iii) a time-aware stylization mechanism for progressive denoising, and (iv) extensive ablations showing that concatenation-based conditioning and full time-information interaction yield the best performance. Empirically, the approach achieves state-of-the-art or competitive results on Charades-STA and DiDeMo, illustrating the viability and benefits of diffusion models for multimodal understanding tasks and providing a new paradigm for NLVL with strong temporal modeling capabilities.

Abstract

Natural Language Video Localization (NLVL), grounding phrases from natural language descriptions to corresponding video segments, is a complex yet critical task in video understanding. Despite ongoing advancements, many existing solutions lack the capability to globally capture temporal dynamics of the video data. In this study, we present a novel approach to NLVL that aims to address this issue. Our method involves the direct generation of a global 2D temporal map via a conditional denoising diffusion process, based on the input video and language query. The main challenges are the inherent sparsity and discontinuity of a 2D temporal map in devising the diffusion decoder. To address these challenges, we introduce a multi-scale technique and develop an innovative diffusion decoder. Our approach effectively encapsulates the interaction between the query and video data across various time scales. Experiments on the Charades and DiDeMo datasets underscore the potency of our design.
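
To make the phrase "conditional denoising diffusion process" more concrete, the following is a minimal sketch of the standard forward noising step and one deterministic DDIM reverse step applied to a 2D map. It is not the authors' implementation: the linear noise schedule, the function names (`make_alpha_bars`, `forward_noise`, `ddim_step`, `mse_loss`), and the assumption that the decoder predicts the clean map directly are illustrative choices, and the conditioning on video and language features is omitted entirely.

```python
import numpy as np

def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=2e-2):
    # Assumed linear beta schedule; the schedule used in the paper is not stated here.
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def forward_noise(x0, t, alpha_bars, rng):
    # Standard forward process: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps.
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return x_t, eps

def ddim_step(x_t, x0_pred, t, t_prev, alpha_bars):
    # Deterministic DDIM update (eta = 0) from step t to step t_prev.
    # x0_pred stands in for the output of a hypothetical conditional decoder.
    eps_pred = (x_t - np.sqrt(alpha_bars[t]) * x0_pred) / np.sqrt(1.0 - alpha_bars[t])
    a_prev = alpha_bars[t_prev] if t_prev >= 0 else 1.0  # t_prev = -1 means "fully denoised"
    return np.sqrt(a_prev) * x0_pred + np.sqrt(1.0 - a_prev) * eps_pred

def mse_loss(x0_pred, x0):
    # Mean squared error over the full 2D map.
    return float(np.mean((x0_pred - x0) ** 2))
```

In this sketch, a perfect prediction of the clean map makes a single `ddim_step` to `t_prev = -1` return that map exactly; in practice the prediction would come from a learned decoder conditioned on the video and the query, and several steps would be chained.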

Paper Structure (20 sections, 7 equations, 3 figures, 5 tables)

Figures (3)

  • Figure 1: Illustration of the NLVL task (top part) and 2D temporal map (bottom part). Top: The NLVL model processes a language query and an untrimmed video to locate a temporal moment that semantically corresponds to the query. Bottom: The 2D temporal map plots candidate moments at coordinates $(i, j)$, each starting at $i\tau$ and lasting for $(j+1)\tau$; here $\tau=5\text{s}$ is the time scale. The map is displayed as a heatmap, where the value in each cell indicates the predicted matching score between that candidate moment and the target moment. Note that the "ground truth" in the figure is for illustration purposes only and is not available during model inference.
  • Figure 2: Overview of our proposed multi-scale 2D temporal map diffusion model. (a) Illustration of the forward and reverse processes in our 2D temporal map-based diffusion model. (b) Design of the stylization block utilized for time information interaction. (c) Framework of the proposed model, incorporating a multimodal feature encoder and a condition-injected decoder.
  • Figure 3: Visualizations of predicted 2D maps for two samples (from Charades-STA), generated by the MS-2D-TAN model and our diffusion model. The diffusion model consistently produces 2D maps with a recognizable pattern, despite occasional incorrect predictions.
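
The indexing described in the Figure 1 caption can be sketched in a few lines of Python. This is an illustrative sketch only: cell $(i, j)$ is read as the moment starting at $i\tau$ and lasting $(j+1)\tau$, as the caption states, while the use of temporal IoU as the cell score and the names `temporal_iou` and `build_2d_map` are assumptions made for the example, not details confirmed by the text above.

```python
import numpy as np

def temporal_iou(start_a, end_a, start_b, end_b):
    # Intersection over union of two time intervals, in seconds.
    inter = max(0.0, min(end_a, end_b) - max(start_a, start_b))
    union = max(end_a, end_b) - min(start_a, start_b)
    return inter / union if union > 0 else 0.0

def build_2d_map(num_clips, tau, target_start, target_end):
    # score_map[i, j] scores the candidate starting at i*tau and lasting (j+1)*tau.
    # valid marks the cells whose candidate ends within the video.
    score_map = np.zeros((num_clips, num_clips), dtype=np.float32)
    valid = np.zeros((num_clips, num_clips), dtype=bool)
    for i in range(num_clips):
        for j in range(num_clips - i):  # start index + duration must fit in the video
            start = i * tau
            end = start + (j + 1) * tau
            score_map[i, j] = temporal_iou(start, end, target_start, target_end)
            valid[i, j] = True
    return score_map, valid

# Example: a 20 s video split into 4 clips with tau = 5 s, target moment 5 s to 15 s.
score_map, valid = build_2d_map(num_clips=4, tau=5.0, target_start=5.0, target_end=15.0)
best_i, best_j = np.unravel_index(np.argmax(score_map), score_map.shape)
```

With these inputs the highest-scoring cell is $(1, 1)$, the candidate that starts at 5 s and lasts 10 s. Only the cells where the start index plus the duration fits inside the video are valid, which is one source of the sparsity that the abstract mentions.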