Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding

Yuanhao Xiong, Long Zhao, Boqing Gong, Ming-Hsuan Yang, Florian Schroff, Ting Liu, Cho-Jui Hsieh, Liangzhe Yuan

TL;DR

S-ViLM is a simple yet effective video-language modeling framework that substantially surpasses state-of-the-art methods on four representative downstream tasks: text-video retrieval, video question answering, video action recognition, and temporal action localization.

Abstract

Existing video-language pre-training methods primarily focus on instance-level alignment between video clips and captions via global contrastive learning but neglect rich fine-grained local information in both videos and text, which is important for downstream tasks requiring temporal localization and semantic reasoning. A powerful model is expected to capture region-object correspondences and recognize scene changes in a video clip, reflecting spatial and temporal granularity, respectively. To strengthen the model's understanding of such fine-grained details, we propose a simple yet effective video-language modeling framework, S-ViLM, that exploits the intrinsic structures of these two modalities. It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, which promote learning region-object alignment and temporal-aware features simultaneously. Comprehensive evaluations demonstrate that S-ViLM performs favorably against existing approaches in learning more expressive representations. Specifically, S-ViLM substantially surpasses state-of-the-art methods on four representative downstream tasks: text-video retrieval, video question answering, video action recognition, and temporal action localization.
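
To make "instance-level alignment via global contrastive learning" concrete, here is a minimal sketch of a symmetric InfoNCE-style loss over a batch of paired video and caption embeddings. It is an illustration, not the paper's implementation: the function names, the use of cosine similarity, and the temperature value of 0.07 are all assumptions, and the embeddings are plain Python lists rather than encoder outputs.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_contrastive_loss(video_embs, text_embs, temperature=0.07):
    """Average of video-to-text and text-to-video cross-entropy losses.

    Row i of video_embs is assumed to be paired with row i of text_embs;
    every other caption in the batch acts as a negative for video i.
    The temperature is an assumed hyperparameter, not a value from the paper.
    """
    v = [l2_normalize(e) for e in video_embs]
    t = [l2_normalize(e) for e in text_embs]
    n = len(v)

    # sim[i][j]: temperature-scaled cosine similarity of video i and caption j.
    sim = [
        [sum(a * b for a, b in zip(v[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    # Video-to-text: the matching caption sits on the diagonal.
    v2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n

    # Text-to-video: same computation on the transposed similarity matrix.
    sim_t = [list(col) for col in zip(*sim)]
    t2v = -sum(log_softmax(sim_t[j])[j] for j in range(n)) / n

    return 0.5 * (v2t + t2v)
```

Because the loss only compares whole-clip and whole-caption embeddings, it carries no signal about which region matches which object or where a scene changes, which is the gap the abstract attributes to purely global objectives.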

Paper Structure (23 sections, 7 equations, 5 figures, 13 tables)

This paper contains 23 sections, 7 equations, 5 figures, 13 tables.

Figures (5)

  • Figure 1: Illustration of S-ViLM pre-training. Three proposed training objectives promote structured video-language interaction: (1) temporal grouping learns temporal-aware features by distinguishing whether clips are from background or foreground; (2) spatial grounding focuses on local correspondences between regions and objects; (3) global contrastive learning matches instance-level $\langle \textit{video, caption}\rangle$ pairs.
  • Figure 2: Visualization of S-ViLM. Left: Similarity scores of features derived from the baseline and our method. Right: Attention maps between region and object with spatial grounding.
  • Figure 3: The structure of a grouping block. It is inserted at different layers of the video encoder to update group tokens by merging semantically similar video tokens.
  • Figure 4: Visualization of spatial grounding. The attention feature map of each example is computed from the corresponding regions assigned to the group token which achieves the highest similarity score with respect to the target noun phrase.
  • Figure 5: Visualization of temporal grouping.
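
The selection step described in the Figure 4 caption can be sketched as follows: score every group token against the embedding of the target noun phrase, keep the best-scoring group, and collect the video regions assigned to it. This is a hedged illustration only. The helper names, the choice of cosine similarity as the score, and the hard (one group per region) assignment list are assumptions introduced here, not details confirmed by the captions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_group_for_phrase(group_tokens, phrase_emb):
    """Return (index of the most similar group token, all similarity scores)."""
    scores = [cosine_similarity(g, phrase_emb) for g in group_tokens]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


def regions_of_group(assignments, group_index):
    """Indices of regions whose (assumed hard) assignment is group_index.

    assignments[i] is the group index that region i was merged into.
    """
    return [i for i, g in enumerate(assignments) if g == group_index]


# Hypothetical toy inputs: three group tokens, one noun-phrase embedding,
# and a hard assignment of four video regions to groups.
group_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
phrase_emb = [0.9, 1.0]
assignments = [2, 0, 2, 1]

best, scores = best_group_for_phrase(group_tokens, phrase_emb)
highlighted = regions_of_group(assignments, best)
```

In this toy setup the third group token is closest to the phrase embedding, so the regions assigned to it are the ones a visualization would highlight.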