Unified Static and Dynamic Network: Efficient Temporal Filtering for Video Grounding

Jingjing Hu, Dan Guo, Kun Li, Zhan Si, Xun Yang, Xiaojun Chang, Meng Wang

TL;DR

UniSDNet addresses natural and spoken language video grounding by unifying static cross-modal semantics with dynamic video context. It couples a ResMLP-based Static Semantic Supplement Network with a Dynamic Temporal Filtering Network that uses a diffusive video clip graph and a multi-kernel Temporal Gaussian Filter, followed by 2D moment proposal generation and cosine-based modality alignment. The model is trained with IoU regression and contrastive losses, achieving state-of-the-art results on ActivityNet Captions, TACoS, Charades-STA, ActivityNet Speech, and the newly introduced SLVG datasets, while remaining efficient with a compact parameter footprint. This work demonstrates that a biologically inspired two-stage design can substantially improve cross-modal grounding performance and inference efficiency, and it provides new SLVG datasets to advance the field.

Abstract

Inspired by the activity-silent and persistent activity mechanisms in human visual perception biology, we design a Unified Static and Dynamic Network (UniSDNet) to learn the semantic association between the video and text/audio queries in a cross-modal environment for efficient video grounding. For static modeling, we devise a novel residual structure (ResMLP) to boost the global comprehensive interaction between the video segments and queries, achieving more effective semantic enhancement/supplement. For dynamic modeling, we exploit three characteristics of the persistent activity mechanism in our network design for better video context comprehension. Specifically, we construct a diffusely connected video clip graph on the basis of 2D sparse temporal masking to reflect the "short-term effect" relationship. We consider temporal distance and relevance as joint "auxiliary evidence clues" and design a multi-kernel Temporal Gaussian Filter to expand the context clue into a high-dimensional space, simulating "complex visual perception". We then conduct element-level filtering convolution operations on neighbour clip nodes in the message-passing stage, and finally generate and rank the candidate proposals. UniSDNet is applicable to both Natural Language Video Grounding (NLVG) and Spoken Language Video Grounding (SLVG) tasks. It achieves SOTA performance on three widely used datasets for NLVG, as well as three datasets for SLVG, e.g., reporting new records of 38.88% R@1,IoU@0.7 on ActivityNet Captions and 40.26% R@1,IoU@0.5 on TACoS. To facilitate this field, we collect two new datasets (Charades-STA Speech and TACoS Speech) for the SLVG task. Meanwhile, the inference speed of UniSDNet is 1.56$\times$ faster than the strong multi-query benchmark. Code is available at: https://github.com/xian-sh/UniSDNet.
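
The R@1,IoU@m figures quoted in the abstract can be read as follows: for each query, the top-ranked predicted moment counts as a hit if its temporal IoU with the ground-truth moment is at least m, and the metric is the percentage of hits. The snippet below is a minimal sketch of that standard definition, not the authors' evaluation script; the function names and the toy intervals are illustrative assumptions.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds


def temporal_iou(pred: Interval, gt: Interval) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1_preds: List[Interval], gts: List[Interval],
                threshold: float) -> float:
    """Percentage of queries whose top-ranked moment reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold
               for p, g in zip(top1_preds, gts))
    return 100.0 * hits / len(gts)


# Toy intervals, made up for illustration (not taken from any dataset).
preds = [(2.0, 10.0), (0.0, 5.0), (12.0, 20.0)]
gts = [(3.0, 10.0), (4.0, 9.0), (12.0, 18.0)]

for m in (0.5, 0.7):
    print(f"R@1,IoU@{m}: {recall_at_1(preds, gts, m):.2f}%")
```

A stricter threshold such as IoU@0.7 rewards tight boundary localization, which is why it is usually reported alongside IoU@0.5.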

Paper Structure

This paper contains 45 sections, 11 equations, 19 figures, 16 tables.

Figures (19)

  • Figure 1: A schematic illustration of the biology behind how people understand the events of a video when solving video grounding tasks. First, according to the theory of GNW (Global Neuronal Workspace) [deco2021revisiting], the brain engages in static multimodal information association to achieve semantic complements between modalities. Then the focus shifts to the dynamic perception of the video content along the timeline, during which three characteristics are expressed: 1) Short-term Effect: the most recent perceptions have a high impact on the present; 2) Relevance Clues: semantically relevant scenes provide clues to help understand the current scene; 3) Perception Complexity: visual perception is high-dimensional and non-linear [barbosa2020interplay].
  • Figure 2: An illustrative example of the video grounding task (query: text or audio). The video is described by four queries (events), each with its own semantic context and temporal dependencies. Other queries can provide a global context (antecedents and consequences) for the current query (e.g., query $Q4$). In addition, similar historical scenarios (such as the one in the blue dashed box) help to discover relevant event clues (time and semantic clues) for understanding the current scenario (blue solid box).
  • Figure 3: The architecture of the Unified Static and Dynamic Network (UniSDNet). It mainly consists of a static and a dynamic network: the Static Semantic Supplement Network (S$^3$Net) and the Dynamic Temporal Filtering Network (DTFNet). S$^3$Net concatenates video clips and multiple queries into a sequence and encodes them through a lightweight single-stream ResMLP network. DTFNet is a 2-layer graph network with a dynamic Gaussian filtering convolution mechanism, which controls message passing between nodes by using temporal distance and semantic relevance as the Gaussian filtering clues when updating node features. The role of the 2D temporal map is to retain possible candidate proposals and represent them by aggregating the features of each proposal moment. Finally, we perform semantic matching between the queries and proposals and rank the best ones as the predictions.
  • Figure 4: The process of (a) node message aggregation in the Dynamic Temporal Filtering graph and (b) the dynamic filter-generator $Filter$, which is built on the joint clue of the relevance weight $a_{ij}$ and the relative temporal distance $r_{ij}$ between two nodes. This joint clue is expanded into a high-dimensional representation through a multi-kernel Gaussian radial basis function.
  • Figure 5: Statistics on the number of queries per video in the training sets of the NLVG and SLVG datasets (1k = 1,000). The datasets can be divided into three categories: large query size (TACoS & TACoS Speech, most sizes are 110), middle query size (ActivityNet Captions & ActivityNet Speech, most sizes are 3), and small query size (Charades-STA & Charades-STA Speech, most sizes are 1; the query descriptions are often ambiguous and semantically insufficient because the videos, mostly 30s long, are too short for manually annotating events).
  • ...and 14 more figures
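
The Figure 4 caption describes a filter built by expanding a joint clue of relevance $a_{ij}$ and temporal distance $r_{ij}$ through a multi-kernel Gaussian radial basis function, then using it to filter neighbour messages element by element. The sketch below illustrates that idea only and is not the authors' implementation. In particular, the way the two clues are merged into one scalar (`joint_clue`), the evenly spaced kernel centers, the shared kernel width, and the choice of one kernel per feature channel are all assumptions made for this example.

```python
import math
from typing import List


def joint_clue(relevance: float, distance: float) -> float:
    """ASSUMPTION: merge relevance a_ij and distance r_ij into one scalar.

    Small when the neighbour is relevant and temporally close; the source
    does not specify this formula.
    """
    return (1.0 - relevance) * distance


def gaussian_kernel_expand(clue: float, centers: List[float],
                           width: float) -> List[float]:
    """Expand a scalar clue into one Gaussian RBF response per kernel center."""
    return [math.exp(-((clue - mu) ** 2) / (2.0 * width ** 2))
            for mu in centers]


def filter_messages(neighbour_feats: List[List[float]], clues: List[float],
                    centers: List[float], width: float) -> List[float]:
    """Sum neighbour features, each scaled element-wise by its own filter.

    ASSUMPTION: the number of kernels equals the feature dimension, so the
    kernel responses can be used directly as per-channel weights.
    """
    out = [0.0] * len(centers)
    for feat, clue in zip(neighbour_feats, clues):
        weights = gaussian_kernel_expand(clue, centers, width)
        for k, w in enumerate(weights):
            out[k] += w * feat[k]
    return out


# Hypothetical setup: 3 kernels, 3-dimensional clip features.
centers = [0.0, 1.0, 2.0]
width = 1.0
feats = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
clues = [joint_clue(0.9, 1.0),   # relevant, adjacent clip
         joint_clue(0.2, 5.0)]   # weakly relevant, distant clip

for clue in clues:
    print(clue, gaussian_kernel_expand(clue, centers, width))
print(filter_messages(feats, clues, centers, width))
```

With several kernel centers, each channel responds most strongly to a different range of the clue, which is one way to read the caption's "high-dimensional representation": a single scalar is turned into a vector of soft, non-linear gates rather than one fixed edge weight.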