Commonsense for Zero-Shot Natural Language Video Localization

Meghana Holla; Ismini Lourentzou

TL;DR

This paper tackles zero-shot natural language video localization (NLVL) with CORONET, a framework whose Commonsense Enhancement Module bridges video content and pseudo-queries using ConceptNet-derived relations. A Graph Convolutional Network-based concept encoder and cross-attention enrich the video and query representations before localization, and training relies on dynamic video moment proposal and pseudo-query generation over raw videos. On Charades-STA and ActivityNet-Captions, CORONET outperforms zero-shot and weakly supervised baselines, with improvements of up to 32.13% across recall thresholds and up to 6.33% in mIoU. Ablations underscore the importance of temporal commonsense relations and modality-specific enrichment. The results indicate that external commonsense knowledge can improve zero-shot NLVL and point toward more robust video-language grounding in open domains.

Abstract

Zero-shot Natural Language Video Localization (NLVL) methods have exhibited promising results in training NLVL models exclusively with raw video data by dynamically generating video segments and pseudo-query annotations. However, existing pseudo-queries often lack grounding in the source video, resulting in unstructured and disjointed content. In this paper, we investigate the effectiveness of commonsense reasoning in zero-shot NLVL. Specifically, we present CORONET, a zero-shot NLVL framework that leverages commonsense to bridge the gap between videos and generated pseudo-queries via a commonsense enhancement module. CORONET employs Graph Convolution Networks (GCN) to encode commonsense information extracted from a knowledge graph, conditioned on the video, and cross-attention mechanisms to enhance the encoded video and pseudo-query representations prior to localization. Through empirical evaluations on two benchmark datasets, we demonstrate that CORONET surpasses both zero-shot and weakly supervised baselines, achieving improvements of up to 32.13% across various recall thresholds and up to 6.33% in mIoU. These results underscore the significance of leveraging commonsense reasoning for zero-shot NLVL.
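
The recall and mIoU figures quoted in the abstract are conventionally computed from the temporal intersection-over-union (IoU) between predicted and ground-truth spans. The following is a minimal sketch of that computation, assuming spans are `(start, end)` pairs in seconds with one prediction per query. The function names and the 0.5 threshold in the usage lines are illustrative assumptions, not taken from the paper.

```python
def temporal_iou(pred, gt):
    """IoU of two temporal spans given as (start, end) in seconds."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_iou(preds, gts, threshold):
    """Fraction of queries whose predicted span reaches the IoU threshold."""
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    return sum(iou >= threshold for iou in ious) / len(ious)


def mean_iou(preds, gts):
    """Average temporal IoU over all queries (mIoU)."""
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    return sum(ious) / len(ious)


# Toy data (hypothetical): two queries, one prediction each.
predictions = [(0.0, 10.0), (2.0, 8.0)]
ground_truth = [(5.0, 15.0), (2.0, 8.0)]
recall_05 = recall_at_iou(predictions, ground_truth, threshold=0.5)
miou = mean_iou(predictions, ground_truth)
```

Recall at a threshold counts a prediction as correct only when its IoU reaches that threshold, so it rewards tight localization, while mIoU averages the raw overlap across all queries.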

Paper Structure (35 sections, 8 equations, 7 figures, 7 tables)

Introduction
Related Work
Natural Language Video Localization (NLVL)
Weakly Supervised and Zero-shot NLVL Methods
Commonsense in Video-Language Tasks
Commonsense for Zero-Shot NLVL
Problem Formulation
Pseudo-supervised Setup
Dynamic Video Moment Proposal ($f_{\text{span}}$)
Pseudo-query Generation ($f_{\text{pq}}$)
Video Encoder
Query Encoder
Commonsense Enhancement Module
Concept Encoder
Commonsense Information
...and 20 more sections

Figures (7)

Figure 1: NLVL tasks under various supervision settings. Color-coded boxes show the expected annotations at each supervision level. Full supervision: temporal video annotations + text queries; weak supervision: text queries; pseudo-supervision: only raw videos, with DVP + DQG; CORONET (ours, right): only raw videos, with DVP + OD and a video-informed commonsense knowledge subgraph.
Figure 2: CORONET consists of a Video Encoder and a Query Encoder, the proposed Commonsense Enhancement, a Cross-modal (video-query) Interaction, and a Temporal Regression module. During training, CORONET utilizes a Dynamic Video Moment Proposal module to extract a video moment span $V_{\text{span}}$ and an off-the-shelf object detector to detect objects (nouns) in $V_{\text{span}}$. During inference, the given natural language query is converted to a simplified query using a part-of-speech tagger.
Figure 3: CORONET Commonsense Enhancement Module (CEM). CEM comprises a concept encoder and an enhancement mechanism that uses the previously encoded concept vectors to update a given input vector (video/query vectors). The concept encoder employs a Graph Convolution Network for encoding the nodes (concepts) of $G_C$.
Figure 4: Qualitative inference results on examples from Charades-STA test data. Video span timestamps predicted by CORONET (purple lines), PSVL (orange lines), and LFVL (blue lines), juxtaposed with ground truth timestamps (green lines).
Figure 5: CORONET performance with enhancement vs. query-concepts concatenation for 300 (left) and 250 (right) seed concept sizes.
...and 2 more figures
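
The Figure 3 caption describes a concept encoder built on a Graph Convolution Network, followed by an enhancement step that updates video or query vectors using the encoded concepts. The sketch below illustrates that general pattern in plain Python: one standard GCN layer with symmetric normalization, then scaled dot-product cross-attention over the concept vectors. The residual addition, the single layer, the toy dimensions, and all function names are assumptions made for illustration; the paper's exact formulation may differ.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalized_adjacency(adj):
    """Add self-loops, then apply the symmetric D^{-1/2} (A + I) D^{-1/2} scaling."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def gcn_layer(adj, feats, weight):
    """One graph convolution: ReLU(A_norm @ feats @ weight)."""
    propagated = matmul(normalized_adjacency(adj), feats)
    return [[max(0.0, v) for v in row] for row in matmul(propagated, weight)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def enhance(inputs, concepts):
    """Each input vector attends over the concept vectors; the attended
    context is added back to the input (residual update, an assumption)."""
    dim = len(concepts[0])
    enhanced = []
    for x in inputs:
        scores = [sum(a * b for a, b in zip(x, c)) / math.sqrt(dim) for c in concepts]
        weights = softmax(scores)
        context = [sum(w * c[k] for w, c in zip(weights, concepts)) for k in range(dim)]
        enhanced.append([xi + ci for xi, ci in zip(x, context)])
    return enhanced


# Toy concept graph (hypothetical): three concepts connected in a chain.
adjacency = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
node_features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
layer_weight = [[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]]
concept_vectors = gcn_layer(adjacency, node_features, layer_weight)

# Two stand-in video (or query) feature vectors with the same dimension as the concepts.
video_vectors = [[0.2, 0.4], [1.0, -0.5]]
enhanced_video = enhance(video_vectors, concept_vectors)
```

The same `enhance` call could be applied separately to video and query vectors, which is one way to read the modality-specific enrichment mentioned in the TL;DR.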
