
Exploiting Auxiliary Caption for Video Grounding

Hongxiang Li, Meng Cao, Xuxin Cheng, Zhihong Zhu, Yaowei Li, Yuexian Zou

TL;DR

The paper tackles the sparse annotation problem in video grounding by exploiting auxiliary captions generated from dense video captioning. It introduces ACNet, which incorporates Non-Auxiliary Caption Suppression (NACS) to select reliable auxiliary captions, Caption Guided Attention (CGA) to inject prior temporal and semantic cues into visual features, and Asymmetric Cross-modal Contrastive Learning (ACCL) to mine informative negative signals without harming exact moment localization. This combination yields improved grounding performance across ActivityNet Captions, TACoS, and ActivityNet-CG, with ablations confirming the individual and combined contributions of NACS, CGA, and ACCL. The approach demonstrates strong generalization and practical impact for robust video-language grounding in the presence of sparse annotations.

Abstract

Video grounding aims to locate a moment of interest matching the given query sentence from an untrimmed video. Previous works ignore the sparsity dilemma in video annotations: sparse annotations fail to provide the context information between potential events and query sentences in the dataset. In this paper, we contend that exploiting easily available captions which describe general actions, i.e., auxiliary captions defined in our paper, will significantly boost the performance. To this end, we propose an Auxiliary Caption Network (ACNet) for video grounding. Specifically, we first introduce dense video captioning to generate dense captions and then obtain auxiliary captions by Non-Auxiliary Caption Suppression (NACS). To capture the potential information in auxiliary captions, we propose Caption Guided Attention (CGA) to project the semantic relations between auxiliary captions and query sentences into temporal space and fuse them into visual representations. Considering the gap between auxiliary captions and ground truth, we propose Asymmetric Cross-modal Contrastive Learning (ACCL) for constructing more negative pairs to maximize cross-modal mutual information. Extensive experiments on three public datasets (i.e., ActivityNet Captions, TACoS and ActivityNet-CG) demonstrate that our method significantly outperforms state-of-the-art methods.
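The abstract does not spell out how NACS selects auxiliary captions. As a rough illustration only, the sketch below assumes a greedy temporal-IoU suppression in the style of non-maximum suppression: generated captions that overlap an annotated moment, or an already kept caption, are dropped. The function names, the tuple layout, and the `iou_threshold` and `keep` parameters are hypothetical and are not taken from the paper.

```python
def temporal_iou(a, b):
    """Temporal IoU of two (start, end) spans given in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def suppress_captions(candidates, annotated, iou_threshold=0.5, keep=3):
    """Greedy suppression over dense-caption candidates (assumed behavior).

    candidates: list of (start, end, score, text) from a dense captioner.
    annotated:  list of (start, end) spans that already have annotations.
    Returns at most `keep` candidates, highest score first, that overlap
    neither an annotated span nor a previously kept candidate by more
    than `iou_threshold`.
    """
    kept = []
    for cand in sorted(candidates, key=lambda c: c[2], reverse=True):
        span = cand[:2]
        # Drop captions that merely restate an annotated moment.
        if any(temporal_iou(span, g) > iou_threshold for g in annotated):
            continue
        # Drop captions that duplicate one we already kept.
        if any(temporal_iou(span, k[:2]) > iou_threshold for k in kept):
            continue
        kept.append(cand)
        if len(kept) == keep:
            break
    return kept


candidates = [
    (0, 10, 0.90, "a"),
    (1, 11, 0.80, "b"),
    (50, 60, 0.70, "c"),
    (100, 120, 0.95, "d"),
]
auxiliary = suppress_captions(candidates, annotated=[(100, 118)])
print([c[3] for c in auxiliary])
```

In this toy input, caption "d" is discarded because it nearly coincides with the annotated span, and "b" is discarded because it overlaps the higher-scoring "a". The actual NACS criteria may differ.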

Paper Structure

This paper contains 16 sections, 10 equations, 5 figures, 5 tables, 1 algorithm.

Figures (5)

  • Figure 1: The sparse annotation dilemma in video grounding. The annotated captions (marked in green) in the dataset are sparse, while many uncovered captions (marked in red) remain. This 218-second video from ActivityNet Captions has only 2 annotations.
  • Figure 2: Performance comparison of ACNet with two representative models, MMN (Wang et al., 2022) and 2D-TAN (Zhang et al., 2020), trained with dense caption data augmentation (w/ DA) on ActivityNet Captions. $l_c$ denotes the number of additional moment-sentence pairs per video.
  • Figure 3: Overview of the proposed Auxiliary Caption Network (ACNet). Auxiliary captions are filtered from the outputs of PDVC (Wang et al., 2021) by our proposed Non-Auxiliary Caption Suppression (NACS) algorithm. We convert the timestamps of the auxiliary captions to the 2D map form following 2D-TAN (Zhang et al., 2020) and MMN (Wang et al., 2022). Then, video segments and query sentences are encoded by their respective feature encoders for regression learning and cross-modal contrastive learning. In the regression branch, Caption Guided Attention (CGA) calculates semantic relations between query features $Q_r$ and auxiliary caption features $Q_r^t$. We then project these relations to visual space to obtain visual representations $V_r^\prime$ with prior knowledge. $V_r^\prime$ and query features $Q_r$ are used for prediction and loss computation. In the cross-modal learning branch, the encoded video features $V_c$ and query features $Q_c$ are directly fed into the prediction module and loss function. $\otimes$ and $\odot$ indicate matrix multiplication and Hadamard product, respectively.
  • Figure 4: Illustration of our Caption Guided Attention (CGA).
  • Figure 5: Illustration of our asymmetric push-and-pull strategy, in contrast to that of the original supervised contrastive learning. Elements with the same color come from the same moment-sentence pair. $\mathcal{G}$ and $\mathcal{D}$ are the sets of moment-sentence pairs of ground truth and auxiliary caption, respectively.
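The 2D map form mentioned in the Figure 3 caption can be illustrated with a minimal sketch. It assumes the common 2D-TAN-style layout in which a video is split into N equal clips and cell (i, j) with i <= j stands for the moment spanning clips i through j. The function names and the clip count are illustrative choices, not values from the paper.

```python
import math


def timestamp_to_2d_index(start, end, duration, num_clips):
    """Map a (start, end) span in seconds to a cell (i, j) of an N x N map."""
    if not 0 <= start < end <= duration:
        raise ValueError("expected 0 <= start < end <= duration")
    clip_len = duration / num_clips
    # Clip that contains the start time.
    i = min(int(start / clip_len), num_clips - 1)
    # Last clip touched by the span; never before the start clip.
    j = min(max(math.ceil(end / clip_len) - 1, i), num_clips - 1)
    return i, j


def iou_map(start, end, duration, num_clips):
    """IoU between the span and every candidate moment on the 2D map.

    Cells below the diagonal (i > j) are not valid moments and stay 0.
    """
    clip_len = duration / num_clips
    grid = [[0.0] * num_clips for _ in range(num_clips)]
    for i in range(num_clips):
        for j in range(i, num_clips):
            s, e = i * clip_len, (j + 1) * clip_len
            inter = max(0.0, min(e, end) - max(s, start))
            # Equals the true union whenever the spans overlap; when they
            # do not, inter is 0 and the ratio is 0 regardless.
            hull = max(e, end) - min(s, start)
            grid[i][j] = inter / hull
    return grid


# A 64-second video split into 16 clips of 4 seconds each.
cell = timestamp_to_2d_index(8, 20, duration=64, num_clips=16)
scores = iou_map(8, 20, duration=64, num_clips=16)
print(cell, scores[cell[0]][cell[1]])
```

A map like `scores` is the usual regression target in 2D-map grounding models. Whether ACNet uses exactly this discretization for auxiliary captions is an assumption here.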