G2L: Semantically Aligned and Uniform Video Grounding via Geodesic and Game Theory

Hongxiang Li, Meng Cao, Xuxin Cheng, Yaowei Li, Zhihong Zhu, Yuexian Zou

TL;DR

This work addresses video grounding under semantic overlap and sparse annotations by replacing naive contrastive objectives with geodesic-guided learning and game-theoretic semantics. The proposed Geodesic and Game Localization (G2L) combines geodesic distance-based sampling for robust cross-modal alignment with semantic Shapley interaction modeling to capture fine-grained nuances among similar moments. The training objective combines three losses: the grounding loss $\mathcal{L}_{\mathrm{VG}}$, $\mathcal{L}_{\mathrm{GCL}}$, and $\mathcal{L}_{\mathrm{SSI}}$. At inference, the model directly fuses video and query features to predict moments. Empirical results on ActivityNet-Captions, Charades-STA, and TACoS show consistent improvements over state-of-the-art contrastive-learning baselines, particularly on datasets with strong semantic overlap, indicating the practical value of geodesic and Shapley-informed representations for cross-modal grounding.
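
To make the game-theoretic ingredient concrete, the sketch below computes the standard pairwise Shapley interaction index on a toy cooperative game. It is only an illustration under stated assumptions: the players (`tokens`), the value function `toy_value`, and the function name `shapley_interaction` are hypothetical, and this is not the paper's semantic Shapley interaction loss $\mathcal{L}_{\mathrm{SSI}}$, whose exact form is not given in this summary.

```python
from itertools import combinations
from math import factorial


def shapley_interaction(players, value, i, j):
    """Standard pairwise Shapley interaction index of players i and j.

    `value` maps a frozenset of players to a real-valued payoff.
    """
    others = [p for p in players if p not in (i, j)]
    n = len(players)
    total = 0.0
    for size in range(len(others) + 1):
        # Shapley weight for a context coalition of this size.
        weight = factorial(size) * factorial(n - size - 2) / factorial(n - 1)
        for subset in combinations(others, size):
            s = frozenset(subset)
            # Extra payoff from having i and j together rather than separately.
            delta = value(s | {i, j}) - value(s | {i}) - value(s | {j}) + value(s)
            total += weight * delta
    return total


def toy_value(coalition):
    # Hypothetical payoff: one unit per token, plus a bonus of 2 when
    # "opens" and "door" occur together (a stand-in for joint semantics).
    bonus = 2.0 if {"opens", "door"} <= coalition else 0.0
    return len(coalition) + bonus


tokens = ["person", "opens", "door", "window"]
print(shapley_interaction(tokens, toy_value, "opens", "door"))
print(shapley_interaction(tokens, toy_value, "opens", "window"))
```

A positive index means the two players contribute more jointly than the sum of their separate contributions; an index near zero means they act independently in this toy game.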

Abstract

Recent video grounding works attempt to introduce vanilla contrastive learning into video grounding. However, we claim that this naive solution is suboptimal. Contrastive learning requires two key properties: (1) alignment of features of similar samples, and (2) uniformity of the induced distribution of the normalized features on the hypersphere. Video grounding suffers from two troublesome issues: (1) some visual entities co-exist in both the ground truth and other moments, i.e., semantic overlapping; and (2) only a few moments in the video are annotated, i.e., the sparse annotation dilemma. As a result, vanilla contrastive learning is unable to model the correlations between temporally distant moments and learns inconsistent video representations. Both characteristics make vanilla contrastive learning unsuitable for video grounding. In this paper, we introduce Geodesic and Game Localization (G2L), a semantically aligned and uniform video grounding framework via geodesic and game theory. We quantify the correlations among moments using the geodesic distance, which guides the model to learn the correct cross-modal representations. Furthermore, from the novel perspective of game theory, we propose semantic Shapley interaction based on geodesic distance sampling to learn fine-grained semantic alignment in similar moments. Experiments on three benchmarks demonstrate the effectiveness of our method.
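
The two properties named in the abstract, alignment and uniformity, can be made measurable. The sketch below uses one common formulation from the contrastive-learning literature (mean squared distance between positive pairs for alignment, log of the mean Gaussian potential over all pairs for uniformity). The function names, the exponents `alpha` and `t`, and the toy 2-D features are assumptions for illustration; the source does not state which formulation, if any, the paper uses.

```python
import math
from itertools import combinations


def l2_normalize(v):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def alignment(positive_pairs, alpha=2):
    """Mean distance**alpha between features of positive pairs (lower is better)."""
    total = sum(math.dist(a, b) ** alpha for a, b in positive_pairs)
    return total / len(positive_pairs)


def uniformity(features, t=2.0):
    """Log of the mean Gaussian potential over all distinct pairs (lower is more uniform)."""
    potentials = [
        math.exp(-t * math.dist(a, b) ** 2) for a, b in combinations(features, 2)
    ]
    return math.log(sum(potentials) / len(potentials))


# Hypothetical 2-D features: one set spread around the circle, one collapsed.
spread = [l2_normalize(v) for v in [[1, 0], [0, 1], [-1, 0], [0, -1]]]
clustered = [l2_normalize(v) for v in [[1, 0], [1, 0.1], [1, -0.1], [1, 0.05]]]

print("uniformity (spread):   ", uniformity(spread))
print("uniformity (clustered):", uniformity(clustered))
print("alignment of a near-duplicate pair:", alignment([(clustered[0], clustered[1])]))
```

Features that collapse into one region score well on alignment but poorly on uniformity, which is the tension the abstract attributes to semantic overlap and sparse annotation.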

Paper Structure (17 sections, 9 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: (a) Illustration of video grounding. 'GT' indicates the ground truth. A comparison of (b) existing contrastive learning-based methods and (c) our proposed G2L method. G2L makes semantically similar video moments closer in representation space while exploring nuances among similar moments.
  • Figure 2: Overview of Geodesic and Game Localization (G2L). Our framework encourages the model to learn semantically aligned and uniform joint representations. In the inference stage, we directly fuse the video features and query features to compute the predicted moments. In the training stage, the grounding loss $\mathcal{L}_{\mathrm{VG}}$ is obtained by calculating the cross-entropy between the predicted moment and the target moment. Then, we approximate the high-dimensional manifold structure of the video representations through a moment graph and calculate the geodesic distance from the target moment to other moments. Finally, we leverage geodesic distance for cross-modal discrimination and semantic Shapley interaction modeling.
  • Figure 3: Projected video moment features. (a) Learned representations of the previous method with vanilla contrastive learning; (b) learned representations of our method.
  • Figure 4: Qualitative results of our method on ActivityNet-Captions.
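
The Figure 2 caption describes approximating the manifold of video representations with a moment graph and measuring geodesic distance from the target moment to the others. A minimal sketch of that idea follows, assuming a symmetric k-nearest-neighbour graph with Euclidean edge weights and Dijkstra shortest paths. The graph construction, the value of `k`, the function names, and the toy features are all assumptions, since this summary does not specify how the paper builds its moment graph.

```python
import heapq
import math


def knn_graph(features, k):
    """Link each node to its k nearest neighbours (symmetric, Euclidean weights)."""
    n = len(features)
    graph = {u: {} for u in range(n)}
    for u in range(n):
        nearest = sorted(
            (math.dist(features[u], features[v]), v) for v in range(n) if v != u
        )
        for d, v in nearest[:k]:
            graph[u][v] = d
            graph[v][u] = d
    return graph


def geodesic_distances(graph, source):
    """Dijkstra shortest-path lengths from `source`; unreachable nodes stay at infinity."""
    dist = {u: math.inf for u in graph}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


# Hypothetical moment features lying on a half circle (a curved 1-D manifold).
angles = [math.radians(a) for a in range(0, 181, 30)]
features = [(math.cos(a), math.sin(a)) for a in angles]

graph = knn_graph(features, k=2)
geo = geodesic_distances(graph, source=0)  # node 0 plays the target moment

print("Euclidean distance 0 -> 6:", math.dist(features[0], features[6]))
print("Geodesic distance 0 -> 6: ", geo[6])
```

On curved data the geodesic distance follows the manifold and exceeds the straight-line distance, so two moments that look close in Euclidean terms can still be ranked as far apart along the graph.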