SnAG: Scalable and Accurate Video Grounding

Fangzhou Mu, Sicheng Mo, Yin Li

TL;DR

The paper addresses the scalability of temporal video grounding in long videos with many text queries by analyzing cross-modal fusion strategies. It shows that late fusion, coupled with video-centric training, enables scalable inference and training while preserving or improving accuracy, leading to the SnAG baseline. SnAG achieves state-of-the-art or competitive results across long-form benchmarks (MAD, Ego4D-NLQ, TACoS) with significant efficiency gains, and remains strong on short-form datasets (Charades-STA, ActivityNet-Captions). The work provides both theoretical cost insights and extensive empirical validation, highlighting the practical impact of fusion design on scalable vision-language understanding.

Abstract

Temporal grounding of text descriptions in videos is a central problem in vision-language learning and video understanding. Existing methods often prioritize accuracy over scalability -- they have been optimized for grounding only a few text queries within short videos, and fail to scale up to long videos with hundreds of queries. In this paper, we study the effect of cross-modal fusion on the scalability of video grounding models. Our analysis establishes late fusion as a more cost-effective fusion scheme for long-form videos with many text queries. Moreover, it leads us to a novel, video-centric sampling scheme for efficient training. Based on these findings, we present SnAG, a simple baseline for scalable and accurate video grounding. Without bells and whistles, SnAG is 43% more accurate and 1.5x faster than CONE, a state of the art for long-form video grounding on the challenging MAD dataset, while achieving highly competitive results on short videos.
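
The cost argument for late fusion can be illustrated with a small counting sketch. This is not SnAG's implementation: `CountingStub`, `ground_with_late_fusion`, `ground_with_early_fusion`, and the toy data are hypothetical names introduced only to show why encoding a video once and reusing it across queries reduces the number of expensive video passes.

```python
class CountingStub:
    """Toy stand-in for a network module that records how often it is called."""

    def __init__(self, name):
        self.name = name
        self.calls = 0

    def __call__(self, *inputs):
        self.calls += 1
        return (self.name, inputs)


def ground_with_late_fusion(queries_by_video, video_encoder, query_encoder, fuse_and_decode):
    """Encode each video once, then reuse the cached features for all of its queries."""
    predictions = {}
    for video, queries in queries_by_video.items():
        video_features = video_encoder(video)  # expensive step, done once per video
        for query in queries:
            query_features = query_encoder(query)  # inexpensive step, done per query
            predictions[(video, query)] = fuse_and_decode(video_features, query_features)
    return predictions


def ground_with_early_fusion(queries_by_video, joint_encoder):
    """Jointly encode every (video, query) pair, so the video is re-processed per query."""
    predictions = {}
    for video, queries in queries_by_video.items():
        for query in queries:
            predictions[(video, query)] = joint_encoder(video, query)
    return predictions


# Hypothetical toy data: 2 videos with 3 queries each.
queries_by_video = {
    "video_a": ["q1", "q2", "q3"],
    "video_b": ["q4", "q5", "q6"],
}

video_encoder = CountingStub("video_encoder")
query_encoder = CountingStub("query_encoder")
fuse_and_decode = CountingStub("fuse_and_decode")
joint_encoder = CountingStub("joint_encoder")

late_predictions = ground_with_late_fusion(
    queries_by_video, video_encoder, query_encoder, fuse_and_decode
)
early_predictions = ground_with_early_fusion(queries_by_video, joint_encoder)

print("late fusion, video encoder passes:", video_encoder.calls)
print("early fusion, joint encoder passes:", joint_encoder.calls)
```

On this toy data, late fusion runs the video encoder twice (once per video), while early fusion runs the joint encoder six times (once per video-query pair). The gap widens as the number of queries per video grows, which is the regime the abstract targets.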

Paper Structure (19 sections, 12 equations, 7 figures, 8 tables)

This paper contains 19 sections, 12 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: SnAG achieves the best accuracy and throughput simultaneously on the MAD dataset (Soldan et al., 2022) for long-form video grounding.
  • Figure 2: (a) Cross-modal fusion is key to a video grounding model $\mathcal{F}$. Models using early fusion jointly encode the video and the sentence query. SnAG revisits late fusion for scalable video grounding by decoupling expensive video encoding from inexpensive query encoding. (b) Video-centric model evaluation. With late fusion, the output of the video encoder can be cached and reused by all queries of the same video in both training and inference. (c) Mini-batch sampling in training. Previous methods adopt query-centric sampling (query → video), whereas SnAG uses video-centric sampling (video → many queries) for efficient training. (d) Model overview. SnAG is a simple instantiation of late fusion and video-centric training for video grounding. It separately encodes a video and its queries using Transformers, applies simple cross-attention for cross-modal fusion, and decodes moments represented as points using lightweight convolutional heads.
  • Figure 3: Visualization of dataset statistics. Circle radius is proportional to the average number of queries per video. Long-video benchmarks (Ego4D-NLQ, MAD and TACoS) consist of longer videos with more queries and exhibit lower moment coverage than short-video benchmarks (ANet-Captions and Charades).
  • Figure 4: Visualization of moment predictions. SnAG can (a) comprehend complex text queries with multiple objects and actions; (b) reason about temporal ordering of events.
  • Figure 5: (a) Model capacity. T, E, M denote TACoS, Ego4D and MAD, respectively (also in (b)). Shaded bars: % of parameters; colored bars: % of MACs. SnAG places more parameters and compute in the video encoder for less expensive per-query evaluation (fusion + decoding). (b, c) Video-centric vs. query-centric inference (b) and training (c). SnAG saves up to 50% training time, 40% GPU memory and 80% test time, and delivers up to 40% faster convergence relative to query-centric training / inference. (d) Effect of $B_q$ on training efficiency and test accuracy on MAD. Training on MAD is faster and takes less GPU memory as $B_q$ grows, while the test accuracy is unaffected.
  • ...and 2 more figures
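
The two mini-batch sampling schemes contrasted in the Figure 2(c) caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' training code: the function names, the `annotations` layout, and the toy data are hypothetical, and reading $B_q$ as the number of queries drawn per sampled video is an assumption based on the Figure 5(d) caption.

```python
import random


def sample_video_centric(annotations, num_videos, queries_per_video, rng):
    """Pick videos first, then draw several queries for each (video -> many queries)."""
    videos = rng.sample(sorted(annotations), k=min(num_videos, len(annotations)))
    batch = []
    for video in videos:
        queries = annotations[video]
        chosen = rng.sample(queries, k=min(queries_per_video, len(queries)))
        batch.append((video, chosen))
    return batch


def sample_query_centric(annotations, batch_size, rng):
    """Pick (video, query) pairs independently (query -> video)."""
    pairs = [(video, query) for video, queries in sorted(annotations.items()) for query in queries]
    chosen = rng.sample(pairs, k=min(batch_size, len(pairs)))
    # Each pair is its own batch entry, so its video is encoded separately.
    return [(video, [query]) for video, query in chosen]


def video_encodings_needed(batch):
    """One video-encoder pass per batch entry."""
    return len(batch)


def queries_in_batch(batch):
    """Total number of queries supervised by the batch."""
    return sum(len(queries) for _, queries in batch)


# Hypothetical toy annotations: 3 videos with 5 queries each.
annotations = {
    f"video_{i}": [f"video_{i}_query_{j}" for j in range(5)] for i in range(3)
}

rng = random.Random(0)
video_centric_batch = sample_video_centric(annotations, num_videos=2, queries_per_video=4, rng=rng)
query_centric_batch = sample_query_centric(annotations, batch_size=8, rng=rng)

print("video-centric:", queries_in_batch(video_centric_batch), "queries,",
      video_encodings_needed(video_centric_batch), "video passes")
print("query-centric:", queries_in_batch(query_centric_batch), "queries,",
      video_encodings_needed(query_centric_batch), "video passes")
```

Both batches supervise eight queries, but the video-centric batch needs two video-encoder passes while the query-centric batch needs eight. Raising `queries_per_video` amortizes each video pass over more queries, which is consistent with the trend the Figure 5(d) caption reports for $B_q$.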