SnAG: Scalable and Accurate Video Grounding
Fangzhou Mu, Sicheng Mo, Yin Li
TL;DR
The paper addresses the scalability of temporal video grounding in long videos with many text queries by analyzing cross-modal fusion strategies. It shows that late fusion, coupled with video-centric training, enables scalable inference and training while preserving or improving accuracy, leading to the SnAG baseline. SnAG achieves state-of-the-art or competitive results across long-form benchmarks (MAD, Ego4D-NLQ, TACoS) with significant efficiency gains, and remains strong on short-form datasets (Charades-STA, ActivityNet-Captions). The work provides both theoretical cost insights and extensive empirical validation, highlighting the practical impact of fusion design on scalable vision-language understanding.
Abstract
Temporal grounding of text descriptions in videos is a central problem in vision-language learning and video understanding. Existing methods often prioritize accuracy over scalability: they have been optimized for grounding only a few text queries within short videos, and fail to scale up to long videos with hundreds of queries. In this paper, we study the effect of cross-modal fusion on the scalability of video grounding models. Our analysis establishes late fusion as a more cost-effective fusion scheme for long-form videos with many text queries. Moreover, it leads us to a novel, video-centric sampling scheme for efficient training. Based on these findings, we present SnAG, a simple baseline for scalable and accurate video grounding. Without bells and whistles, SnAG is 43% more accurate and 1.5x faster than CONE, a state-of-the-art method for long-form video grounding, on the challenging MAD dataset, while achieving highly competitive results on short videos.
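
The claim that late fusion is more cost-effective for many queries can be illustrated with a toy cost model. The sketch below is not taken from the paper: the function names and the unit costs (`video_cost`, `fusion_cost`) are hypothetical. It assumes that early fusion must re-run the video encoder for every query, while late fusion encodes the video once and applies only a lightweight fusion step per query.

```python
def early_fusion_cost(num_queries: int, video_cost: int, fusion_cost: int) -> int:
    """Toy cost when the video encoder is conditioned on the query.

    Assumption: the video must be re-encoded for each query, so both the
    encoding cost and the fusion cost scale with the number of queries.
    """
    return num_queries * (video_cost + fusion_cost)


def late_fusion_cost(num_queries: int, video_cost: int, fusion_cost: int) -> int:
    """Toy cost when the video is encoded once and shared across queries.

    Assumption: only the fusion step is repeated per query.
    """
    return video_cost + num_queries * fusion_cost


if __name__ == "__main__":
    # Hypothetical unit costs: encoding a long video is 10x the cost of one fusion step.
    video_cost, fusion_cost = 10, 1
    for num_queries in (1, 10, 100):
        early = early_fusion_cost(num_queries, video_cost, fusion_cost)
        late = late_fusion_cost(num_queries, video_cost, fusion_cost)
        print(f"queries={num_queries:>3}  early={early:>5}  late={late:>4}")
```

Under these assumptions the two schemes cost the same for a single query, and the gap widens as the number of queries per video grows, which is the regime that long-form benchmarks such as MAD represent. The same reasoning suggests why sampling several queries from one video per training step (video-centric sampling) can amortize the video encoding cost.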
