Infusing Environmental Captions for Long-Form Video Language Grounding
Hyogun Lee, Soyeon Hong, Mujeen Sung, Jinwoo Choi
TL;DR
This work tackles long-form video-language grounding (LFVLG), where ground-truth moments are sparsely distributed within long videos. It introduces EI-VLG, a method that infuses environment cues, drawn from captions generated by a multi-modal large language model (MLLM), into a video-language grounding model to narrow the search space. The approach comprises three components: an Environment Encoder, which pairs a frozen caption generator with a learnable text encoder trained via a marginal log-likelihood objective; an Environment Infuser, which fuses environment cues with video features through cross-attention; and a flexible Video-Language Grounding model. Evaluated on EgoNLQ, EI-VLG achieves state-of-the-art performance on most metrics, which suggests that externally generated environmental descriptions are useful for long-form video understanding.
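To make the cross-attention fusion step concrete, the following is a minimal sketch and not the authors' implementation. It assumes a single attention head, video features acting as queries, caption embeddings acting as keys and values, and a residual connection that adds the attended environment context back onto the video features. The function name `infuse_environment`, the projection matrices `w_q`, `w_k`, `w_v`, and all dimensions are hypothetical and chosen only for illustration.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def infuse_environment(video_feats, env_feats, w_q, w_k, w_v):
    """Single-head cross-attention: video clips attend to caption embeddings.

    video_feats: T x d clip features (queries)
    env_feats:   N x d caption embeddings (keys and values)
    Returns the fused T x d features and the T x N attention weights.
    """
    q = matmul(video_feats, w_q)
    k = matmul(env_feats, w_k)
    v = matmul(env_feats, w_v)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]  # transpose keys to d x N
    scores = matmul(q, k_t)
    attn = [softmax([s / scale for s in row]) for row in scores]
    context = matmul(attn, v)
    # Residual connection (an assumption of this sketch).
    fused = [[a + b for a, b in zip(v_row, c_row)]
             for v_row, c_row in zip(video_feats, context)]
    return fused, attn


def random_matrix(rows, cols, rng, scale=1.0):
    """Gaussian placeholder data; real features would come from encoders."""
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
T, N, d = 8, 3, 16  # clips, captions, feature size (all hypothetical)
video_feats = random_matrix(T, d, rng)
env_feats = random_matrix(N, d, rng)
w_q = random_matrix(d, d, rng, 1 / math.sqrt(d))
w_k = random_matrix(d, d, rng, 1 / math.sqrt(d))
w_v = random_matrix(d, d, rng, 1 / math.sqrt(d))

fused, attn = infuse_environment(video_feats, env_feats, w_q, w_k, w_v)
```

Each row of `attn` is a distribution over the captions, so every clip receives a weighted mixture of environment descriptions. In a trained model the projections would be learned rather than random.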
Abstract
In this work, we tackle the problem of long-form video-language grounding (VLG). Given a long-form video and a natural language query, a model should temporally localize the precise moment that answers the query. Humans can easily solve VLG tasks, even with arbitrarily long videos, by discarding irrelevant moments using extensive and robust knowledge gained from experience. Unlike humans, existing VLG methods are prone to relying on superficial cues learned from small-scale datasets, even when those cues appear in irrelevant frames. To overcome this challenge, we propose EI-VLG, a VLG method that leverages richer textual information provided by a Multi-modal Large Language Model (MLLM) as a proxy for human experience, helping to effectively exclude irrelevant frames. We validate the effectiveness of the proposed method via extensive experiments on the challenging EgoNLQ benchmark.
