Harnessing Object Grounding for Time-Sensitive Video Understanding
Tz-Ying Wu, Sharath Nittur Sridhar, Subarna Tripathi
TL;DR
The paper tackles time-sensitive video understanding (TSV) by incorporating grounded object (GO) information into Video-LLMs through a token-efficient GO-Tokenizer. It demonstrates that compact object tokens derived from ROI pooling and frame-wise time embeddings can outperform text-based GO prompts and vanilla Video-LLMs on TSV tasks such as temporal localization and dense captioning. Key contributions include the GO-Tokenizer architecture, end-to-end training with LoRA, and extensive ablations showing robustness to detector choice, frame sampling, and object count. The approach generalizes across models and datasets, improving TSV performance while avoiding the extra token length and noise sensitivity of text-based object prompts.
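To make the idea of compact object tokens concrete, the following is a minimal sketch of how one token per detected object might be formed from ROI-pooled features and a frame-wise time embedding. It is not the paper's implementation: the function names, the use of average pooling, the sinusoidal form of the time embedding, and the choice to add the two vectors are all assumptions made for illustration.

```python
import math
from typing import List, Tuple

# Box in feature-map cells: (x0, y0, x1, y1), with x1 and y1 exclusive.
Box = Tuple[int, int, int, int]
# Feature map indexed as feature_map[y][x] -> list of C floats.
FeatureMap = List[List[List[float]]]


def roi_average_pool(feature_map: FeatureMap, box: Box) -> List[float]:
    """Average the feature vectors of all cells inside the box (assumed pooling)."""
    x0, y0, x1, y1 = box
    cells = [feature_map[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    if not cells:
        raise ValueError("box covers no feature-map cells")
    channels = len(cells[0])
    return [sum(cell[k] for cell in cells) / len(cells) for k in range(channels)]


def time_embedding(frame_index: int, dim: int) -> List[float]:
    """Sinusoidal embedding of the frame index (assumed; could also be learned)."""
    emb = []
    for k in range(dim):
        freq = 1.0 / (10000 ** ((2 * (k // 2)) / dim))
        angle = frame_index * freq
        emb.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return emb


def object_tokens(frames: List[FeatureMap], boxes: List[List[Box]]) -> List[List[float]]:
    """Return one token per detected object: pooled feature plus time embedding."""
    tokens = []
    for t, (fmap, frame_boxes) in enumerate(zip(frames, boxes)):
        dim = len(fmap[0][0])
        t_emb = time_embedding(t, dim)
        for box in frame_boxes:
            pooled = roi_average_pool(fmap, box)
            tokens.append([p + e for p, e in zip(pooled, t_emb)])
    return tokens


# Toy 2x2 feature map with 2 channels, reused for two frames.
fmap = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
tokens = object_tokens([fmap, fmap], [[(0, 0, 2, 2)], [(0, 0, 1, 1), (1, 1, 2, 2)]])
```

In this sketch the number of tokens equals the number of detected objects, regardless of how many words a textual description of each object would need.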
Abstract
We propose to improve the time-sensitive video understanding (TSV) capability of video large language models (Video-LLMs) with grounded objects (GO). We hypothesize that TSV tasks can benefit from GO within frames, which is supported by our preliminary experiments on LITA, a state-of-the-art Video-LLM for reasoning temporal localization. While augmenting prompts with textual descriptions of these object annotations improves the performance of LITA, it also introduces extra token length and susceptibility to noise in the object-level information. To address this, we propose GO-Tokenizer, a lightweight add-on module for Video-LLMs that leverages off-the-shelf object detectors to encode compact object information on the fly. Experimental results demonstrate that pretraining with GO-Tokenizer outperforms both the vanilla Video-LLM and its counterpart that utilizes textual descriptions of objects in the prompt. The gain generalizes across different models, datasets, and video understanding tasks, such as reasoning temporal localization and dense captioning.
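The token-length cost of text-based object prompts can be illustrated with a small sketch. The prompt template, the helper names, and the whitespace-based token count below are hypothetical stand-ins; the actual prompt format and tokenizer used in the experiments are not given in this section.

```python
from typing import List, Tuple

# One detection: (label, normalized box (x0, y0, x1, y1)). Format is assumed.
Detection = Tuple[str, Tuple[float, float, float, float]]


def go_text_prompt(detections: List[List[Detection]]) -> str:
    """Render per-frame detections as plain text (hypothetical template)."""
    lines = []
    for t, objs in enumerate(detections):
        desc = "; ".join(
            f"{label} at [{x0:.2f}, {y0:.2f}, {x1:.2f}, {y1:.2f}]"
            for label, (x0, y0, x1, y1) in objs
        )
        lines.append(f"Frame {t}: {desc}")
    return "\n".join(lines)


def rough_token_count(text: str) -> int:
    """Whitespace word count, a crude proxy for a real tokenizer's count."""
    return len(text.split())


detections = [
    [("dog", (0.1, 0.2, 0.5, 0.9))],
    [("dog", (0.2, 0.2, 0.6, 0.9)), ("ball", (0.7, 0.8, 0.8, 0.9))],
]
prompt = go_text_prompt(detections)
text_tokens = rough_token_count(prompt)           # grows with words per object
num_objects = sum(len(objs) for objs in detections)  # one compact token per object
```

Even under this crude count, the textual prompt spends several tokens per object, whereas a compact encoding would spend one, and a real subword tokenizer would typically split the coordinates into even more pieces.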
