Harnessing Object Grounding for Time-Sensitive Video Understanding

Tz-Ying Wu, Sharath Nittur Sridhar, Subarna Tripathi

TL;DR

The paper tackles time-sensitive video understanding by incorporating grounded object information into Video-LLMs through a token-efficient GO-Tokenizer. It demonstrates that compact object tokens derived from ROI pooling and frame-wise time embeddings can outperform text-based GO prompts and vanilla Video-LLMs on TSV tasks such as temporal localization and dense captioning. Key contributions include the GO-Tokenizer architecture, end-to-end training with LoRA, and extensive ablations showing robustness to detector choice, frame sampling, and object count. The approach generalizes across models and datasets, improving TSV performance while mitigating token length concerns and noise sensitivity, with practical implications for more accurate and scalable video understanding systems.

Abstract

We propose to improve the time-sensitive video understanding (TSV) capability of video large language models (Video-LLMs) with grounded objects (GO). We hypothesize that TSV tasks can benefit from GO within frames, which is supported by our preliminary experiments on LITA, a state-of-the-art Video-LLM for reasoning temporal localization. While augmenting prompts with textual descriptions of these object annotations improves the performance of LITA, it also increases token length and makes the model susceptible to noise in the object-level information. To address this, we propose GO-Tokenizer, a lightweight add-on module for Video-LLMs that leverages off-the-shelf object detectors to encode compact object information on the fly. Experimental results demonstrate that pretraining with GO-Tokenizer outperforms both the vanilla Video-LLM and its counterpart that uses textual descriptions of objects in the prompt. The gain generalizes across different models, datasets, and video understanding tasks, such as reasoning temporal localization and dense captioning.

Paper Structure

This paper contains 29 sections, 4 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Video tokens in existing Video-LLMs (Huang et al., 2024) are highly compressed and compromise on spatial information. We harness GO information in sparsely sampled video frames as a supplement to improve the time-sensitive video understanding (TSV) tasks.
  • Figure 1: Using different levels of ground truth GO information as context to evaluate LITA-13B (Huang et al., 2024) on ActivityNet-RTL-GO.
  • Figure 2: (Left) Model architecture of GO-Video. The LLM input space is augmented with the object tokens extracted by the GO-Tokenizer. GO information is extracted with off-the-shelf object detectors during inference. (Right) GO-Tokenizer extracts the object semantics, locations and time information of each object into a single object token.
  • Figure 3: ROI-Patch-Pool. All the patches covered by the bounding box are average-pooled.
  • Figure 4: Evaluating LITA-13B on the ActivityNet-RTL dataset with/without GO-Tokenizer. YOLO-World (Cheng et al., 2024) is adopted at inference time to extract the GO information.
  • ...and 3 more figures
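
The ROI-Patch-Pool caption (Figure 3) says that all patches covered by a bounding box are average-pooled. The sketch below illustrates that idea on a toy patch grid; it is not the authors' implementation. The function name `roi_patch_pool`, the normalized `(x1, y1, x2, y2)` box format, and the rule that a patch counts as "covered" when the box overlaps it at all are assumptions made for illustration.

```python
import math


def roi_patch_pool(patch_feats, box):
    """Average the features of every patch that the box overlaps.

    patch_feats: patch grid of shape [H][W][D] as nested lists of floats.
    box: (x1, y1, x2, y2) in normalized [0, 1] image coordinates
         (an assumed format; the source does not specify one).
    """
    rows, cols = len(patch_feats), len(patch_feats[0])
    dim = len(patch_feats[0][0])
    x1, y1, x2, y2 = box

    # First and last patch index touched by the box along each axis,
    # clamped so that a degenerate box still selects one patch.
    c0 = min(max(math.floor(x1 * cols), 0), cols - 1)
    r0 = min(max(math.floor(y1 * rows), 0), rows - 1)
    c1 = min(max(math.ceil(x2 * cols) - 1, c0), cols - 1)
    r1 = min(max(math.ceil(y2 * rows) - 1, r0), rows - 1)

    # Sum the selected patch features, then divide by the patch count.
    pooled = [0.0] * dim
    count = 0
    for r in range(r0, r1 + 1):
        for c in range(c0, c1 + 1):
            for d in range(dim):
                pooled[d] += patch_feats[r][c][d]
            count += 1
    return [v / count for v in pooled]


# Toy 2x2 patch grid with 2-dimensional features.
grid = [
    [[1.0, 0.0], [3.0, 0.0]],
    [[5.0, 2.0], [7.0, 2.0]],
]
left_column = roi_patch_pool(grid, (0.0, 0.0, 0.5, 1.0))   # patches (0,0) and (1,0)
whole_frame = roi_patch_pool(grid, (0.0, 0.0, 1.0, 1.0))   # all four patches
```

Each box yields one pooled vector regardless of its size, which is what keeps the object representation compact. According to the Figure 2 caption, the GO-Tokenizer additionally folds location and time information into the same single object token; that step is not shown here.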