SlotVTG: Object-Centric Adapter for Generalizable Video Temporal Grounding

Jiwook Han, Geo Ahn, Youngrae Kim, Jinwoo Choi

Abstract

Multimodal Large Language Models (MLLMs) have shown strong performance on Video Temporal Grounding (VTG). However, their coarse recognition capabilities are insufficient for fine-grained temporal understanding, making task-specific fine-tuning indispensable. This fine-tuning causes models to memorize dataset-specific shortcuts rather than grounding their predictions in the actual visual content, leading to poor Out-of-Domain (OOD) generalization. Object-centric learning offers a promising remedy by decomposing scenes into entity-level representations, but existing approaches require re-running the entire multi-stage training pipeline from scratch. We propose SlotVTG, a framework that steers MLLMs toward object-centric, input-grounded visual reasoning at minimal cost. SlotVTG introduces a lightweight slot adapter that decomposes visual tokens into abstract slots via slot attention and reconstructs the original sequence, where objectness priors from a self-supervised vision model encourage semantically coherent slot formation. Cross-domain evaluation on standard VTG benchmarks demonstrates that our approach significantly improves OOD robustness while maintaining competitive In-Domain (ID) performance with minimal overhead.
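
As a concrete illustration of the mechanism summarized above, the following is a minimal PyTorch sketch of a slot adapter that decomposes visual tokens into a small number of slots via iterative slot attention and then reconstructs the token sequence. The module name, slot count, iteration count, and the residual reconstruction are illustrative assumptions, not the authors' released implementation.

# Minimal sketch of a slot-attention adapter (assumptions: PyTorch, 4 slots, token dim 1024).
import torch
import torch.nn as nn

class SlotAdapter(nn.Module):
    def __init__(self, dim=1024, num_slots=4, iters=3):
        super().__init__()
        self.num_slots, self.iters, self.scale = num_slots, iters, dim ** -0.5
        # Slots are sampled from a learned Gaussian at every forward pass.
        self.slot_mu = nn.Parameter(torch.randn(1, num_slots, dim))
        self.slot_logsigma = nn.Parameter(torch.zeros(1, num_slots, dim))
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.gru = nn.GRUCell(dim, dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.norm_in, self.norm_slots = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, tokens):                       # tokens: (B, N, D) visual tokens
        B, N, D = tokens.shape
        x = self.norm_in(tokens)
        k, v = self.to_k(x), self.to_v(x)
        slots = self.slot_mu + self.slot_logsigma.exp() * torch.randn(
            B, self.num_slots, D, device=tokens.device)
        for _ in range(self.iters):                  # iterative slot attention
            q = self.to_q(self.norm_slots(slots))
            # Softmax over slots: each token distributes itself across the slots.
            attn = torch.einsum('bkd,bnd->bkn', q, k).mul(self.scale).softmax(dim=1)
            # Weighted mean over tokens gives each slot's update.
            w = attn / attn.sum(dim=-1, keepdim=True).clamp(min=1e-6)
            updates = torch.einsum('bkn,bnd->bkd', w, v)
            slots = self.gru(updates.reshape(-1, D), slots.reshape(-1, D)).view(B, -1, D)
            slots = slots + self.mlp(slots)
        # Broadcast slots back to token positions to reconstruct the sequence; the residual
        # keeps the original tokens intact while adding entity-aware structure.
        recon = torch.einsum('bkn,bkd->bnd', attn, slots)
        return tokens + recon, attn

A residual connection is one simple way to inject the reconstructed, entity-aware tokens without discarding the original sequence; the returned attention map is the kind of assignment that the alignment loss and the slot visualizations described below operate on.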

Paper Structure

This paper contains 17 sections, 11 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Motivation. Zero-shot MLLMs lack fine-grained temporal understanding, producing incorrect timestamps in both settings. Fine-tuning on a VTG dataset resolves this for In-Domain (ID) videos (left), but on Out-of-Domain (OOD) videos the model predicts timestamps based on dataset-specific shortcuts rather than the actual visual content (right). Our method leverages object-centric visual representations (bottom) that decompose each frame into semantic entities, encouraging genuine visual grounding in both seen and unseen settings.
  • Figure 2: Observations. We naively fine-tune Qwen2.5-VL-3B [bai2025qwen25vl] on Charades-STA (Cha.) [gao2017tall] (source) and evaluate on QVHighlights (QVH) [lei2021detecting] (target). (a) ID vs. OOD performance. The model achieves 63.4 R1@0.5 on ID but drops to 43.6 on OOD, confirming severe overfitting to dataset-specific patterns. (b) Visual similarity analysis. We extract visual features from the vision encoder and compute cosine similarity between ID and OOD samples; performance on the most similar 20% of OOD samples (52.8) far exceeds performance on the most dissimilar 20% (39.1), indicating that the model fails when the visual distribution shifts. (c) Noise perturbation. We report R1@0.7 for a stricter localization threshold. On ID, ground-truth perturbation causes a 17.4% drop while random perturbation causes only a 9.6% drop, a significant gap confirming that the model attends to the target moment. On OOD, however, the two cause nearly identical drops (12.6% vs. 12.1%), revealing that the model does not attend to the actual visual content under distribution shift. (d) Domain gap. The MMD distance [gretton2012kernel] between source and target distributions is substantially lower for our slot-based representations (0.097) than for the baseline (0.192), showing that object-centric decomposition reduces the domain gap (an illustrative MMD computation follows the figure list).
  • Figure 3: (a) Overview of SlotVTG. Video frames are encoded into visual tokens and projected into the LLM decoder. In the early decoder layers, a lightweight Slot Adapter decomposes visual tokens into entity-level slots via iterative slot attention, then reconstructs the token sequence. The resulting tokens carry disentangled, entity-aware representations before entering the later layers, which are fine-tuned with LoRA for temporal reasoning and answer generation. Text tokens bypass the Slot Adapter throughout. (b) Slot Alignment Loss. Token-pair similarity derived from the slot attention weights is aligned with that derived from a pre-trained DINOv2 model, encouraging semantically coherent tokens to be grouped into the same slot (an illustrative sketch of this loss follows the figure list).
  • Figure 4: Slot attention visualization. We visualize the slot assignments on samples from Cha. (ID), QVH (OOD), and ANet (OOD) by masking each frame with its highest-attending slot. Each column corresponds to one of the four learned slots.
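
For reference, the domain-gap numbers in Figure 2(d) are Maximum Mean Discrepancy values between source (ID) and target (OOD) feature distributions. Below is a minimal sketch of the standard RBF-kernel estimator; the bandwidth and the use of the biased squared-MMD estimator are assumptions, and the exact features and kernel used in the paper may differ.

# Illustrative (biased) squared-MMD estimator with an RBF kernel.
# x: (n, d) source (ID) features, y: (m, d) target (OOD) features; sigma is an assumed bandwidth.
import torch

def mmd2_rbf(x, y, sigma=1.0):
    def kernel(a, b):
        d2 = torch.cdist(a, b).pow(2)                # pairwise squared Euclidean distances
        return torch.exp(-d2 / (2 * sigma ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()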
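
The slot alignment loss of Figure 3(b) can be read as matching two token-pair similarity matrices: one implied by the slot attention assignments and one computed from frozen DINOv2 patch features. The sketch below is a hedged illustration under assumed tensor shapes; the exact similarity definitions and the choice of an MSE objective are assumptions, not the paper's implementation.

# Illustrative slot-alignment loss: align token-pair similarity from slot attention
# with token-pair similarity from a frozen DINOv2 model (assumed shapes and MSE objective).
import torch
import torch.nn.functional as F

def slot_alignment_loss(attn, dino_feats):
    # attn: (B, K, N) slot attention over N visual tokens; dino_feats: (B, N, D) DINOv2 patch features.
    a = attn / attn.sum(dim=1, keepdim=True).clamp(min=1e-6)      # each token's distribution over slots
    sim_slot = torch.einsum('bkn,bkm->bnm', a, a)                 # probability two tokens share a slot
    f = F.normalize(dino_feats, dim=-1)
    sim_dino = torch.einsum('bnd,bmd->bnm', f, f).clamp(min=0)    # cosine affinity, negatives clipped
    return F.mse_loss(sim_slot, sim_dino)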