UniversalVTG: A Universal and Lightweight Foundation Model for Video Temporal Grounding

Joungbin An, Agrim Jain, Kristen Grauman

Abstract

Video temporal grounding (VTG) is typically tackled with dataset-specific models that transfer poorly across domains and query styles. Recent efforts to overcome this limitation have adapted large multimodal language models (MLLMs) to VTG, but their high compute cost and limited video context still hinder long-video grounding. We instead scale unified supervision while keeping the model lightweight. We present UniversalVTG, a single VTG model trained with large-scale cross-dataset pretraining. An offline Query Unifier canonicalizes heterogeneous query formats into a shared declarative space, reducing linguistic mismatch and preventing the negative transfer observed under naïve joint training. Combined with an efficient grounding head, UniversalVTG scales to long, untrimmed videos. Across diverse benchmarks (GoalStep-StepGrounding, Ego4D-NLQ, TACoS, Charades-STA, and ActivityNet-Captions), one UniversalVTG checkpoint achieves state-of-the-art performance versus dedicated VTG models. Moreover, despite being $>100\times$ smaller than recent MLLM-based approaches, UniversalVTG matches or exceeds their accuracy on multiple benchmarks, offering a practical alternative to parameter-heavy MLLMs.
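
The abstract does not fix an implementation for the Query Unifier; the sketch below is only a minimal illustration, assuming canonicalization is done offline by prompting a text rewriter (e.g., an LLM) once per dataset before training or inference. The prompt wording, the llm_generate callable, and the helper names canonicalize_query / canonicalize_dataset are hypothetical and not taken from the paper.

    # Minimal sketch of offline query canonicalization in the spirit of the Query Unifier.
    # Assumptions: llm_generate is any text-rewriting callable; prompt and names are illustrative.
    from typing import Callable, List

    CANONICAL_PROMPT = (
        "Rewrite the following video search query as a single declarative sentence "
        "describing the event to localize.\n"
        "Query: {query}\n"
        "Declarative: "
    )

    def canonicalize_query(query: str, llm_generate: Callable[[str], str]) -> str:
        # Map one heterogeneous query (question, instruction, caption) into the
        # shared declarative space.
        return llm_generate(CANONICAL_PROMPT.format(query=query)).strip()

    def canonicalize_dataset(queries: List[str], llm_generate: Callable[[str], str]) -> List[str]:
        # Run once per dataset offline, so no extra cost is paid at grounding time.
        return [canonicalize_query(q, llm_generate) for q in queries]

    # Example: an Ego4D-NLQ-style question becomes a declarative event description, e.g.
    # "Where did I put the screwdriver?" -> "The camera wearer puts the screwdriver down."

Because this step runs offline, the grounding model itself stays lightweight: at inference it only sees already-canonicalized declarative queries.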

Paper Structure

This paper contains 30 sections, 2 equations, 6 figures, 12 tables.

Figures (6)

  • Figure 1: Universal without compromise: a single UniversalVTG checkpoint rivals dataset-specific SOTA.
  • Figure 2: UniversalVTG Framework. A single, lightweight model generalizes across heterogeneous video domains and query styles. (Left) Diverse videos are mapped to a Shared Representation via an efficient backbone, while a Query Unifier canonicalizes multi-style text inputs into a Unified Semantic Space. (Right) With both modalities standardized, one grounding head localizes events across varied viewpoints (ego/exo), durations (short/long), and linguistic forms (e.g., questions, declarations). UniversalVTG achieves real-time inference ($\sim$10 ms/query) suitable for long-form deployment.
  • Figure 3: Cross-Dataset Performance vs. Parameter Scale. UniversalVTG achieves performance parity with $100\times$ larger MLLMs while outperforming specialized expert models.
  • Figure S2: Qualitative results of failure cases.
  • Figure S3: Qualitative grounding results across five diverse VTG benchmarks. UniversalVTG accurately localizes temporal segments across varying camera perspectives, video durations, and linguistic query styles using a single unified model. Ground-truth (GT) and predicted segments (Ours) are denoted for each video.
  • ...and 1 more figure