Measure Twice, Cut Once: Grasping Video Structures and Event Semantics with LLMs for Video Temporal Localization

Zongshang Pang, Mayu Otani, Yuta Nakashima

TL;DR

MeCo tackles video temporal localization by eliminating boundary timestamp generation and instead leveraging the video LLM’s semantic capacity through structural tokens and query-focused captioning. It introduces two tokens, <ent> and <tst>, to capture holistic event and transition structure, and grounds these tokens with a contrastive loss $\mathcal{L}_{\text{ST}}$ that aligns frame embeddings with token representations via a temperature-scaled probability $p(\mathbf{h}_t \mid \mathbf{s}_i)$. To enrich event semantics, MeCo adds a query-focused captioning task that generates detailed captions for queried segments before emitting the corresponding <ent> token, optimizing the joint objective $\mathcal{L}_{\text{ST}} + \mathcal{L}_{\text{LM}}(\mathbf{X}_{\text{MeCo}})$. Across ET-Bench, Charades-STA, and QVHighlights, MeCo outperforms boundary-centric localization methods, highlighting the value of semantic-driven localization and structured representations in video LLMs.
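
The summary above names a temperature-scaled probability and a contrastive loss but does not spell out their form. The sketch below shows one common way such a loss can be written, under stated assumptions: the probability is a softmax over frames of the dot-product similarity between a structural token representation and each frame embedding, divided by a temperature `tau`, and the loss is the mean negative log-probability of the frames annotated as belonging to each token's segment. The function names, the default `tau=0.07`, and the toy vectors are hypothetical and are not taken from the paper.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def frame_probs(token, frames, tau=0.07):
    """Softmax over frames of the token-frame similarity scaled by 1/tau.

    Assumed form of p(h_t | s_i); the paper may normalize differently.
    """
    logits = [dot(token, h) / tau for h in frames]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def structural_token_loss(tokens, frames, segments, tau=0.07):
    """Mean negative log-probability of each token's own frames.

    tokens:   list of structural token representations s_i
    frames:   list of frame embeddings h_t
    segments: segments[i] lists the frame indices assigned to token i
    """
    losses = []
    for token, frame_ids in zip(tokens, segments):
        probs = frame_probs(token, frames, tau)
        losses.extend(-math.log(probs[t]) for t in frame_ids)
    return sum(losses) / len(losses)


# Toy example: two tokens, four frames, 2-D embeddings (all values hypothetical).
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0]]
aligned = structural_token_loss(tokens, frames, [[0, 1], [2, 3]])
swapped = structural_token_loss(tokens, frames, [[2, 3], [0, 1]])
print(aligned < swapped)
```

In this toy setup the loss is lower when each token is paired with the frames it actually resembles, which is the behavior a grounding loss of this kind is meant to encourage.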

Abstract

Localizing user-queried events through natural language is crucial for video understanding models. Recent methods predominantly adapt Video LLMs to generate event boundary timestamps to handle temporal localization tasks, which struggle to leverage LLMs' powerful semantic understanding. In this work, we introduce MeCo, a novel timestamp-free framework that enables video LLMs to fully harness their intrinsic semantic capabilities for temporal localization tasks. Rather than outputting boundary timestamps, MeCo partitions videos into holistic event and transition segments based on the proposed structural token generation and grounding pipeline, derived from video LLMs' temporal structure understanding capability. We further propose a query-focused captioning task that compels the LLM to extract fine-grained, event-specific details, bridging the gap between localization and higher-level semantics and enhancing localization performance. Extensive experiments on diverse temporal localization tasks show that MeCo consistently outperforms boundary-centric methods, underscoring the benefits of a semantic-driven approach for temporal localization with video LLMs.
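
To make the timestamp-free idea concrete, the sketch below shows one way generated structural tokens could be turned into event and transition segments. It assumes that each frame is assigned to the structural token whose representation it is most similar to (by dot product) and that consecutive frames with the same assignment are merged into one segment. This argmax-and-merge rule, the function name `ground_tokens`, and the toy embeddings are illustrative assumptions; the paper's actual grounding procedure may differ.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def ground_tokens(tokens, frames):
    """Partition frames into segments, one run per structural token.

    tokens: list of (kind, embedding) pairs, kind being "<ent>" or "<tst>"
    frames: list of frame embeddings in temporal order
    Returns a list of (kind, first_frame, last_frame) tuples.
    """
    # Assumed rule: each frame goes to its most similar structural token.
    labels = [
        max(range(len(tokens)), key=lambda i: dot(tokens[i][1], h))
        for h in frames
    ]
    # Merge consecutive frames that share the same token into one segment.
    runs = []
    for t, i in enumerate(labels):
        if runs and runs[-1][0] == i:
            runs[-1][2] = t
        else:
            runs.append([i, t, t])
    return [(tokens[i][0], start, end) for i, start, end in runs]


# Toy example: a transition, an event, and another transition (hypothetical values).
tokens = [
    ("<tst>", [1.0, 0.0, 0.0]),
    ("<ent>", [0.0, 1.0, 0.0]),
    ("<tst>", [0.0, 0.0, 1.0]),
]
frames = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.1, 0.9, 0.0],
    [0.2, 0.7, 0.1],
    [0.0, 0.1, 0.9],
]
print(ground_tokens(tokens, frames))
# → [('<tst>', 0, 1), ('<ent>', 2, 3), ('<tst>', 4, 4)]
```

Frame indices can be mapped back to seconds with the sampling rate, so the model itself never has to emit a numeric timestamp.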

Paper Structure

This paper contains 13 sections, 5 equations, 6 figures, 7 tables, 1 algorithm.

Figures (6)

  • Figure 1: As opposed to previous semantic-poor boundary-centric approaches [ren2024timechat, huang2025lita, guo2024vtg, liu2024bench, huang2024vtimellm, guo2024trace], MeCo leverages video LLMs to capture the temporal structure and segment the video into transition and event segments. We also tune the model to perform query-focused captioning to scrutinize the detailed event semantics for more precise localization.
  • Figure 2: An overview of the proposed MeCo framework. Given an input video and a localization-aware user prompt, MeCo generates structural tokens, including the event token <ent> and the transition token <tst>, to facilitate holistic temporal segmentation via structural token grounding. MeCo also generates query-focused captions, right before generating the <ent> token, to retrieve the semantic details in the queried segments for improving structural token-based localization performance.
  • Figure 3: Visualizations of MeCo's temporal localization results.
  • Figure 4: Query-focused captioning pipeline and examples.
  • Figure 5: Evaluation prompt templates.
  • ...and 1 more figure