Measure Twice, Cut Once: Grasping Video Structures and Event Semantics with LLMs for Video Temporal Localization

Zongshang Pang; Mayu Otani; Yuta Nakashima

TL;DR

MeCo tackles video temporal localization by eliminating boundary timestamp generation and instead leveraging the video LLM’s semantic capacity through structural tokens and query-focused captioning. It introduces two tokens, <ent> and <tst>, to capture holistic event and transition structure, and grounds these tokens with a contrastive loss $\mathcal{L}_{\text{ST}}$ that aligns frame embeddings with token representations via a temperature-scaled probability $p(\mathbf{h}_t \mid \mathbf{s}_i)$. To enrich event semantics, MeCo adds a query-focused captioning task that generates detailed captions for queried segments before emitting the corresponding <ent> token, optimizing the joint objective $\mathcal{L}_{\text{ST}} + \mathcal{L}_{\text{LM}}(\mathbf{X}_{\text{MeCo}})$. Across ET-Bench, Charades-STA, and QVHighlights, MeCo outperforms boundary-centric localization methods, highlighting the value of semantic-driven localization and structured representations in video LLMs.
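The grounding step summarized above can be illustrated with a minimal sketch. The summary does not give the exact form of $\mathcal{L}_{\text{ST}}$, so everything below is an assumption made for illustration: $p(\mathbf{h}_t \mid \mathbf{s}_i)$ is taken to be a softmax over frames of temperature-scaled cosine similarity, and the loss is taken to be the mean negative log-probability of the frames assigned to each structural token. The function names, the `segment_ids` labelling, and the default temperature of 0.07 are hypothetical and not taken from the paper.

```python
import numpy as np


def frame_given_token_probs(frames, tokens, temperature=0.07):
    """Assumed form of p(h_t | s_i): softmax over frames of scaled cosine similarity.

    frames: (T, D) frame embeddings h_t
    tokens: (N, D) structural token representations s_i
    returns: (N, T) array whose rows each sum to 1
    """
    f = frames / np.linalg.norm(frames, axis=1, keepdims=True)
    s = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    logits = (s @ f.T) / temperature
    # Subtract the row maximum for numerical stability before exponentiating.
    logits = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)


def structural_token_loss(frames, tokens, segment_ids, temperature=0.07):
    """Assumed contrastive loss: mean negative log-probability of each token's frames.

    segment_ids: (T,) integer array; segment_ids[t] == i means frame t
    belongs to the segment represented by token i (hypothetical labelling).
    """
    segment_ids = np.asarray(segment_ids)
    probs = frame_given_token_probs(frames, tokens, temperature)
    per_token = []
    for i in range(tokens.shape[0]):
        mask = segment_ids == i
        if not mask.any():
            raise ValueError(f"token {i} has no assigned frames")
        per_token.append(-np.log(probs[i, mask] + 1e-12).mean())
    return float(np.mean(per_token))
```

Under these assumptions, the loss is small when each token embedding points toward the frames of its own segment and large when it points toward frames of other segments, which is the alignment behaviour the summary describes.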

Abstract

Localizing user-queried events through natural language is crucial for video understanding models. Recent methods predominantly adapt video LLMs to generate event boundary timestamps for temporal localization tasks, an approach that struggles to leverage LLMs' powerful semantic understanding. In this work, we introduce MeCo, a novel timestamp-free framework that enables video LLMs to fully harness their intrinsic semantic capabilities for temporal localization tasks. Rather than outputting boundary timestamps, MeCo partitions videos into holistic event and transition segments using the proposed structural token generation and grounding pipeline, which builds on video LLMs' capability to understand temporal structure. We further propose a query-focused captioning task that compels the LLM to extract fine-grained, event-specific details, bridging the gap between localization and higher-level semantics and enhancing localization performance. Extensive experiments on diverse temporal localization tasks show that MeCo consistently outperforms boundary-centric methods, underscoring the benefits of a semantic-driven approach for temporal localization with video LLMs.
