E.M.Ground: A Temporal Grounding Vid-LLM with Holistic Event Perception and Matching

Jiahao Nie, Wenbin An, Gongjie Zhang, Yicheng Xu, Yap-Peng Tan, Alex C. Kot, Shijian Lu

TL;DR

E.M.Ground tackles Temporal Video Grounding by replacing boundary-focused token matching with holistic event perception. It introduces a unified $<\!\!evt\!\!>$ token to aggregate all frames within the ground-truth event, augments representations with multi-grained visual features, and refines predictions via Savitzky-Golay smoothing. These design choices address semantic continuity, noise in token-frame similarities, and information loss from compression, yielding state-of-the-art results on Charades-STA and E.T.Bench with a relatively compact Phi-3 Mini-3.8B backbone. The approach demonstrates strong generalization across TVG, DVC, and VHD tasks, offering a practically efficient and robust solution for precise temporal localization in Vid-LLMs.

Abstract

Despite recent advances in Video Large Language Models (Vid-LLMs), Temporal Video Grounding (TVG), which aims to precisely localize time segments corresponding to query events, remains a significant challenge. Existing methods often match start and end frames by comparing frame features with two separate tokens, relying heavily on exact timestamps. However, this approach fails to capture the event's semantic continuity and integrity, leading to ambiguities. To address this, we propose E.M.Ground, a novel Vid-LLM for TVG that focuses on holistic and coherent event perception. E.M.Ground introduces three key innovations: (i) a special <event> token that aggregates information from all frames of a query event, preserving semantic continuity for accurate event matching; (ii) Savitzky-Golay smoothing to reduce noise in token-to-frame similarities across timestamps, improving prediction accuracy; (iii) multi-grained frame feature aggregation to enhance matching reliability and temporal understanding, compensating for compression-induced information loss. Extensive experiments on benchmark datasets show that E.M.Ground consistently outperforms state-of-the-art Vid-LLMs by significant margins.
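
To make innovation (ii) concrete, the following is a minimal sketch of Savitzky-Golay smoothing applied to a sequence of per-frame similarity scores. The window length (5 frames), the quadratic fit, the edge padding, the 0.5 threshold, and the helper names `savgol5` and `longest_span_above` are all illustrative assumptions; the abstract does not specify E.M.Ground's filter settings or how a segment is decoded from the smoothed scores.

```python
# Minimal sketch: Savitzky-Golay smoothing of per-frame similarity scores.
# Assumptions (not taken from the paper): a 5-frame window, a quadratic fit,
# edge padding by repeating boundary values, and a fixed 0.5 threshold.

# Classic 5-point quadratic/cubic Savitzky-Golay smoothing coefficients.
SG5 = [-3 / 35, 12 / 35, 17 / 35, 12 / 35, -3 / 35]


def savgol5(scores):
    """Smooth a list of floats with the 5-point Savitzky-Golay kernel."""
    if not scores:
        return []
    half = len(SG5) // 2
    # Repeat the first and last values so every frame has a full window.
    padded = [scores[0]] * half + list(scores) + [scores[-1]] * half
    return [
        sum(c * padded[i + k] for k, c in enumerate(SG5))
        for i in range(len(scores))
    ]


def longest_span_above(scores, threshold=0.5):
    """Return inclusive (start, end) indices of the longest run above threshold."""
    best, start = None, None
    # A sentinel below any threshold closes a run that reaches the last frame.
    for i, s in enumerate(list(scores) + [float("-inf")]):
        if s > threshold and start is None:
            start = i
        elif s <= threshold and start is not None:
            if best is None or (i - start) > (best[1] - best[0] + 1):
                best = (start, i - 1)
            start = None
    return best


# Hypothetical similarities between the event token and ten frames.
raw = [0.1, 0.2, 0.1, 0.8, 0.3, 0.9, 0.8, 0.9, 0.2, 0.1]
smooth = savgol5(raw)
print(longest_span_above(raw), longest_span_above(smooth))
```

In this toy input, the raw scores dip at frame 4 and split the event into two runs, whereas the smoothed scores form a single contiguous run. This is the kind of noise in token-to-frame similarities that the smoothing step is meant to reduce.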

Paper Structure (27 sections, 12 equations, 5 figures, 9 tables)

Figures (5)

  • Figure 1: The state-of-the-art method E.T.Chat [liu2024bench] matches the start and end frames with two separate tokens, thereby neglecting the intermediate frames. Consequently, it performs worse on longer videos, whose intermediate frames carry richer information. In contrast, our proposed E.M.Ground models the holistic query event with a special token $<\!\!evt\!\!>$, which preserves the semantic continuity and integrity of the event and thus achieves better and more stable performance on videos of varying lengths.
  • Figure 2: Overall architecture of our proposed E.M.Ground. Specifically, E.M.Ground perceives the query event from a holistic and coherent perspective. It introduces a special token $<\!\!evt\!\!>$ to aggregate all frames within the ground-truth time spans, leverages multi-grained visual features to compensate for information loss, and refines the predictions with smoothing operations. Based on these designs, E.M.Ground effectively captures the semantic continuity and integrity of the query event.
  • Figure 3: Qualitative comparison with E.T.Chat [liu2024bench]. The timestamp matching mechanism in E.T.Chat encounters several failure cases. (a) No Overlap: the predicted segment has no overlap with the ground truth. (b) Completely Contained: the ground-truth query boundary completely contains the predicted segment, and the prediction omits the start and end phases of the query event. (c) Partial Overlap: the prediction is generally inaccurate in temporal localization. In contrast, our proposed E.M.Ground effectively mitigates all of these errors, providing more accurate temporal grounding.
  • Figure 4: Error analysis of Temporal Video Grounding. Left: number of errors of each type made by E.T.Chat [liu2024bench] and E.M.Ground. Right: mIoU corresponding to each error type. N.O. denotes predictions that have no overlap with the ground truths; C.C. denotes cases where the ground truths are completely contained within the predictions; P.O. denotes predictions that partially overlap with the ground truths.
  • Figure 5: The $<\!\!evt\!\!>$ token and individual video frames exhibit distinct attention patterns across different layers of the LLM. Each block represents the attention corresponding to a specific frame.
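
The error categories used in Figures 3 and 4 can be illustrated with a short sketch that computes temporal IoU and assigns a prediction to N.O., C.C., or P.O. The function names `temporal_iou` and `error_type` are hypothetical, and treating containment in either direction as C.C. is an assumption made here for simplicity. The sketch also labels an exact match as C.C., so an actual error analysis would presumably first filter out predictions that count as correct.

```python
# Minimal sketch of temporal IoU and the three error categories.
# Segments are (start, end) pairs in seconds with start < end.
# Assumption: containment in either direction counts as "C.C.".


def temporal_iou(pred, gt):
    """Intersection-over-union of two temporal segments."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def error_type(pred, gt):
    """Classify a prediction as 'N.O.', 'C.C.', or 'P.O.' relative to gt."""
    inter = min(pred[1], gt[1]) - max(pred[0], gt[0])
    if inter <= 0:
        return "N.O."  # no overlap (segments that only touch count as disjoint)
    pred_in_gt = gt[0] <= pred[0] and pred[1] <= gt[1]
    gt_in_pred = pred[0] <= gt[0] and gt[1] <= pred[1]
    if pred_in_gt or gt_in_pred:
        return "C.C."  # one segment completely contains the other
    return "P.O."  # segments overlap only partially


# Hypothetical predictions against a ground-truth event at 5 s to 9 s.
gt = (5.0, 9.0)
for pred in [(0.0, 4.0), (6.0, 8.0), (0.0, 7.0)]:
    print(pred, error_type(pred, gt), round(temporal_iou(pred, gt), 3))
```

Averaging `temporal_iou` over the predictions in each category would give a per-type mIoU of the kind reported in the right panel of Figure 4.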