ViKey: Enhancing Temporal Understanding in Videos via Visual Prompting

Yeonkyung Lee, Dayun Ju, Youngmin Kim, Seil Kang, Seong Jae Hwang

Abstract

Recent advancements in Video Large Language Models (VideoLLMs) have enabled strong performance across diverse multimodal video tasks. To reduce the high computational cost of processing dense video frames, efficiency-oriented methods such as frame selection have been widely adopted. While effective at minimizing redundancy, these methods often cause notable performance drops on tasks requiring temporal reasoning. Unlike humans, who can infer event progression from sparse visual cues, VideoLLMs frequently misinterpret temporal relations when intermediate frames are omitted. To address this limitation, we explore visual prompting (VP) as a lightweight yet effective way to enhance temporal understanding in VideoLLMs. Our analysis reveals that simply annotating each frame with explicit ordinal information helps the model perceive temporal continuity. This visual cue also supports frame-level referencing and mitigates positional ambiguity within a sparsely sampled sequence. Building on these insights, we introduce ViKey, a training-free framework that combines VP with a lightweight Keyword-Frame Mapping (KFM) module. KFM leverages frame indices as dictionary-like keys to link textual cues to the most relevant frames, providing explicit temporal anchors during inference. Despite its simplicity, our approach substantially improves temporal reasoning and, on some datasets, preserves dense-frame baseline performance with as few as 20% of frames.
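As a rough illustration of the sparse-frame setting described above, the following sketch keeps about 20% of a video's frames and assigns each kept frame an ordinal label. Uniform spacing, the function names, and the zero-padded "#01" label format (borrowed from the wording of the Figure 5 caption) are assumptions made for illustration; the abstract does not specify how frames are selected or how the labels are rendered. Drawing the label onto the image itself would need an image library and is left out.

```python
def sample_frame_indices(num_frames: int, keep_ratio: float = 0.2) -> list[int]:
    """Pick evenly spaced source-frame indices (uniform spacing is an assumption)."""
    if num_frames <= 0:
        return []
    # Keep at least one frame and never more than the video contains.
    keep = min(num_frames, max(1, round(num_frames * keep_ratio)))
    if keep == 1:
        return [0]
    # Integer arithmetic spreads the picks from the first to the last frame.
    return [(i * (num_frames - 1)) // (keep - 1) for i in range(keep)]


def ordinal_labels(count: int) -> list[str]:
    """Build 1-based, zero-padded labels such as '#01' for the kept frames."""
    width = max(2, len(str(count)))
    return [f"#{i:0{width}d}" for i in range(1, count + 1)]


if __name__ == "__main__":
    kept = sample_frame_indices(100, keep_ratio=0.2)
    for label, source_index in zip(ordinal_labels(len(kept)), kept):
        print(label, "<- source frame", source_index)
```

Note that the labels number the kept frames consecutively rather than reusing their source positions, so the ordinal cue stays contiguous even though intermediate frames are missing.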

Paper Structure (67 sections, 16 equations, 14 figures, 14 tables, 1 algorithm)

Figures (14)

  • Figure 1: Illustrations of How Frame Selection Disrupts and Visual Prompting Restores Temporal Reasoning. Frame selection improves efficiency but breaks temporal continuity, causing VideoLLMs to misinterpret transitions. With frame-index visual prompts, the model regains temporal order and correctly identifies causal relations under sparse-frame conditions.
  • Figure 2: Probing VP for temporal understanding. (a) Positional Embedding Degradation: remove temporal order only, or collapse both temporal and spatial positions. (b) Frame-level Referencing in VideoLLMs: frame-index prompts enable lookup and reverse-lookup between indices and content. (c) Attention Analysis: VP increases attention to image tokens across layers.
  • Figure 3: Effect of VP under Position Degradation. Across both baselines, frame-index VP consistently improves accuracy under temporal-only and full-collapse settings, indicating added temporal cues and robustness to degraded positional signals.
  • Figure 4: Overall pipeline of ViKey. Given a sequence of video frames, we first apply visual prompting by overlaying frame-number VPs on each frame. For a user query, the system extracts key textual concepts and performs Keyword–Frame Mapping (KFM) to identify which frames best correspond to each keyword. The query is then rewritten to explicitly include the mapped frame indices, aligning textual cues with the numbered frames. Feeding this aligned query–frame pair into the VLM enables more accurate temporal understanding.
  • Figure 5: Qualitative Results on Open-Ended Temporal Questions. Compared with the baseline LLaVA-Video-7B, ViKey uses explicit frame indices and keyword–frame mapping to better capture temporal order and thus produce more accurate answers to open-ended temporal queries. In particular, it often explicitly mentions the injected frame indices (e.g., "frame #01", "frame #02") in its free-form responses, indicating that its temporal reasoning is grounded in the visual prompts.
  • ...and 9 more figures
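
The Keyword–Frame Mapping and query-rewriting steps described in the Figure 4 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the keyword-to-frame relevance scores are taken as a given input (how they are computed is not described here), the function names are hypothetical, the rewrite template that appends "(frame #NN)" after a keyword is a guess at one reasonable format, and the scores in the usage example are made up.

```python
def map_keywords_to_frames(
    scores: dict[str, list[float]], top_k: int = 1
) -> dict[str, list[int]]:
    """For each keyword, return the 1-based indices of its top_k best frames.

    `scores[keyword][i]` is an assumed relevance score between the keyword
    and the i-th sampled frame; higher means more relevant.
    """
    mapping: dict[str, list[int]] = {}
    for keyword, frame_scores in scores.items():
        ranked = sorted(
            range(len(frame_scores)),
            key=lambda i: frame_scores[i],
            reverse=True,
        )
        # Report the selected frames in temporal order, using 1-based indices
        # so they match the numbers overlaid on the frames.
        mapping[keyword] = sorted(i + 1 for i in ranked[:top_k])
    return mapping


def rewrite_query(query: str, mapping: dict[str, list[int]]) -> str:
    """Insert frame references after the first occurrence of each keyword."""
    rewritten = query
    for keyword, frames in mapping.items():
        if not frames or keyword not in rewritten:
            continue
        refs = ", ".join(f"frame #{i:02d}" for i in frames)
        rewritten = rewritten.replace(keyword, f"{keyword} ({refs})", 1)
    return rewritten


if __name__ == "__main__":
    # Illustrative scores for four sampled frames (made-up values).
    scores = {
        "opens the door": [0.1, 0.8, 0.2, 0.1],
        "sits down": [0.0, 0.1, 0.3, 0.9],
    }
    query = "What does the person do after he opens the door and before he sits down?"
    print(rewrite_query(query, map_keywords_to_frames(scores)))
```

Because the mapping is an ordinary dictionary keyed by keyword, the rewritten query carries the same frame numbers that are overlaid on the frames, which is the alignment between textual cues and numbered frames that the caption describes.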