EventSTU: Event-Guided Efficient Spatio-Temporal Understanding for Video Large Language Models

Wenhao Xu, Xin Dong, Yue Li, Haoyuan Shi, Zhiwei Xiong

TL;DR

This work tackles the high computational cost of Video-LLMs caused by long token sequences in extended videos. It introduces EventSTU, a training-free, event-guided framework that jointly reduces temporal redundancy via Coarse-to-Fine Sampling (C2FS) and compresses spatial tokens via Zero-cost Adaptive Pruning (ZAP), leveraging event density and attention-based saliency, and extends to simulated events for general video understanding. A new EventBench dataset with real event streams and human annotations enables robust evaluation of event-assisted reasoning. Empirical results show substantial efficiency gains (e.g., $3.01\times$ FLOPs reduction and $3.10\times$ prefilling speedup) while improving accuracy, demonstrating practical potential for real- and simulated-event video understanding without model training.

Abstract

Video large language models have demonstrated strong video understanding capabilities but suffer from high inference costs due to the massive number of tokens in long videos. Inspired by event-based vision, we propose an event-guided, training-free framework for efficient spatio-temporal understanding, named EventSTU. In the temporal domain, we design a coarse-to-fine keyframe sampling algorithm that exploits the change-triggered property of event cameras to eliminate redundant frames. In the spatial domain, we design an adaptive token pruning algorithm that leverages the visual saliency of events as a zero-cost prior to guide spatial reduction. From a holistic spatio-temporal perspective, we further integrate question relevance from keyframe sampling to adaptively allocate token pruning budgets. To facilitate evaluation, we construct EventBench, the first event-inclusive, human-annotated multimodal benchmark that covers diverse real-world scenarios. Beyond physical event cameras, EventSTU also supports general video understanding using simulated events. Comprehensive experiments show that EventSTU achieves 3.01x FLOPs reduction and 3.10x prefilling speedup over the strongest baseline while still improving performance.
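The coarse-to-fine keyframe sampling described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `coarse_to_fine_sample`, the `density_ratio` threshold rule, and the precomputed per-frame `relevance` scores are all assumptions made for illustration. The coarse stage drops frames that trigger few events, and the fine stage keeps the frames most relevant to the question.

```python
from typing import List, Sequence


def coarse_to_fine_sample(
    event_counts: Sequence[float],
    relevance: Sequence[float],
    density_ratio: float = 0.5,   # hypothetical: fraction of the mean event count
    num_keyframes: int = 4,       # hypothetical: final keyframe budget
) -> List[int]:
    """Return indices of selected keyframes in temporal order."""
    if len(event_counts) != len(relevance):
        raise ValueError("event_counts and relevance must have the same length")
    if not event_counts:
        return []

    # Coarse stage: frames with few events are treated as temporally redundant.
    mean_count = sum(event_counts) / len(event_counts)
    threshold = density_ratio * mean_count
    candidates = [i for i, c in enumerate(event_counts) if c >= threshold]

    # Fine stage: rank surviving frames by (assumed) question relevance.
    ranked = sorted(candidates, key=lambda i: relevance[i], reverse=True)

    # Restore temporal order for the selected frames.
    return sorted(ranked[:num_keyframes])


event_counts = [0, 120, 5, 300, 0, 80]
relevance = [0.9, 0.2, 0.8, 0.7, 0.1, 0.6]
selected = coarse_to_fine_sample(event_counts, relevance, num_keyframes=2)
print(selected)
```

In this toy input, frames 0 and 2 have high relevance scores but almost no events, so the coarse stage removes them before relevance is considered. How the paper actually measures event density and question relevance is not specified in this summary.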

Paper Structure

This paper contains 16 sections, 4 equations, 7 figures, 3 tables, 1 algorithm.

Figures (7)

  • Figure 1: Top: A sunburst chart and a toy example of the proposed EventBench. Bottom: Statistics of EventBench, including the distributions of video lengths and capture platforms.
  • Figure 2: Overview of EventSTU. It processes long videos through a sequential, multi-stage pipeline. First, coarse-to-fine sampling efficiently filters redundant frames based on event density and retrieves question-relevant keyframes. Subsequently, physics-aware pruning selects tokens with high event saliency, while semantic-aware pruning further distills them to the most semantically crucial tokens using attention scores. This entire process culminates in a compact yet semantically dense visual representation, tailored for LLM inference.
  • Figure 3: Visual saliency of event data. Left: The large, uninformative sky and roads trigger few events. Right: Running athletes that attract more user interest trigger significantly more events.
  • Figure 4: Visualization results on EventBench. "LLaVA-OV" represents the original model without our method. It uses uniform sampling and misses a keyframe. In contrast, our method captures all keyframes and prunes uninformative areas, highlighting the speed limit signs.
  • Figure 5: Latency Analysis. "Other" indicates token pre-processing time. Our ZAP achieves a 66.3% reduction in Time-To-First-Token (TTFT) compared to the original model, outperforming all other token pruning methods.
  • ...and 2 more figures
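
The two pruning stages named in the Figure 2 caption (physics-aware, then semantic-aware) can be sketched as follows. This is a hedged illustration under stated assumptions, not the paper's ZAP algorithm: the function name `two_stage_prune`, the fixed keep ratios, and the per-token `event_saliency` and `attention` scores are hypothetical inputs chosen for the example.

```python
import math
from typing import List, Sequence


def top_k(indices: Sequence[int], scores: Sequence[float], k: int) -> List[int]:
    """Return the k indices with the highest scores."""
    return sorted(indices, key=lambda i: scores[i], reverse=True)[:k]


def two_stage_prune(
    event_saliency: Sequence[float],
    attention: Sequence[float],
    keep_physics: float = 0.5,    # hypothetical keep ratio for stage 1
    keep_semantic: float = 0.5,   # hypothetical keep ratio for stage 2
) -> List[int]:
    """Return indices of retained visual tokens in their original order."""
    n = len(event_saliency)
    if n != len(attention):
        raise ValueError("event_saliency and attention must have the same length")
    if n == 0:
        return []

    # Stage 1 (physics-aware): keep tokens where many events were triggered.
    k1 = max(1, math.ceil(n * keep_physics))
    stage1 = top_k(range(n), event_saliency, k1)

    # Stage 2 (semantic-aware): among survivors, keep the most attended tokens.
    k2 = max(1, math.ceil(k1 * keep_semantic))
    stage2 = top_k(stage1, attention, k2)

    return sorted(stage2)


event_saliency = [0.1, 0.9, 0.8, 0.0, 0.7, 0.2]
attention = [0.9, 0.3, 0.6, 0.8, 0.5, 0.1]
kept = two_stage_prune(event_saliency, attention)
print(kept)
```

The sketch uses fixed ratios for simplicity. The abstract states that the method instead allocates pruning budgets adaptively using question relevance from keyframe sampling, and that allocation rule is not reproduced here.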