E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding

Ye Liu, Zongyang Ma, Zhongang Qi, Yang Wu, Ying Shan, Chang Wen Chen

TL;DR

E.T. Bench introduces a large-scale benchmark for open-ended event-level and time-sensitive video understanding, addressing gaps in existing benchmarks that emphasize short clips or video-level QA. The authors define a 3-level task taxonomy spanning referring, grounding, dense captioning, and complex understanding, and collect 7,289 samples (7,002 videos) from 15 datasets across 8 domains, with a rigorous annotation pipeline. To tackle the observed weaknesses of current models in handling multi-event time information, they propose E.T. Chat, an embedding-matching based timestamp predictor, and E.T. Instruct 164K, a multi-event instruction-tuning dataset. Experimental results show that open-source Image-/Video-LLMs lag behind specialized time-sensitive models, while E.T. Chat achieves strong open-source performance and competes with commercial MLLMs, underscoring the benchmark’s value for driving improvements in fine-grained, time-aware video-language understanding.

Abstract

Recent advances in Video Large Language Models (Video-LLMs) have demonstrated their great potential in general-purpose video understanding. To verify the significance of these models, a number of benchmarks have been proposed to diagnose their capabilities in different scenarios. However, existing benchmarks merely evaluate models through video-level question-answering, lacking fine-grained event-level assessment and task diversity. To fill this gap, we introduce E.T. Bench (Event-Level & Time-Sensitive Video Understanding Benchmark), a large-scale and high-quality benchmark for open-ended event-level video understanding. Categorized within a 3-level task taxonomy, E.T. Bench encompasses 7.3K samples under 12 tasks with 7K videos (251.4h total length) under 8 domains, providing comprehensive evaluations. We extensively evaluated 8 Image-LLMs and 12 Video-LLMs on our benchmark, and the results reveal that state-of-the-art models for coarse-level (video-level) understanding struggle to solve our fine-grained tasks, e.g., grounding events of interest within videos, largely due to the short video context length, improper time representations, and lack of multi-event training data. Focusing on these issues, we further propose a strong baseline model, E.T. Chat, together with an instruction-tuning dataset, E.T. Instruct 164K, tailored for fine-grained event-level understanding. Our simple but effective solution demonstrates superior performance in multiple scenarios.

Paper Structure (32 sections, 4 equations, 15 figures, 25 tables)

Figures (15)

  • Figure 1: Task definitions in E.T. Bench. The 12 tasks derive from 4 essential capabilities for time-sensitive video understanding: referring, grounding, dense captioning, and complex understanding.
  • Figure 2: Left: Task taxonomy and sample distribution. Right: Generation pipeline for E.T. Bench. We conduct a thorough process of pre-filtering, annotation repurposing, instruction writing, manual check, and sampling to obtain high-quality fine-grained annotations. Details are discussed in Section \ref{sec:benchmark}.
  • Figure 3: Left: Word cloud of text queries shows a considerable degree of diversity. Right: Distribution of averaged video durations (in seconds) across 12 tasks.
  • Figure 4: Overall architecture of E.T. Chat. We reformulate timestamp prediction as an embedding matching problem. See Section \ref{sec:model} for details.
  • Figure 5: Detailed illustration of the frame compressor. It accepts video patch embeddings $\mathbf{P}_t$ and the text prompt $\mathbf{T}$ as inputs, and compresses the video frame features into a single token.
  • ...and 10 more figures
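
The Figure 4 caption describes timestamp prediction as an embedding matching problem. The following is a minimal sketch of that general idea, not the authors' implementation. It assumes that each sampled frame has already been reduced to one embedding vector, that the model emits a single query embedding for the timestamp to be predicted, and that cosine similarity is the matching score. The names `cosine`, `match_timestamp`, and `fps` are hypothetical and chosen only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_timestamp(query, frame_embeddings, fps):
    """Return (timestamp_in_seconds, scores) for the frame that best matches `query`.

    Assumption: frame i was sampled at time i / fps, so the index of the
    best-matching frame converts directly into a timestamp.
    """
    if not frame_embeddings:
        raise ValueError("frame_embeddings must not be empty")
    scores = [cosine(query, frame) for frame in frame_embeddings]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index / fps, scores


# Toy example: three frames with 2-D embeddings, sampled at 1 frame per second.
frames = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
query = [0.1, 0.9]
timestamp, scores = match_timestamp(query, frames, fps=1.0)
print(timestamp)
```

In this toy case the query is closest to the third frame, so the predicted timestamp is the time at which that frame was sampled. The point of the design is that the timestamp is read off from the index of the matched frame rather than generated as numeric text.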