HourVideo: 1-Hour Video-Language Understanding

Keshigeyan Chandrasegaran, Agrim Gupta, Lea M. Hadzic, Taran Kota, Jimming He, Cristóbal Eyzaguirre, Zane Durante, Manling Li, Jiajun Wu, Li Fei-Fei

Abstract

We present HourVideo, a benchmark dataset for hour-long video-language understanding. Our dataset consists of a novel task suite comprising summarization, perception (recall, tracking), visual reasoning (spatial, temporal, predictive, causal, counterfactual), and navigation (room-to-room, object retrieval) tasks. HourVideo includes 500 manually curated egocentric videos from the Ego4D dataset, spanning durations of 20 to 120 minutes, and features 12,976 high-quality, five-way multiple-choice questions. Benchmarking results reveal that multimodal models, including GPT-4 and LLaVA-NeXT, achieve marginal improvements over random chance. In stark contrast, human experts significantly outperform the state-of-the-art long-context multimodal model, Gemini Pro 1.5 (85.0% vs. 37.3%), highlighting a substantial gap in multimodal capabilities. Our benchmark, evaluation toolkit, prompts, and documentation are available at https://hourvideo.stanford.edu

Paper Structure

This paper contains 24 sections, 13 figures, 7 tables.

Figures (13)

  • Figure 1: Example MCQs from HourVideo for different tasks. The correct answers are underlined.
  • Figure 2: Our dataset generation pipeline. We develop a five-stage pipeline to create HourVideo. In total, over 800 hours of human effort went into Video Curation (Stage 1), MCQ Refinement using Human Feedback (Stage 3), and Expert MCQ Refinement (Stage 5). We use LLMs for MCQ Generation (Stage 2), MCQ Refinement using Human Feedback (Stage 3), and Blind Filtering (Stage 4). We note that causal, counterfactual, and navigation questions are manually generated by human experts (see the dataset generation pipeline section for details).
  • Figure 3: Dataset Statistics. ①: HourVideo includes 500 videos sourced from the Ego4D dataset, spanning 77 everyday scenarios. The bar chart shows the top 20 scenarios. ②: We report the number of MCQs per task/sub-task. In total, there are 12,976 questions in HourVideo. ③: We show the distribution of video duration in HourVideo. The average duration of videos in HourVideo is 45.7 minutes, with 113 videos extending beyond one hour. ④: We show the distribution of number of MCQs per video. On average, each video contains 26 MCQs.
  • Figure 4: Comparison between different multimodal foundation models on HourVideo across different tasks/sub-tasks. We include human expert performance for summarization (83.3%), perception (82.3%), visual reasoning (83.3%) and navigation (86.7%) tasks. As one can observe, current multimodal models significantly lack long-form video-language understanding capabilities.
  • Figure B.1: This plot shows visual elements coverage vs. total number of narration tokens. We use the object collections in ImageNet-21K, VisualGenome, Tencent1M, and Places365 to quantify visual coverage, and the Tiktoken library to count tokens. This experiment was performed on the Ego4D dataset (Grauman et al., 2022).
  • ...and 8 more figures
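
The headline numbers quoted in the abstract and in the Figure 3 caption can be cross-checked with a few lines of arithmetic. The sketch below is not part of the HourVideo toolkit; the variable and function names are illustrative, and the random-chance baseline assumes uniform guessing over the five answer options.

```python
# Values quoted in the abstract and the Figure 3 caption.
NUM_VIDEOS = 500
NUM_MCQS = 12_976
NUM_CHOICES = 5        # five-way multiple-choice questions
HUMAN_ACC = 85.0       # human expert accuracy (%)
GEMINI_ACC = 37.3      # Gemini Pro 1.5 accuracy (%)


def random_chance(num_choices: int) -> float:
    """Expected accuracy (%) under uniform random guessing (an assumption)."""
    return 100.0 / num_choices


# Average number of questions per video; the caption rounds this to 26.
mcqs_per_video = NUM_MCQS / NUM_VIDEOS

# Baseline that the benchmarked models are compared against.
chance = random_chance(NUM_CHOICES)

# Gap between human experts and the best model, in percentage points.
human_model_gap = round(HUMAN_ACC - GEMINI_ACC, 1)

# Margin of the best model over the random-chance baseline.
model_over_chance = round(GEMINI_ACC - chance, 1)

print(f"MCQs per video:        {mcqs_per_video:.2f}")
print(f"Random chance:         {chance:.1f}%")
print(f"Human vs. model gap:   {human_model_gap} points")
print(f"Model over chance:     {model_over_chance} points")
```

Under these assumptions, the average of about 26 MCQs per video matches the caption, and the best model sits closer to the 20% chance baseline than to human expert accuracy.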