Long Story Short: Story-level Video Understanding from 20K Short Films

Ridouane Ghermi, Xi Wang, Vicky Kalogeiton, Ivan Laptev

TL;DR

This paper introduces Short-Films 20K (SF20K), a corpus of 20,143 amateur short films that the authors describe as the largest publicly available movie dataset, built for long-term, story-level video understanding. The dataset supports two story-level question-answering tasks, multiple-choice (MCQA) and open-ended (OEQA), with questions generated by LLMs and curated to ensure test-set quality, and it shows lower data leakage than existing movie datasets. The authors show that current vision-language models lag behind humans on these tasks and that longer temporal context is essential for answering the questions. They also show that instruction tuning on SF20K-Train improves model performance, and they position SF20K as a benchmark for advancing long-form video understanding.

Abstract

Recent developments in vision-language models have significantly advanced video understanding. Existing datasets and tasks, however, have notable limitations. Most datasets are confined to short videos with limited events and narrow narratives. For example, datasets with instructional and egocentric videos often depict activities of one person in a single scene. Although existing movie datasets offer richer content, they are often limited to short-term tasks, lack publicly available videos, and frequently encounter data leakage issues given the use of subtitles and other information about commercial movies during LLM pretraining. To address the above limitations, we propose Short-Films 20K (SF20K), the largest publicly available movie dataset. SF20K is composed of 20,143 amateur films and offers long-term video tasks in the form of multiple-choice and open-ended question answering. Our extensive analysis of SF20K reveals minimal data leakage, emphasizes the need for long-term reasoning, and demonstrates the strong performance of recent VLMs. Finally, we show that instruction tuning on the SF20K-Train set substantially improves model performance, paving the way for future progress in long-term video understanding.

Paper Structure (29 sections, 16 figures, 16 tables)

This paper contains 29 sections, 16 figures, 16 tables.

Figures (16)

  • Figure 1: Examples of Video Question Answering (VideoQA) tasks across three domains: instructional videos, egocentric videos, and movies. While instructional and egocentric videos usually depict one or two people performing a single task, movies present time-extended stories with rich variety in terms of scenes, characters, and interactions.
  • Figure 2: Comparison of SF20K-Test to other video QA benchmarks. The circle size indicates the number of QA pairs in each dataset.
  • Figure 3: Data leakage. Zero-shot accuracy comparison across different language models on three benchmark datasets (LVU, MovieQA, and SF20K-Test). When given only the movie title, higher zero-shot accuracy in question-answering by LLMs indicates greater data leakage. LLMs are ranked by MMLU.
  • Figure 4: Samples in the SF20K dataset. SF20K features diverse genres, characterized by distinct visual styles and storytelling. Each movie comes with a concise, high-level description known as a logline.
  • Figure 5: Statistics of SF20K.
  • ...and 11 more figures
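
The data-leakage probe described in the Figure 3 caption (zero-shot question answering when the model sees only the movie title) can be sketched roughly as below. This is a minimal illustration, not the authors' evaluation code: the `MCQAItem` schema, the `predict` callback standing in for an LLM call, and both function names are hypothetical. Title-only accuracy well above the chance baseline would suggest that the model has already seen information about the film.

```python
from dataclasses import dataclass


@dataclass
class MCQAItem:
    """One multiple-choice question (hypothetical schema, not SF20K's format)."""
    title: str
    question: str
    options: list[str]
    answer_index: int


def title_only_accuracy(items, predict):
    """Fraction of questions answered correctly from the title alone.

    `predict(title, question, options)` is a stand-in for an LLM call and
    must return the index of the chosen option. No video, subtitles, or
    synopsis are passed, so correct answers can only come from prior
    knowledge or guessing.
    """
    if not items:
        raise ValueError("items must not be empty")
    correct = sum(
        predict(it.title, it.question, it.options) == it.answer_index
        for it in items
    )
    return correct / len(items)


def chance_accuracy(items):
    """Expected accuracy of uniform random guessing over each item's options."""
    if not items:
        raise ValueError("items must not be empty")
    return sum(1 / len(it.options) for it in items) / len(items)
```

Comparing `title_only_accuracy` against `chance_accuracy` on the same items gives a simple leakage margin; the same comparison can be repeated per model and per benchmark.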