Neptune: The Long Orbit to Benchmarking Long Video Understanding

Arsha Nagrani, Mingda Zhang, Ramin Mehran, Rachel Hornung, Nitesh Bharadwaj Gundavarapu, Nilpa Jha, Austin Myers, Xingyi Zhou, Boqing Gong, Cordelia Schmid, Mikhail Sirotenko, Yukun Zhu, Tobias Weyand

TL;DR

Neptune introduces a scalable benchmark for long-video understanding that requires multimodal reasoning over extended timelines. It pairs a semi-automatic data-generation pipeline (video language models, large language models, and a lightweight human verification stage) with GEM, a new open-ended evaluation metric for scoring free-form answers. The dataset comprises 2,405 videos and 3,268 question-answer-decoy sets (QADs), including a challenging Neptune-MMH subset. Benchmark results show significant performance gaps between open-source and proprietary models, especially on temporal and counting tasks. By releasing Neptune and GEM, the work aims to spur the development of models capable of robust long-form video comprehension in real-world domains.

Abstract

We introduce Neptune, a benchmark for long video understanding that requires reasoning over long time horizons and across different modalities. Many existing video datasets and models are focused on short clips (10s-30s). While some long video datasets do exist, they can often be solved by powerful image models applied per frame (and often to very few frames) in a video, and are usually manually annotated at high cost. In order to mitigate both these problems, we propose a scalable dataset creation pipeline which leverages large models (VLMs and LLMs) to automatically generate dense, time-aligned video captions, as well as tough question-answer-decoy sets for video segments (up to 15 minutes in length). Our dataset Neptune covers a broad range of long video reasoning abilities and includes a subset that emphasizes multimodal reasoning. Since existing metrics for open-ended question answering are either rule-based or may rely on proprietary models, we provide a new open-source model-based metric, GEM, to score open-ended responses on Neptune. Benchmark evaluations reveal that most current open-source long video models perform poorly on Neptune, particularly on questions testing temporal ordering, counting and state changes. Through Neptune, we aim to spur the development of more advanced models capable of understanding long videos. The dataset is available at https://github.com/google-deepmind/neptune

Paper Structure

This paper contains 49 sections, 1 equation, 16 figures, 6 tables.

Figures (16)

  • Figure 1: Pipeline Overview: Our pipeline consists of 5 key stages - (i) Video selection, where suitable videos are identified from YouTube, (ii) Signal extraction, (iii) Video level captioning, (iv) Question, answer and decoy (QAD) generation and (v) Manual rater verification. The first four stages are entirely automatic. Before rater verification, we automatically filter out QADs that can be solved by an LLM without access to the video content.
  • Figure 1: Evaluation of open-ended metrics on the GEM answer equivalence dev set. FT: Fine-tuning
  • Figure 2: Examples from Neptune: We show examples that highlight key question types in our dataset, with 2 frames from each video. The correct answer is shown in green and decoys are shown in red. Best viewed zoomed in and in colour. Some decoys are summarised for brevity.
  • Figure 3: Neptune Statistics: We show the distribution of video lengths (top, left), the number of questions per question type (top, right), the distribution of question and answer lengths (bottom, left and middle) and the domains in Neptune (bottom, right). Note that more than 12% of the videos are longer than 5 minutes (305) and over 25% are longer than 3 minutes. An expanded plot of the video domains is provided in the appendix.
  • Figure 4: Performance of different models across question types on Neptune-Full (left) and Neptune vs. EgoSchema with different frame rates (right). On the right we show Gemini 1.5 Pro’s accuracy when linearly subsampling to 1, 16 or 150 frames. We note that (i) performance on the Neptune sets increases as more frames are provided, while on EgoSchema it saturates after 16 frames, and (ii) Neptune-MMH is more challenging than EgoSchema. Additional comparisons to other datasets are included in the appendix.
  • ...and 11 more figures
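
The Figure 4 caption mentions linearly subsampling a video to 1, 16 or 150 frames. A minimal sketch of one way to pick such frames is given below. It is an illustration only, not the authors' code: the function name `linear_subsample_indices`, the choice of the middle frame when a single frame is requested, and the rounding rule are all assumptions.

```python
def linear_subsample_indices(num_frames: int, num_samples: int) -> list[int]:
    """Return frame indices spread evenly across a video of num_frames frames.

    Hypothetical helper: the paper does not specify its exact sampling rule.
    """
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    # If more samples are requested than frames exist, return every frame.
    if num_samples >= num_frames:
        return list(range(num_frames))
    # Assumption: a single sample is taken from the middle of the video.
    if num_samples == 1:
        return [num_frames // 2]
    # Evenly spaced positions that include the first and last frame.
    step = (num_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]


if __name__ == "__main__":
    # Example: a 5-minute video at 30 fps has 9000 frames.
    for k in (1, 16, 150):
        indices = linear_subsample_indices(9000, k)
        print(k, indices[:3], "...", indices[-1])
```

Under this scheme the sampled frames always span the whole video, so adding frames increases temporal density rather than extending coverage.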