NeMo: Needle in a Montage for Video-Language Understanding

Zi-Yuan Hu, Shuo Liang, Duo Zheng, Yanyang Li, Yeyao Tao, Shijia Huang, Wei Feng, Jia Qin, Jianguang Yu, Jing Huang, Meng Fang, Yin Li, Liwei Wang

TL;DR

NeMo introduces a novel Needle in a Montage task to stress-test VideoLLMs on long, temporally rich videos. It couples a scalable automated data-generation pipeline with NeMoBench, featuring up-to-date authorized content and two benchmark variants (Full and Clean), to enable continuous, large-scale evaluation. Across 20 state-of-the-art models, open-source systems show substantial gaps relative to closed-source models, especially on long montages, highlighting the need for improved long-context recall and temporal grounding. The work delivers a practical, scalable framework for robust multimodal evaluation and points to directions for enhancing open-source VideoLLMs.

Abstract

Recent advances in video large language models (VideoLLMs) call for new evaluation protocols and benchmarks for complex temporal reasoning in video-language understanding. Inspired by the needle in a haystack test widely used by LLMs, we introduce a novel task of Needle in a Montage (NeMo), designed to assess VideoLLMs' critical reasoning capabilities, including long-context recall and temporal grounding. To generate video question answering data for our task, we develop a scalable automated data generation pipeline that facilitates high-quality data synthesis. Built upon the proposed pipeline, we present NeMoBench, a video-language benchmark centered on our task. Specifically, our full set of NeMoBench features 31,378 automatically generated question-answer (QA) pairs from 13,486 videos with various durations ranging from seconds to hours. Experiments demonstrate that our pipeline can reliably and automatically generate high-quality evaluation data, enabling NeMoBench to be continuously updated with the latest videos. We evaluate 20 state-of-the-art models on our benchmark, providing extensive results and key insights into their capabilities and limitations. Our project page is available at: https://lavi-lab.github.io/NeMoBench.

Paper Structure

This paper contains 33 sections, 9 equations, 13 figures, 16 tables.

Figures (13)

  • Figure 1: Illustration of our Needle in a Montage (NeMo) task, showcasing examples of object and scene needles in an hour-long montage synthesized from numerous loosely related short video clips (see more details in Sec. \ref{sec:task_suite}).
  • Figure 2: Comparisons between our NeMoBench and recent VideoLLM benchmarks. Manual/Auto: raw QA pairs are constructed via manual annotation or automated data generation. Short: less than 2.5 minutes. Medium: between 2.5 and 15 minutes. Long: more than 15 minutes. Single/Multi: temporal grounding QA pairs with single or multiple targets. $\spadesuit$: requires manually annotated data from existing datasets. $\heartsuit$: a subset of tasks requires purely manual annotation. Official Authorization: ✓ indicates that all videos are collected with direct authorization from official platforms to ensure long-term usability (see Sec. \ref{sec:pipeline}); ✓$^\triangle$ denotes that the authorized videos are derived from Ego4D \cite{grauman2022ego4d}; ✗ indicates that the videos are crawled from public platforms (e.g., YouTube and ShutterStock).
  • Figure 3: Comparisons between our NeMoBench (Sec. \ref{sec:benchmark}) and other widely used VideoLLM benchmarks \cite{videomme,videohaystack}. Our benchmark is centered on the proposed NeMo task, specifically designed to probe critical reasoning capabilities in long video understanding, including long-context recall and temporal grounding.
  • Figure 5: Prompt design on NeMoBench.
  • Figure 6: Effects of montage duration on NeMoBench-Clean.
  • ...and 8 more figures
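
The Short/Medium/Long duration categories defined in the Figure 2 caption can be sketched as a small helper. This is an illustrative sketch only, not code from the NeMoBench release: the function name `duration_bucket` is hypothetical, and treating the 2.5-minute and 15-minute boundaries as belonging to Medium is an assumption, since the caption says only "between 2.5 and 15 minutes".

```python
def duration_bucket(seconds: float) -> str:
    """Map a video duration in seconds to the Short/Medium/Long
    categories from the Figure 2 caption.

    Assumption (not stated in the source): durations of exactly
    2.5 or 15 minutes are counted as Medium.
    """
    if seconds < 0:
        raise ValueError("duration must be non-negative")
    minutes = seconds / 60.0
    if minutes < 2.5:      # Short: less than 2.5 minutes
        return "Short"
    if minutes <= 15:      # Medium: between 2.5 and 15 minutes
        return "Medium"
    return "Long"          # Long: more than 15 minutes


if __name__ == "__main__":
    # A 30-second clip, a 10-minute video, and an hour-long montage.
    for secs in (30, 600, 3600):
        print(secs, duration_bucket(secs))
```

A helper of this kind would make it straightforward to group per-video results by duration category when comparing models on short versus long montages.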