Two Causally Related Needles in a Video Haystack

Miaoyu Li, Qin Chao, Boyang Li

TL;DR

Causal2Needles introduces a long-context video benchmark that targets two core capabilities: extracting and jointly reasoning about information from two distant video locations and modeling human-behavior causality. It combines 1-needle and 2-needle questions with complementary visual-grounding and image-description formats to combat textual bias, enabling diagnostic evaluation of world-modeling in VLMs. Empirical results show causal and multi-needle reasoning remain major challenges, with performance deteriorating as needle distance grows and open-source models lagging proprietary systems in world-modeling ability. The work highlights significant gaps in current VLMs and provides a publicly available dataset to spur progress in long-context video understanding and causal reasoning.

Abstract

Properly evaluating the ability of Video-Language Models (VLMs) to understand long videos remains a challenge. We propose a long-context video understanding benchmark, Causal2Needles, that assesses two crucial abilities insufficiently addressed by existing benchmarks: (1) extracting information from two separate locations (two needles) in a long video and understanding them jointly, and (2) modeling the world in terms of cause and effect in human behaviors. Causal2Needles evaluates these abilities using noncausal one-needle, causal one-needle, and causal two-needle questions. The most complex question type, the causal two-needle question, requires extracting information from both the cause and effect events in a long video and the associated narration text. To prevent textual bias, we introduce two complementary question formats: locating the video clip containing the answer, and verbally describing a visual detail from that video clip. Our experiments reveal that models excelling on existing benchmarks struggle with causal two-needle questions, and that model performance is negatively correlated with the distance between the two needles. These findings highlight critical limitations in current VLMs. The dataset is available at: https://huggingface.co/datasets/causal2needles/Causal2Needles

Paper Structure

This paper contains 42 sections, 22 figures, 5 tables.

Figures (22)

  • Figure 1: The logical solution process for the 2-needle questions of Causal2Needles. Each step involves an operation, with the input shown above the step and the output below the step. The question purposely refers to the bridge entity, "Superman's death," ambiguously as "tragedy." As a result, one must first resolve the bridge entity using Part 1 before answering Part 2. This question design mandates joint understanding of both the cause and effect events. Note that the steps are necessary only in an information-processing sense; a VLM may adopt different steps.
  • Figure 2: The evaluation framework of Causal2Needles. To help models understand the storyline, we also feed the full textual narration into the model. Four types of questions are designed for each pair of causally related events.
  • Figure 3: Comparison of model performance on Part 2 of VG 2-needle questions (average order) and on causal 1-needle questions. This is a fair comparison since both ask the model to retrieve the cause event. The dashed diagonal line represents equal performance across the two question types. Most points fall below the line, indicating that models perform worse on VG 2-needle questions than on causal 1-needle questions. The size of the dots indicates the model size. The stars indicate proprietary models.
  • Figure 4: Performance on VG 2-needle questions as the distance between the two needles grows. We report the average-order performance for models and the forward-order performance for human annotators. The model performance declines, whereas human performance stays mostly unchanged.
  • Figure 5: The answer distribution of various models in the forward evaluation of visual grounding 2-needle questions. GT denotes ground truth. None means no clip number is output. Predictions of open-source models are heavily concentrated in a few numbers, exhibiting significant bias.
  • ...and 17 more figures
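
The analyses described in the captions of Figures 4 and 5 (accuracy as a function of needle distance, and how concentrated a model's predicted clip numbers are) can be approximated with a few lines of Python. The following is only a minimal sketch under stated assumptions: the record fields `distance`, `pred`, and `gold`, the bin size, and the toy records are all hypothetical and are not taken from the paper or its dataset schema.

```python
from collections import Counter, defaultdict
from statistics import mean


def accuracy_by_distance(records, bin_size):
    """Bucket records by needle distance; return {bin_start: accuracy}."""
    bins = defaultdict(list)
    for r in records:
        # Integer-divide to find the lower edge of the distance bin.
        start = (r["distance"] // bin_size) * bin_size
        bins[start].append(1.0 if r["pred"] == r["gold"] else 0.0)
    return {start: mean(hits) for start, hits in sorted(bins.items())}


def prediction_share(records):
    """Return the fraction of answers that fall on each predicted clip number.

    A prediction of None stands for "no clip number was output".
    """
    counts = Counter(r["pred"] for r in records)
    total = sum(counts.values())
    return {clip: n / total for clip, n in counts.most_common()}


# Invented toy records for illustration only (not real benchmark results).
toy_records = [
    {"distance": 2, "pred": 5, "gold": 5},
    {"distance": 3, "pred": 7, "gold": 7},
    {"distance": 12, "pred": 5, "gold": 9},
    {"distance": 14, "pred": 5, "gold": 5},
    {"distance": 25, "pred": 5, "gold": 20},
    {"distance": 27, "pred": None, "gold": 3},
]

if __name__ == "__main__":
    print(accuracy_by_distance(toy_records, bin_size=10))
    print(prediction_share(toy_records))
```

A falling accuracy across bins would correspond to the trend Figure 4 reports for models, and a single clip number holding a large share of predictions would correspond to the bias Figure 5 reports for open-source models. How distance is actually measured (clips, seconds, or frames) is not specified in this excerpt, so the unit here is left abstract.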