Narrative Aligned Long Form Video Question Answering

Rahul Jain; Keval Doshi; Burak Uzkent; Garin Kessler

Abstract

Recent progress in multimodal large language models (MLLMs) has led to a surge of benchmarks for long-video reasoning. However, most existing benchmarks rely on localized cues and fail to capture narrative reasoning, the ability to track intentions, connect distant events, and reconstruct causal chains across an entire movie. We introduce NA-VQA, a benchmark designed to evaluate deep temporal and narrative reasoning in long-form videos. NA-VQA contains 88 full-length movies and 4.4K open-ended question-answer pairs, each grounded in multiple evidence spans labeled as Short, Medium, or Far to assess long-range dependencies. By requiring generative, multi-scene answers, NA-VQA tests whether models can integrate dispersed narrative information rather than rely on shallow pattern matching. To address the limitations of existing approaches, we propose Video-NaRA, a narrative-centric framework that builds event-level chains and stores them in a structured memory for retrieval during reasoning. Extensive experiments show that state-of-the-art MLLMs perform poorly on questions requiring far-range evidence, highlighting the need for explicit narrative modeling. Video-NaRA improves long-range reasoning performance by up to 3 percent, demonstrating its effectiveness in handling complex narrative structures. We will release NA-VQA upon publication.
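
To make the structure of a benchmark item concrete, the following is a minimal sketch of how one NA-VQA question could be represented and assigned a distance label. The field names, the helper `distance_label`, and both thresholds are assumptions made for illustration only; the abstract does not state how Short, Medium, and Far are delimited, nor how the released data will be formatted.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical thresholds in seconds. The source does not define where
# Short ends and Far begins; these values are placeholders.
SHORT_MAX_GAP_S = 5 * 60
MEDIUM_MAX_GAP_S = 30 * 60


@dataclass
class EvidenceSpan:
    start_s: float  # span start, seconds from the beginning of the movie
    end_s: float    # span end, seconds from the beginning of the movie


@dataclass
class QASample:
    movie_id: str
    question: str
    answer: str                  # open-ended, free-text answer
    evidence: List[EvidenceSpan]


def distance_label(sample: QASample) -> str:
    """Label a sample by the gap between its earliest and latest evidence.

    Assumption: the label depends on the time between the end of the first
    evidence span and the start of the last one.
    """
    spans = sorted(sample.evidence, key=lambda s: s.start_s)
    gap = spans[-1].start_s - spans[0].end_s
    if gap <= SHORT_MAX_GAP_S:
        return "Short"
    if gap <= MEDIUM_MAX_GAP_S:
        return "Medium"
    return "Far"
```

Under these placeholder thresholds, a question whose evidence lies in the opening minutes and again near the end of a movie would be labeled Far, which is the category on which the abstract reports the weakest model performance.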

Paper Structure (30 sections, 6 equations, 4 figures, 4 tables)

Introduction
Related Work
Long Form Video Reasoning Benchmark
MLLMs on Long-Form Videos
Memory-Based Architecture
Constructing NA-VQA Dataset
Raw Video Collection and Related Data
Event Extraction and Structuring
Question and Answer Generation
Automatic Data Filtering
QA Validator
QA Refiner
Dataset Statistics
Proposed Method (Video-NaRA)
Narrative Memory Construction
...and 15 more sections

Figures (4)

Figure 1: Illustration of an NA-VQA sample. Each sample consists of a question about a full movie, the key evidence scenes collected from different parts of the timeline, and the final answer. This example shows how NA-VQA tests a model’s ability to connect scattered events and recover the full sequence of what happened.
Figure 2: Video-NaRA pipeline. We begin by generating a short description for every clip using Qwen-VL 2.5 (7B). Using both the clip and its description, the model groups clips into narrative slots and builds a structured narrative memory. Given a query, the retrieval module selects the most relevant clips, which are then passed to a fine-tuned reasoner (Qwen-VL 2.5 7B) to produce the final answer.
Figure 3: NA-VQA dataset creation pipeline. We start with raw movie data, extract event-level descriptions, generate an initial set of QA pairs using an LLM, validate them through deeper analysis, and finally refine the QA outputs using an LLM-based refiner. This multi-stage process ensures high-quality, narrative-grounded VQA annotations for long-form movie videos.
Figure 4: Comprehensive question analysis showing: (a) distribution by scene distance category (short, medium, far), (b) distribution by number of evidence scenes required per question, and (c) distribution by reasoning type (narrative, causal, theme, etc.).
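
The memory-and-retrieval flow described in the Figure 2 caption can be illustrated with a small sketch. This is not the Video-NaRA implementation: the class and field names are hypothetical, slot labels are assumed to be assigned upstream, and retrieval uses plain word overlap as a stand-in for whatever retrieval module the paper actually uses. In the described pipeline, the clip descriptions would come from the captioning model and the retrieved clips would be passed to the fine-tuned reasoner.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Set


def _tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric word set of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


@dataclass
class Clip:
    clip_id: int
    description: str  # short clip description (from a captioner in the paper)
    slot: str         # hypothetical narrative-slot label, assigned upstream


@dataclass
class NarrativeMemory:
    # Maps a narrative slot to the clips grouped under it.
    slots: Dict[str, List[Clip]] = field(default_factory=dict)

    def add(self, clip: Clip) -> None:
        self.slots.setdefault(clip.slot, []).append(clip)

    def retrieve(self, query: str, top_k: int = 2) -> List[Clip]:
        """Return up to top_k clips ranked by word overlap with the query.

        Word overlap is a placeholder scoring rule chosen for this sketch.
        Ties are broken by clip order; clips with no overlap are dropped.
        """
        q = _tokens(query)
        scored = [
            (len(q & _tokens(clip.description)), clip)
            for clips in self.slots.values()
            for clip in clips
        ]
        scored = [pair for pair in scored if pair[0] > 0]
        scored.sort(key=lambda pair: (-pair[0], pair[1].clip_id))
        return [clip for _, clip in scored[:top_k]]
```

A caller would add every described clip to the memory once per movie and then call `retrieve` per question; replacing the overlap score with an embedding similarity would leave the surrounding structure unchanged.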
