MoReVQA: Exploring Modular Reasoning Models for Video Question Answering

Juhong Min, Shyamal Buch, Arsha Nagrani, Minsu Cho, Cordelia Schmid

TL;DR

MoReVQA introduces a training-free, three-stage modular framework (event parsing, grounding, reasoning) with an external memory to tackle videoQA. By decomposing planning and leveraging few-shot prompts, it achieves state-of-the-art results across four standard benchmarks and provides interpretable intermediate outputs. The work highlights the brittleness of single-stage planners and demonstrates how grounding-focused stages improve accuracy and robustness, with extensions to grounded QA and long-form captioning. This approach offers a practical, extensible path for interpretable multimodal reasoning in video comprehension.

Abstract

This paper addresses the task of video question answering (videoQA) via a decomposed multi-stage, modular reasoning framework. Previous modular methods have shown promise with a single planning stage ungrounded in visual content. However, through a simple and effective baseline, we find that such systems can lead to brittle behavior in practice for challenging videoQA settings. Thus, unlike traditional single-stage planning methods, we propose a multi-stage system consisting of an event parser, a grounding stage, and a final reasoning stage in conjunction with an external memory. All stages are training-free, and performed using few-shot prompting of large models, creating interpretable intermediate outputs at each stage. By decomposing the underlying planning and task complexity, our method, MoReVQA, improves over prior work on standard videoQA benchmarks (NExT-QA, iVQA, EgoSchema, ActivityNet-QA) with state-of-the-art results, and extensions to related tasks (grounded videoQA, paragraph captioning).
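
The staged design described in the abstract can be illustrated with a minimal sketch, assuming each stage is a function that reads from and writes to a shared external memory. All names here (`event_parser`, `grounding`, `reasoning`, `run_pipeline`) are hypothetical, and the stage bodies are toy string heuristics standing in for the few-shot prompted large models; this is not the authors' implementation.

```python
from typing import Callable, Dict, List

# Hypothetical stage signature: (question, frames, memory) -> None.
# "frames" are toy text descriptions standing in for real video frames.
Memory = Dict[str, object]
Stage = Callable[[str, List[str], Memory], None]


def event_parser(question: str, frames: List[str], memory: Memory) -> None:
    # Stand-in for a prompted LLM that parses events from the question text.
    memory["events"] = [question.rstrip("?").lower()]


def grounding(question: str, frames: List[str], memory: Memory) -> None:
    # Stand-in for a prompted VLM: keep frames sharing a word with any event.
    words = {w for event in memory["events"] for w in event.split()}
    memory["grounded_frames"] = [
        i for i, frame in enumerate(frames) if words & set(frame.lower().split())
    ]


def reasoning(question: str, frames: List[str], memory: Memory) -> None:
    # Stand-in for a prompted LLM that answers from the grounded evidence.
    evidence = [frames[i] for i in memory["grounded_frames"]]
    memory["answer"] = f"answer based on {len(evidence)} grounded frame(s)"


def run_pipeline(question: str, frames: List[str], stages: List[Stage]) -> Memory:
    """Run the stages in order; each one leaves an inspectable trace in memory."""
    memory: Memory = {}
    for stage in stages:
        stage(question, frames, memory)
    return memory


memory = run_pipeline(
    "Why is the cat lying on its back?",
    ["a dog runs", "the cat rolls over", "a cat lying down"],
    [event_parser, grounding, reasoning],
)
```

Because every stage writes its output into `memory`, the intermediate results (parsed events, grounded frame indices) stay available for inspection, which is the interpretability property the abstract emphasizes.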

Paper Structure (18 sections, 2 equations, 12 figures, 6 tables)

Figures (12)

  • Figure 1: MoReVQA: a new multi-stage, modular reasoning model for videoQA. Prior work relies on either (a) black-box end-to-end models that are difficult to interpret, or (b) modular systems where an interpretable planning step (program generation) is done in a single, ungrounded stage. (i) In this work, we find that single-stage planning leads in practice to brittle behavior, underperforming a new simple baseline (JCEF) that captions frames and predicts an answer (with two modules from (b)). (ii) We then introduce our new MoReVQA method incorporating both modularity and multi-stage planning, providing interpretable, grounded planning and execution traces, while simultaneously delivering improvements in overall accuracy by effectively decomposing the underlying task complexity (still using consistent base models with (b)). Above: Q is question, V is video, A is answer.
  • Figure 2: A simple, strong baseline -- JCEF. Our proposed baseline consists of a zero-shot prompted vision-language model (VLM) which is used to caption $n$ uniformly sampled frames from a video ($n$ is all frames at 1 FPS unless explicitly stated). These captions are then stored in an external memory, which is passed to a zero-shot prompted LLM that is used to answer a question about the video. We show that this baseline outperforms existing visual programming methods by a large margin and investigate ways to more effectively improve upon it in a modular, multi-stage manner.
  • Figure 3: Modular Reasoning for Video Question-Answering (MoReVQA). To address the limitations of single-stage planning LLMs, we propose a new multi-stage, modular method $M_{\text{multi-stage}}$ that decomposes planning and execution into three key steps, motivated by sub-tasks inherent to videoQA: (i) event parsing $M_1$, (ii) grounding $M_2$, and (iii) reasoning $M_3$. See Section \ref{sec:method:multi-stage} for additional details.
  • Figure 4: Example qualitative result of MoReVQA on NExT-QA. We observe that the intermediate outputs from our MoReVQA model are interpretable: the event parsing stage parses key events from the language, along with other tool-use metadata. The grounding stage then determines which frames contain the 'cat lying on its back', and the reasoning stage reasons about relevant sub-questions for the final answer, which, when combined with general video-level context (a subset of frame captions), gives us the final correct answer. We observe that JCEF and ViperGPT+ fail to predict the correct answer for the same example (Sec. \ref{sec:results:analysis}); we provide more examples and analysis in the supplement (\ref{sec_supp:additional_qual}).
  • Figure A1: Event parser statistics of MoReVQA. We observe that the event parser of our (training-free) MoReVQA method naturally identifies characteristics of the underlying dataset distribution (here, NExT-QA) and this corresponds to appropriate API usage downstream in a well-correlated manner (without explicit supervision). Note that since our API is designed to work consistently across all datasets/settings (i.e. the prediction values are not exactly the same as the dataset-specific metadata), we are reporting an approximate categorization of our predictions vs. the dataset-specific metadata. The graph shows the correlation between our system (x-axis) and the dataset (y-axis), where the diagonal represents an ideal 1:1 mapping.
  • ...and 7 more figures
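
The JCEF baseline in the Figure 2 caption (caption $n$ uniformly sampled frames, store the captions in an external memory, then pass that memory to an LLM) can be sketched as follows. This is a minimal illustration under stated assumptions: `sample_uniform`, `jcef`, `caption_fn`, and `answer_fn` are hypothetical names, the two callables stand in for the zero-shot prompted VLM and LLM, and the midpoint-based sampling rule is one reasonable choice rather than the one the paper specifies.

```python
from typing import Callable, List, Sequence, TypeVar

Frame = TypeVar("Frame")


def sample_uniform(num_frames: int, n: int) -> List[int]:
    """Return up to n frame indices spread evenly over [0, num_frames)."""
    if n <= 0 or num_frames <= 0:
        return []
    n = min(n, num_frames)
    step = num_frames / n
    # Take the midpoint of each of the n equal-width segments (an assumption).
    return [int(i * step + step / 2) for i in range(n)]


def jcef(
    frames: Sequence[Frame],
    question: str,
    caption_fn: Callable[[Frame], str],          # stand-in for the prompted VLM
    answer_fn: Callable[[str, List[str]], str],  # stand-in for the prompted LLM
    n: int,
) -> str:
    # External memory: one caption per sampled frame, tagged with its index.
    memory = [
        f"[frame {idx}] {caption_fn(frames[idx])}"
        for idx in sample_uniform(len(frames), n)
    ]
    return answer_fn(question, memory)


# Toy usage with stub models; real models would replace both lambdas.
toy_frames = ["cat sits", "cat rolls", "cat sleeps", "dog enters"]
result = jcef(
    toy_frames,
    "What does the cat do?",
    caption_fn=lambda frame: frame.upper(),
    answer_fn=lambda q, mem: f"{len(mem)} captions: " + "; ".join(mem),
    n=2,
)
```

Keeping the captioner and the answerer as injected callables mirrors the modular framing of the caption: either model can be swapped without touching the sampling or memory logic.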