STAIR: Spatial-Temporal Reasoning with Auditable Intermediate Results for Video Question Answering

Yueqian Wang; Yuxuan Wang; Kai Chen; Dongyan Zhao

TL;DR

STAIR targets temporal reasoning in video QA with an auditable neural module network: a program generator decomposes each question into sub-tasks, and 16 lightweight modules execute them. Intermediate supervision makes the module outputs more accurate; these outputs make the reasoning interpretable and can also serve as prompts that improve pre-trained models. Experiments on AGQA and AGQA2 show strong results and improved explainability, while experiments on STAR and MSRVTT-QA show that the approach generalizes to datasets without program annotations and works well alongside pre-trained models. The work presents modularity and interpretability as practical benefits for temporally aware video understanding, with stronger module designs and broader tasks as possible future extensions.

Abstract

Video question answering models have developed rapidly in recent years. However, most models can only handle videos that are simple in terms of temporal reasoning, and their performance tends to drop when answering temporal-reasoning questions on long and informative videos. To tackle this problem, we propose STAIR, a Spatial-Temporal Reasoning model with Auditable Intermediate Results for video question answering. STAIR is a neural module network that contains a program generator, which decomposes a given question into a hierarchical combination of several sub-tasks, and a set of lightweight neural modules, each of which completes one of these sub-tasks. Although neural module networks have been widely studied on image-text tasks, applying them to videos is non-trivial, as reasoning on videos requires different abilities. In this paper, we define a set of basic video-text sub-tasks for video question answering and design a set of lightweight modules to complete them. Unlike most prior work, the modules of STAIR return intermediate outputs specific to their intentions instead of always returning attention maps, which makes these outputs easier to interpret and easier to combine with pre-trained models. We also introduce intermediate supervision to make these intermediate outputs more accurate. We conduct extensive experiments on several video question answering datasets under various settings to show STAIR's performance, explainability, compatibility with pre-trained models, and applicability when program annotations are not available. Code: https://github.com/yellow-binary-tree/STAIR

Paper Structure (22 sections, 8 equations, 3 figures, 8 tables)

Introduction
Related Works
Video Question Answering.
Neural Module Networks.
Methodology
Neural Modules
Programs and the Program Generator
Intermediate Supervision
Training Procedures
Experiments
Model Implementations
Implementation.
Baselines.
Model Performance
Evaluation and Visualization of Modules' Intermediate Output
...and 7 more sections

Figures (3)

Figure 1: Overview of STAIR.
Figure 2: A Diagram of Intermediate Supervision.
Figure 3: Examples of a successful case (left) and a failing case (right). In the first case, the person is touching things throughout the video, so the ExistsFrame module returns a uniform distribution over all frames. The last two things the person touches are a phone and a tissue. The Filter module finds only one of them, "phone", but since it is not equal to the choice "table", the Equals module still returns the correct final answer "No". In the second case, the Localize module successfully finds when the person is taking some clothes and the ExistsFrame module successfully finds when the person is on the side of something, but the Filter module fails to recognize the exact thing on the side of the person (probably because the video quality is low and the pillow is blocked by the body). Outputs of FilterFrame modules are too complex to be visualized.
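
The first case in Figure 3 can be pictured as a nested program whose modules each return an inspectable value. The following is a minimal toy sketch of that idea, not STAIR's implementation: the real modules are neural networks, whereas the functions here are hand-written rules, and the video representation, function signatures, program encoding, and `run` helper are all assumptions made for illustration. Only the module names ExistsFrame, Filter, and Equals come from the caption.

```python
from typing import Callable, Dict, List, Set, Tuple, Union

# Hypothetical toy "video": each frame is the set of objects touched in it.
Video = List[Set[str]]
# Hypothetical program encoding: a leaf string, or (module_name, *sub_programs).
Program = Union[str, Tuple]


def exists_frame(video: Video) -> List[float]:
    """Return a normalized weight per frame in which anything is touched."""
    hits = [1.0 if frame else 0.0 for frame in video]
    total = sum(hits)
    return [h / total for h in hits] if total else hits


def filter_objects(video: Video, weights: List[float]) -> Set[str]:
    """Collect the objects touched in frames that received non-zero weight."""
    found: Set[str] = set()
    for frame, weight in zip(video, weights):
        if weight > 0:
            found |= frame
    return found


def equals(video: Video, found: Set[str], choice: str) -> str:
    """Answer "Yes" if the candidate choice is among the found objects."""
    return "Yes" if choice in found else "No"


MODULES: Dict[str, Callable] = {
    "ExistsFrame": exists_frame,
    "Filter": filter_objects,
    "Equals": equals,
}


def run(program: Program, video: Video, trace: list):
    """Execute a nested program bottom-up, logging every module output."""
    if isinstance(program, str):  # leaf argument such as the choice "table"
        return program
    name, *sub_programs = program
    inputs = [run(sub, video, trace) for sub in sub_programs]
    output = MODULES[name](video, *inputs)
    trace.append((name, output))  # the auditable intermediate result
    return output


video = [{"phone"}, {"phone"}, {"tissue"}]
program = ("Equals", ("Filter", ("ExistsFrame",)), "table")
trace: list = []
answer = run(program, video, trace)
for module_name, module_output in trace:
    print(module_name, module_output)
print("answer:", answer)
```

Because every module writes a typed output into `trace`, a wrong final answer can be traced back to the module that produced the first wrong intermediate value, which is the kind of inspection the caption performs on the failing case.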
