Neural-Symbolic VideoQA: Learning Compositional Spatio-Temporal Reasoning for Real-world Video Question Answering

Lili Liang; Guanglu Sun; Jin Qiu; Lizhong Zhang

TL;DR

This work tackles real-world VideoQA with compositional spatio-temporal questions by introducing NS-VideoQA, a neural-symbolic framework that first converts video content into a Symbolic Representation ($SR$) using a Scene Parser Network (SPN) and then performs top-down question decomposition and bottom-up program execution with a Symbolic Reasoning Machine (SRM). The SPN jointly extracts static scene information ($R_{static}(v)$) and dynamic action chronologies ($R_{dynamic}(v)$) to form $SR(v)=(R_{static}(v),R_{dynamic}(v))$, while SRM employs a polymorphic program executor to answer questions with traceable intermediate reasoning. Evaluated on the AGQA Decomp benchmark, NS-VideoQA outperforms purely neural VideoQA models in compositional accuracy (CA), right for the wrong reasons (RWR), and internal consistency (IC), demonstrating stronger capability in spatio-temporal and logical inference and enabling error analysis via execution traces. The framework thus advances interpretable, robust real-world video reasoning with a clear pathway to improving symbolic representations and rule-clarity for even more reliable multi-step question answering.
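
The CA and RWR metrics named above can be illustrated with a short sketch. It assumes the usual reading of the AGQA Decomp metrics: CA is the accuracy on parent questions whose sub-questions were all answered correctly, and RWR is the accuracy on parent questions with at least one wrongly answered sub-question. The function name `ca_rwr`, the record layout, and the toy records are hypothetical; IC is omitted because it depends on benchmark-specific logical consistency rules.

```python
import math

def ca_rwr(records):
    """Compute (CA, RWR) from (parent_correct, [sub_correct, ...]) records.

    Assumed definitions (not taken from this page):
      CA  = parent accuracy when every sub-question is answered correctly
      RWR = parent accuracy when at least one sub-question is answered wrongly
    """
    ca_hits = ca_total = rwr_hits = rwr_total = 0
    for parent_ok, subs_ok in records:
        if all(subs_ok):
            ca_total += 1
            ca_hits += int(parent_ok)
        else:
            rwr_total += 1
            rwr_hits += int(parent_ok)
    ca = ca_hits / ca_total if ca_total else math.nan
    rwr = rwr_hits / rwr_total if rwr_total else math.nan
    return ca, rwr

# Hypothetical toy predictions: (parent correct?, sub-questions correct?)
records = [
    (True, [True, True]),
    (True, [True]),
    (False, [True, True]),
    (True, [True, False]),
    (False, [False, False]),
    (False, [False]),
]
ca, rwr = ca_rwr(records)
```

Under this reading, a high RWR means the model often gets the parent question right despite faulty intermediate steps, so it should be interpreted together with CA rather than on its own.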

Abstract

Compositional spatio-temporal reasoning poses a significant challenge in video question answering (VideoQA). Existing approaches struggle to establish effective symbolic reasoning structures, which are crucial for answering compositional spatio-temporal questions. To address this challenge, we propose a neural-symbolic framework called Neural-Symbolic VideoQA (NS-VideoQA), specifically designed for real-world VideoQA tasks. NS-VideoQA makes two contributions: 1) A Scene Parser Network (SPN) transforms static-dynamic video scenes into a Symbolic Representation (SR) that structures persons, objects, relations, and action chronologies. 2) A Symbolic Reasoning Machine (SRM) performs top-down question decomposition and bottom-up compositional reasoning. Specifically, a polymorphic program executor is constructed for internally consistent reasoning from SR to the final answer. As a result, NS-VideoQA not only improves compositional spatio-temporal reasoning in real-world VideoQA tasks, but also enables step-by-step error analysis by tracing the intermediate results. Experimental evaluations on the AGQA Decomp benchmark demonstrate the effectiveness of the proposed NS-VideoQA framework. Empirical studies further confirm that NS-VideoQA exhibits internal consistency in answering compositional questions and significantly improves the capability of spatio-temporal and logical inference for VideoQA tasks.
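
As a rough illustration of the pipeline the abstract describes, the sketch below pairs a toy SR (static relations plus a dynamic action chronology) with a small bottom-up program executor that records a trace of intermediate results. It is not the authors' implementation: the class names, the operator set in `OPS`, the nested-tuple program format, and the scene contents are all hypothetical. "Polymorphic" is read here as one operator accepting several symbol types, which is an assumption.

```python
from dataclasses import dataclass, field
from functools import singledispatch

@dataclass(frozen=True)
class Relation:          # static symbol: person-object relation in one frame
    frame: int
    subject: str
    predicate: str
    obj: str

@dataclass(frozen=True)
class Action:            # dynamic symbol: action with a time span in seconds
    label: str
    start: float
    end: float

@dataclass
class SymbolicRepresentation:   # SR(v) = (R_static(v), R_dynamic(v))
    static: list[Relation] = field(default_factory=list)
    dynamic: list[Action] = field(default_factory=list)

@singledispatch
def time_of(item):
    raise TypeError(f"no time defined for {type(item).__name__}")

@time_of.register(Relation)
def _(item):
    return item.frame

@time_of.register(Action)
def _(item):
    return item.start

# Hypothetical operator set; "exists" and "first" accept relations or actions.
OPS = {
    "relations": lambda sr, pred, obj: [
        r for r in sr.static if r.predicate == pred and r.obj == obj
    ],
    "actions": lambda sr, label: [a for a in sr.dynamic if a.label == label],
    "exists": lambda sr, items: "yes" if items else "no",
    "first": lambda sr, items: min(items, key=time_of),
    "and": lambda sr, a, b: "yes" if a == "yes" and b == "yes" else "no",
}

def execute(node, sr, trace):
    """Evaluate a nested-tuple program bottom-up, logging every step."""
    op, *args = node
    # Sub-programs (tuples) are evaluated before their parent operator.
    values = [execute(a, sr, trace) if isinstance(a, tuple) else a for a in args]
    result = OPS[op](sr, *values)
    trace.append((op, result))
    return result

sr = SymbolicRepresentation(
    static=[Relation(3, "person", "holding", "blanket"),
            Relation(7, "person", "sitting_on", "chair")],
    dynamic=[Action("hold a blanket", 1.0, 4.0),
             Action("sit on a chair", 2.0, 9.0)],
)
# "Is the person holding a blanket and sitting on a chair?"
program = ("and",
           ("exists", ("relations", "holding", "blanket")),
           ("exists", ("actions", "sit on a chair")))
trace = []
answer = execute(program, sr, trace)
```

Because every operator appends its result to `trace`, a wrong final answer can be followed back to the first sub-program whose output disagrees with the video, which is the kind of step-by-step error analysis the abstract refers to.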

Paper Structure (22 sections, 27 equations, 14 figures, 15 tables, 1 algorithm)

Introduction
Related Work
Transformer-based VideoQA
Neuro-symbolic Methods
Compositionality Benchmarks
NS-VideoQA
Symbolic Representation
Static-dynamic Scene Parser Network
Symbolic Reasoning Machine
Experiments
Dataset and Metrics
Performance on NS-VideoQA
Performance on Choose, Equals, Conjunction, and First/Last
Visualization of execution traces and static-dynamic SR
Conclusion
...and 7 more sections

Figures (14)

Figure 1: An example of compositional spatio-temporal VideoQA. We mark the objects, relations, actions, and time in red, green, orange, and purple, respectively. The red dashed box shows a decomposition step from a compositional question to its sub-questions.
Figure 2: NS-VideoQA uses SPN (I, II) to convert the input video into SR, then uses SRM (III, IV) to decompose the compositional question into a program. It applies reasoning rules iteratively on SR according to the program and finally generates the answer to the compositional question.
Figure 3: A chart displays the distribution of sub-question types in the train and test sets.
Figure 4: The reasoning for the $Interaction\ Temporal\ Loc.$ type. Top left: the red boxes denote the detected objects "chair" and "blanket". Bottom left: the static symbolic representation of verb, attention, spatial, and contact. Right: the answer "Yes" is obtained by reasoning based on the program.
Figure 5: The reasoning for the $Shortest\ Action$ type. Top left: the red box denotes the detected object "blanket". Bottom left: the symbolic representation of dynamic scene. Right: the answer "hold a blanket" is obtained by reasoning based on the program.
...and 9 more figures
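
The $Shortest\ Action$ reasoning described in the Figure 5 caption can be sketched as a query over the dynamic part of SR. This is a minimal illustration under assumptions: the tuple layout `(label, start, end)`, the function name `shortest_action`, and all labels and timestamps other than "hold a blanket" are hypothetical and not taken from the paper.

```python
def shortest_action(chronology):
    """Return the label of the action with the smallest duration.

    chronology: list of (label, start_seconds, end_seconds) tuples,
    standing in for the dynamic scene representation.
    """
    label, _, _ = min(chronology, key=lambda a: a[2] - a[1])
    return label

# Hypothetical action chronology for a single video.
chronology = [
    ("sit on a chair", 0.0, 12.0),      # duration 12.0
    ("hold a blanket", 2.0, 5.5),       # duration 3.5
    ("watch television", 4.0, 15.0),    # duration 11.0
]
answer = shortest_action(chronology)
```

With these made-up time spans the query selects "hold a blanket", matching the answer shown in the figure caption.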
