Neural-Symbolic VideoQA: Learning Compositional Spatio-Temporal Reasoning for Real-world Video Question Answering
Lili Liang, Guanglu Sun, Jin Qiu, Lizhong Zhang
TL;DR
This work addresses real-world VideoQA on compositional spatio-temporal questions with NS-VideoQA, a neural-symbolic framework. A Scene Parser Network (SPN) first converts video content into a Symbolic Representation $SR(v)=(R_{static}(v),R_{dynamic}(v))$, jointly extracting static scene information $R_{static}(v)$ and dynamic action chronologies $R_{dynamic}(v)$. A Symbolic Reasoning Machine (SRM) then decomposes each question top-down and executes the resulting program bottom-up, using a polymorphic program executor whose intermediate results can be traced. On the AGQA Decomp benchmark, NS-VideoQA outperforms purely neural VideoQA models on compositional accuracy (CA), right for the wrong reasons (RWR), and internal consistency (IC), demonstrating stronger spatio-temporal and logical inference and enabling error analysis via execution traces. The framework thus advances interpretable, robust real-world video reasoning, and points to better symbolic representations and clearer rules as a path toward more reliable multi-step question answering.
Abstract
Compositional spatio-temporal reasoning poses a significant challenge in video question answering (VideoQA). Existing approaches struggle to establish effective symbolic reasoning structures, which are crucial for answering compositional spatio-temporal questions. To address this challenge, we propose a neural-symbolic framework called Neural-Symbolic VideoQA (NS-VideoQA), specifically designed for real-world VideoQA tasks. The novelty and advantages of NS-VideoQA are two-fold: 1) A Scene Parser Network (SPN) transforms static-dynamic video scenes into a Symbolic Representation (SR), structuring persons, objects, relations, and action chronologies. 2) A Symbolic Reasoning Machine (SRM) performs top-down question decomposition and bottom-up compositional reasoning. Specifically, a polymorphic program executor is constructed for internally consistent reasoning from the SR to the final answer. As a result, NS-VideoQA not only improves compositional spatio-temporal reasoning in real-world VideoQA tasks, but also enables step-by-step error analysis by tracing the intermediate results. Experimental evaluations on the AGQA Decomp benchmark demonstrate the effectiveness of the proposed NS-VideoQA framework. Empirical studies further confirm that NS-VideoQA exhibits internal consistency in answering compositional questions and significantly improves spatio-temporal and logical inference for VideoQA tasks.
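To make the two-stage idea concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes the SR can be approximated as a list of per-frame relations (static) plus a list of timed actions (dynamic), and that a decomposed question is a nested tuple program. The class names, the operator names (`exists_relation`, `exists_action`, `before`, `and`), and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """One static fact (hypothetical schema): subject-predicate-object in a frame."""
    frame: int
    subject: str
    predicate: str
    obj: str


@dataclass(frozen=True)
class Action:
    """One dynamic fact (hypothetical schema): an action with a frame interval."""
    name: str
    start: int
    end: int


@dataclass
class SymbolicRepresentation:
    static: list   # stands in for R_static(v)
    dynamic: list  # stands in for R_dynamic(v)


def execute(program, sr, trace):
    """Evaluate a nested program bottom-up, logging every intermediate result."""
    op, *args = program
    # Sub-programs (tuples) are evaluated first; string literals pass through.
    vals = [execute(a, sr, trace) if isinstance(a, tuple) else a for a in args]
    if op == "exists_relation":
        predicate, obj = vals
        result = any(r.predicate == predicate and r.obj == obj for r in sr.static)
    elif op == "exists_action":
        (name,) = vals
        result = any(a.name == name for a in sr.dynamic)
    elif op == "before":
        first, second = vals
        result = any(
            a.name == first and b.name == second and a.end <= b.start
            for a in sr.dynamic
            for b in sr.dynamic
        )
    elif op == "and":
        result = all(vals)
    else:
        raise ValueError(f"unknown operator: {op}")
    trace.append((op, tuple(vals), result))
    return result


# Hypothetical parsed video: a person holds a cup, opens a door, then drinks.
sr = SymbolicRepresentation(
    static=[Relation(3, "person", "holding", "cup")],
    dynamic=[Action("open door", 0, 10), Action("drink from cup", 12, 20)],
)

# Hypothetical decomposition of: "Did they hold a cup, and open the door
# before drinking from it?"
program = (
    "and",
    ("exists_relation", "holding", "cup"),
    ("before", "open door", "drink from cup"),
)

trace = []
answer = execute(program, sr, trace)
for step in trace:
    print(step)
print("answer:", answer)
```

Because every sub-question leaves an entry in `trace`, a wrong final answer can be followed back to the first sub-result that disagrees with the video, which is the kind of step-by-step error analysis the abstract describes.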
