SRNN: Spatiotemporal Relational Neural Network for Intuitive Physics Understanding
Fei Yang
TL;DR
This work proposes SRNN, a brain-inspired model for intuitive physics understanding that unifies object attributes, relations, and timeline through Hebbian learning across What and How pathways. The model grounds language generation in the same neural substrate used for perception. Evaluated on CLEVRER, it shows competitive accuracy, and cognitive ablations reveal biases in the benchmark. The white-box design enables precise error analysis and root-cause tracing, illustrating the viability of translating biological principles into engineered systems for constrained environments. Overall, SRNN offers a principled bridge between perception, reasoning, and language, and points toward richer benchmarks and developmental extensions for human-like physical cognition.
Abstract
Human prowess in intuitive physics remains unmatched by machines. To bridge this gap, we argue for a fundamental shift towards brain-inspired computational principles. This paper introduces the Spatiotemporal Relational Neural Network (SRNN), a model that establishes a unified neural representation for object attributes, relations, and timeline, with computations governed by a Hebbian ``Fire Together, Wire Together'' mechanism across dedicated \textit{What} and \textit{How} pathways. This unified representation is used directly to generate structured linguistic descriptions of the visual scene, bridging perception and language within a shared neural substrate. On the CLEVRER benchmark, SRNN achieves competitive performance, indicating that it can represent essential spatiotemporal relations from the visual stream. A cognitive ablation analysis further reveals a benchmark bias and outlines a path towards more holistic evaluation. Finally, the white-box nature of SRNN allows the root causes of errors to be pinpointed precisely. Our work provides a proof of concept for translating key principles of biological intelligence into engineered systems for intuitive physics understanding in constrained environments.
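
For readers unfamiliar with the Hebbian principle named above, the following is a minimal sketch of the textbook rule, in which a connection is strengthened in proportion to the joint activity of the two units it links. It is not SRNN's actual update rule, which this abstract does not specify; the function name `hebbian_update`, the learning rate `lr`, and the toy activity vectors are all assumptions made for illustration.

```python
def hebbian_update(weights, pre, post, lr=0.1):
    """Return new weights after one plain Hebbian step.

    Illustrative textbook rule (an assumption, not taken from the paper):
        new_w[i][j] = w[i][j] + lr * post[i] * pre[j]
    so a weight grows only when both connected units are active together.
    """
    return [
        [w_ij + lr * post_i * pre_j for w_ij, pre_j in zip(row, pre)]
        for row, post_i in zip(weights, post)
    ]


# Two presynaptic and two postsynaptic units; all weights start at zero.
weights = [[0.0, 0.0], [0.0, 0.0]]
pre = [1.0, 0.0]   # only presynaptic unit 0 fires
post = [1.0, 0.0]  # only postsynaptic unit 0 fires

weights = hebbian_update(weights, pre, post, lr=0.5)
print(weights)  # [[0.5, 0.0], [0.0, 0.0]]
```

Only the connection between the two co-active units is strengthened; every weight touching an inactive unit stays at zero, which is the sense of "Fire Together, Wire Together".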
