SRNN: Spatiotemporal Relational Neural Network for Intuitive Physics Understanding

Fei Yang

TL;DR

This work targets intuitive physics understanding by proposing SRNN, a brain-inspired model that unifies object attributes, relations, and timeline through Hebbian learning across What and How pathways. It grounds language generation in the same neural substrate used for perception, and evaluates on CLEVRER, showing competitive accuracy and revealing benchmark biases via cognitive ablations. The white-box design enables precise error analysis and root-cause tracing, illustrating the viability of translating biological principles into engineered systems for constrained environments. Overall, SRNN offers a principled bridge between perception, reasoning, and language, and points toward richer benchmarks and developmental extensions for human-like physical cognition.

Abstract

Human prowess in intuitive physics remains unmatched by machines. To bridge this gap, we argue for a fundamental shift towards brain-inspired computational principles. This paper introduces the Spatiotemporal Relational Neural Network (SRNN), a model that establishes a unified neural representation for object attributes, relations, and timeline, with computations governed by a Hebbian "Fire Together, Wire Together" mechanism across dedicated What and How pathways. This unified representation is directly used to generate structured linguistic descriptions of the visual scene, bridging perception and language within a shared neural substrate. On the CLEVRER benchmark, SRNN achieves competitive performance, thereby confirming its capability to represent essential spatiotemporal relations from the visual stream. Cognitive ablation analysis further reveals a benchmark bias, outlining a path for a more holistic evaluation. Finally, the white-box nature of SRNN enables precise pinpointing of error root causes. Our work provides a proof-of-concept that confirms the viability of translating key principles of biological intelligence into engineered systems for intuitive physics understanding in constrained environments.

Paper Structure

This paper contains 19 sections, 1 equation, 13 figures, 6 tables.

Figures (13)

  • Figure 1: (a) Semantic Nature Design. The core of this design is #action, which connects eight semantic roles. (b) Spatial Nature Design. The spatial neurons represent spatial relations between objects, in addition to their intrinsic states. For #change_direction, we add #relation_attr neurons (e.g. #right, #left, #back) to indicate its attributes. For other relations, #relation_attr does not exist. (c) Connections between spatial neurons and semantic neurons. The neuron #wernicke allows the semantic neurons to be activated when language is necessary. All neurons are depicted in blue to indicate their inactive state.
  • Figure 2: The Fire-and-Wire Mechanism (left) and the Language Generation Module (right) in How Pathway. Left: Visual perceptions activate a relational neuron #relation along with creation and activation of entity-instance neurons ins_entity_id1 and ins_entity_id2. Then #relation triggers the creation and activation of an action-stamp neuron stamp_action_id, which binds ins_entity_id1 and ins_entity_id2. If relational attributes exist for this relation, the corresponding #relation_attr1 is activated by #relation. A concept-instance neuron ins_concept_id1 is pointed to by #relation_attr1 and stamp_action_id. Other neurons below the activation threshold remain inactive (shown in blue). Right: The stamp_action_id neuron triggers ins_action_id, which acts as the starting point of the semantic network. Connections are formed between ins_action_id and its lexical neuron _action_name, as well as with #action. Meanwhile, #relation propagates signals to #action and #semantic_roles via predefined neural pathways in Nature Design. Joint signals from ins_action_id and #relation activate #action, which in turn sends signals to all the semantic-role neurons. Only #semantic_role1 and #semantic_role2 are activated upon receiving sufficient signals. Finally, stamp_subaction_id1 wires #semantic_role1 and ins_entity_id1, and stamp_subaction_id2 binds #semantic_role2 and ins_entity_id2.
  • Figure 3: The Fire-and-Wire Mechanism in Temporal Binding. The temporal neuron stamp_time_id1 marks the origin of the timeline and points to stamp_time_id2. All action-stamp neurons present during this temporal window are associated with stamp_time_id1. As time progresses, new temporal neurons are generated, activated, and linked sequentially to form an ordered chain.
  • Figure 4: The Fire-and-Wire Mechanism and the Language Generation Module in What Pathway. The neuron ins_entity_id initiates the creation and activation of stamp_entity_id. Simultaneously, concept-instance neurons are created and activated to encode the entity's attributes, which are then bound together by stamp_entity_id. These instance neurons, in turn, activate corresponding lexical neurons (indicated by yellow dashed lines), denoted as _ins_name.
  • Figure 5: Human-Driven Parameter Tuning Loop. In the list notation, entities $x$ and $y$ participate in relation $R$ at time slot $t$.
  • ...and 8 more figures
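
The binding steps described in the captions of Figures 2 and 3 can be pictured with a small data-structure sketch. This is only an illustration under stated assumptions, not the paper's implementation: the class names `ActionStamp`, `TimeStamp`, and `Timeline`, the methods `fire`, `advance`, and `describe`, and the example relation name `collide` are all hypothetical. It shows one plausible reading of the captions: an action-stamp record binds two entity instances under a relation, each action stamp is associated with the temporal neuron of the current window, and temporal neurons are linked sequentially into an ordered chain.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ActionStamp:
    """Hypothetical stand-in for a stamp_action_id neuron:
    binds two entity instances under one relation."""
    relation: str
    entity_a: str
    entity_b: str


@dataclass
class TimeStamp:
    """Hypothetical stand-in for a stamp_time_id neuron:
    holds the action stamps of one temporal window and
    points to the next temporal neuron."""
    index: int
    actions: List[ActionStamp] = field(default_factory=list)
    next: Optional["TimeStamp"] = None


class Timeline:
    """Ordered chain of temporal neurons (assumed structure)."""

    def __init__(self) -> None:
        # The first temporal neuron marks the origin of the timeline.
        self.origin = TimeStamp(index=1)
        self.current = self.origin

    def fire(self, relation: str, entity_a: str, entity_b: str) -> ActionStamp:
        # A relation firing creates an action stamp that binds both
        # entities and is associated with the current time window.
        stamp = ActionStamp(relation, entity_a, entity_b)
        self.current.actions.append(stamp)
        return stamp

    def advance(self) -> TimeStamp:
        # As time progresses, create a new temporal neuron and link it
        # to the previous one.
        new = TimeStamp(index=self.current.index + 1)
        self.current.next = new
        self.current = new
        return new

    def describe(self) -> List[Tuple[int, str, str, str]]:
        # Walk the chain from the origin and list bound relations in order.
        out = []
        node = self.origin
        while node is not None:
            for a in node.actions:
                out.append((node.index, a.relation, a.entity_a, a.entity_b))
            node = node.next
        return out


timeline = Timeline()
timeline.fire("collide", "ins_entity_1", "ins_entity_2")
timeline.advance()
timeline.fire("collide", "ins_entity_2", "ins_entity_3")
print(timeline.describe())
```

Walking the chain from the origin recovers the bound relations in temporal order. Activation thresholds, Hebbian weight updates, and the semantic-role wiring of the Language Generation Module are deliberately left out of this sketch.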