Mending of Spatio-Temporal Dependencies in Block Adjacency Matrix
Osama Ahmad, Omer Abdul Jalil, Usman Nazir, Murtaza Taj
TL;DR
This work addresses a limitation of the Block Adjacency Matrix for spatio-temporal graphs by introducing STBAM-GNN, an end-to-end architecture that mends the temporal connections missing from the block adjacency approach. A transformer-based encoder augments the block adjacency matrix to produce a connected, learnable spatio-temporal graph, while a GAT-based GNN learns joint spatio-temporal representations for downstream tasks. The method achieves state-of-the-art performance on the C2D2 dataset and competitive results on SurgVisDom with far fewer parameters than competing CLIP and 3D-CNN baselines. Spectral analyses support the approach by showing increased connectivity: the graph Laplacian has fewer zero eigenvalues and higher Fiedler values. The approach offers a computationally efficient route to robust spatio-temporal graph learning that is suitable for online inference and for broader domains.
Abstract
In applications where data evolve across both spatial and temporal dimensions, Graph Neural Networks (GNNs) are often complemented by sequence modeling architectures, such as RNNs and transformers, to model temporal changes. These hybrid models typically arrange the spatial and temporal learning components in series. A pioneering effort to model spatio-temporal dependencies jointly using only GNNs was the Block Adjacency Matrix \(\mathbf{A_B}\) \cite{1}, constructed by diagonally concatenating the adjacency matrices of the graphs at different time steps. This yields a single graph that encompasses the complete spatio-temporal data; however, the graphs from different time steps remain disconnected, which limits GNN message passing to spatially connected nodes. To address this limitation, we propose a novel end-to-end learning architecture designed to mend the temporal dependencies, resulting in a well-connected graph. We thus provide a framework for the learnable representation of spatio-temporal data as graphs. Our method demonstrates superior performance on benchmark datasets such as SurgVisDom and C2D2, surpassing existing state-of-the-art graph models in accuracy. Our model also has significantly lower computational complexity, with far fewer parameters than methods that rely on CLIP and 3D CNN architectures.
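The disconnection problem described above can be illustrated with a minimal sketch. It builds a block adjacency matrix from three toy per-time-step graphs and counts connected components, which equals the number of zero eigenvalues of the graph Laplacian. The helper names (`block_adjacency`, `count_components`, `add_temporal_edges`) and the toy path graphs are assumptions made for illustration. In particular, the fixed rule linking each node to its copy at the next time step is only a stand-in: the proposed architecture learns the temporal connections with a transformer-based encoder rather than fixing them by hand.

```python
from collections import deque


def block_adjacency(adjs):
    """Place each per-time-step adjacency matrix on the diagonal of one matrix."""
    total = sum(len(a) for a in adjs)
    big = [[0] * total for _ in range(total)]
    offset = 0
    for a in adjs:
        n = len(a)
        for i in range(n):
            for j in range(n):
                big[offset + i][offset + j] = a[i][j]
        offset += n
    return big


def count_components(adj):
    """Count connected components with BFS.

    For an undirected graph this equals the number of zero eigenvalues
    of the graph Laplacian.
    """
    n = len(adj)
    seen = [False] * n
    components = 0
    for start in range(n):
        if seen[start]:
            continue
        components += 1
        seen[start] = True
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in range(n):
                if adj[u][v] and not seen[v]:
                    seen[v] = True
                    queue.append(v)
    return components


def add_temporal_edges(adj, nodes_per_step, steps):
    """Hypothetical fixed rule: link node i at step t to node i at step t+1."""
    mended = [row[:] for row in adj]  # copy so the input stays unchanged
    for t in range(steps - 1):
        for i in range(nodes_per_step):
            u = t * nodes_per_step + i
            v = (t + 1) * nodes_per_step + i
            mended[u][v] = mended[v][u] = 1
    return mended


# Toy data: three time steps, each a 3-node path graph 0-1-2.
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
a_b = block_adjacency([path, path, path])
mended = add_temporal_edges(a_b, nodes_per_step=3, steps=3)

print(count_components(a_b))     # 3: one component per time step
print(count_components(mended))  # 1: a single connected graph
```

In the unmended matrix, message passing can never move information between time steps because no path crosses a block boundary. Once temporal edges are present, every node is reachable from every other node, which is the property that the spectral analysis measures.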
