TGraphX: Tensor-Aware Graph Neural Network for Multi-Dimensional Feature Learning
Arash Sajjadi, Mark Eramian
TL;DR
TGraphX introduces a spatially aware graph neural network that preserves full 2D spatial context by representing each image patch as a multi-dimensional node $X_i \in \mathbb{R}^{C \times H \times W}$. Nodes communicate via Conv$_{1\times1}$ message passing on concatenated features, and aggregated messages are refined by a deep CNN aggregator with residual connections, enabling end-to-end differentiability. The framework is validated in a car-detection setting by fusing two detectors (YOLOv11 and RetinaNet) through per-car detection graphs, achieving higher average IoU on test data and demonstrating robust performance with limited data and no augmentation. The paper emphasizes modularity, efficiency, and extensibility, arguing that maintaining spatial fidelity during graph-based reasoning yields improved localization and ensemble reasoning capabilities. Overall, TGraphX provides a unified, end-to-end approach that combines local spatial detail with global relational context for structured visual reasoning tasks.
Abstract
TGraphX presents a novel paradigm in deep learning by unifying convolutional neural networks (CNNs) with graph neural networks (GNNs) to enhance visual reasoning tasks. Traditional CNNs excel at extracting rich spatial features from images but lack the inherent capability to model inter-object relationships. Conversely, conventional GNNs typically rely on flattened node features, thereby discarding vital spatial details. TGraphX overcomes these limitations by employing CNNs to generate multi-dimensional node features (e.g., $3 \times 128 \times 128$ tensors) that preserve local spatial semantics. These spatially aware nodes participate in a graph where message passing is performed using $1 \times 1$ convolutions, which fuse the features of adjacent nodes while maintaining their spatial structure. Furthermore, a deep CNN aggregator with residual connections refines the fused messages, ensuring stable gradient flow and end-to-end trainability. Our approach not only bridges the gap between spatial feature extraction and relational reasoning but also demonstrates significant improvements in object detection refinement and ensemble reasoning.
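The message-passing idea described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes sum aggregation over incoming edges, uses a single residual $3 \times 3$ convolution block as a stand-in for the deep CNN aggregator, and adds the refined message back onto the node as an assumed update rule. The names `conv1x1`, `conv3x3`, and `message_passing_step`, the tensor sizes, and the random weights are all hypothetical and chosen only for illustration.

```python
import numpy as np

def conv1x1(x, weight, bias):
    """1x1 convolution: a per-pixel linear map over channels.
    x: (C_in, H, W), weight: (C_out, C_in), bias: (C_out,)."""
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]

def conv3x3(x, weight):
    """3x3 convolution with zero padding, so H and W are preserved.
    x: (C_in, H, W), weight: (C_out, C_in, 3, 3)."""
    _, height, width = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((weight.shape[0], height, width))
    for dy in range(3):
        for dx in range(3):
            window = padded[:, dy:dy + height, dx:dx + width]
            out += np.einsum("oc,chw->ohw", weight[:, :, dy, dx], window)
    return out

def message_passing_step(nodes, edges, w_msg, b_msg, w_agg):
    """One round of spatial message passing (illustrative assumptions only).
    nodes: (N, C, H, W); edges: list of (src, dst) index pairs."""
    aggregated = np.zeros_like(nodes)
    for src, dst in edges:
        # Concatenate receiver and sender features along the channel axis.
        pair = np.concatenate([nodes[dst], nodes[src]], axis=0)  # (2C, H, W)
        # The 1x1 convolution mixes channels but never mixes pixel positions.
        aggregated[dst] += conv1x1(pair, w_msg, b_msg)
    # Assumed stand-in for the deep aggregator: one residual conv + ReLU block.
    refined = np.stack(
        [a + np.maximum(conv3x3(a, w_agg), 0.0) for a in aggregated]
    )
    # Assumed node update: residual connection around the whole step.
    return nodes + refined

rng = np.random.default_rng(0)
num_nodes, channels, height, width = 3, 4, 8, 8
nodes = rng.standard_normal((num_nodes, channels, height, width))
edges = [(0, 1), (1, 2), (2, 0)]  # directed ring: each node has one sender
w_msg = 0.1 * rng.standard_normal((channels, 2 * channels))
b_msg = np.zeros(channels)
w_agg = 0.1 * rng.standard_normal((channels, channels, 3, 3))

updated = message_passing_step(nodes, edges, w_msg, b_msg, w_agg)
print(updated.shape)
```

Every node keeps its $C \times H \times W$ layout throughout the step, which is the property the abstract contrasts with GNNs that flatten node features into vectors.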
