TGraphX: Tensor-Aware Graph Neural Network for Multi-Dimensional Feature Learning

Arash Sajjadi, Mark Eramian

TL;DR

TGraphX introduces a spatially aware graph neural network that preserves full 2D spatial context by representing each image patch as a multi-dimensional node $X_i \in \mathbb{R}^{C \times H \times W}$. Nodes communicate via Conv$_{1\times1}$ message passing on concatenated features, and aggregated messages are refined by a deep CNN aggregator with residual connections, enabling end-to-end differentiability. The framework is validated in a car-detection setting by fusing two detectors (YOLOv11 and RetinaNet) through per-car detection graphs, achieving higher average IoU on test data and demonstrating robust performance with limited data and no augmentation. The paper emphasizes modularity, efficiency, and extensibility, arguing that maintaining spatial fidelity during graph-based reasoning yields improved localization and ensemble reasoning capabilities. Overall, TGraphX provides a unified, end-to-end approach that combines local spatial detail with global relational context for structured visual reasoning tasks.
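The evaluation metric named above, average IoU, can be illustrated with a short sketch. The box format `(x1, y1, x2, y2)` and the helper names `box_iou` and `average_iou` are assumptions made for this example; the source does not specify how the authors compute the metric.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2) (assumed format)."""
    # Corners of the intersection rectangle.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Width and height are clamped at zero so disjoint boxes give no overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def average_iou(predicted, ground_truth):
    """Mean IoU over already-matched (prediction, ground truth) box pairs."""
    scores = [box_iou(p, t) for p, t in zip(predicted, ground_truth)]
    return sum(scores) / len(scores)


# Two 2x2 boxes overlapping in a 1x1 square: intersection 1, union 7.
example = box_iou((0, 0, 2, 2), (1, 1, 3, 3))
```

A refined box that overlaps the ground truth more tightly than either detector's raw box raises this score, which is the sense in which fusion is reported to improve localization.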

Abstract

TGraphX presents a novel paradigm in deep learning by unifying convolutional neural networks (CNNs) with graph neural networks (GNNs) to enhance visual reasoning tasks. Traditional CNNs excel at extracting rich spatial features from images but lack the inherent capability to model inter-object relationships. Conversely, conventional GNNs typically rely on flattened node features, thereby discarding vital spatial details. TGraphX overcomes these limitations by employing CNNs to generate multi-dimensional node features (e.g., $3\times128\times128$ tensors) that preserve local spatial semantics. These spatially aware nodes participate in a graph where message passing is performed using $1\times1$ convolutions, which fuse adjacent features while maintaining their structure. Furthermore, a deep CNN aggregator with residual connections is used to robustly refine the fused messages, ensuring stable gradient flow and end-to-end trainability. Our approach not only bridges the gap between spatial feature extraction and relational reasoning but also demonstrates significant improvements in object detection refinement and ensemble reasoning.
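The message-passing idea described in the abstract can be sketched in a few lines of NumPy. This is a simplified illustration, not the authors' implementation: the function names, the sum aggregation, and the single-layer stand-in for the deep CNN aggregator are all assumptions. What it does show is that a convolution with a one-pixel kernel is a per-pixel linear map across channels, so every node keeps its (C, H, W) shape throughout.

```python
import numpy as np


def conv1x1(x, weight, bias):
    """One-pixel-kernel convolution: x is (C_in, H, W), weight is (C_out, C_in)."""
    # Mix channels independently at every pixel; H and W are left untouched.
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]


def message_passing_step(nodes, edges, w_msg, b_msg, w_agg, b_agg):
    """One simplified layer (hypothetical helper, not the paper's code).

    nodes: (N, C, H, W) array of node feature maps
    edges: list of (src, dst) pairs; messages flow src -> dst
    """
    aggregated = np.zeros_like(nodes)
    for src, dst in edges:
        # Concatenate receiver and sender maps along channels: (2C, H, W).
        pair = np.concatenate([nodes[dst], nodes[src]], axis=0)
        # Sum aggregation is an assumption made for this sketch.
        aggregated[dst] += conv1x1(pair, w_msg, b_msg)
    # Stand-in for the deep CNN aggregator: one channel-mixing layer plus ReLU.
    mixed = np.einsum("oc,nchw->nohw", w_agg, aggregated) + b_agg[None, :, None, None]
    refined = np.maximum(mixed, 0.0)
    # Residual connection: the refinement is added to the original features.
    return nodes + refined


rng = np.random.default_rng(0)
N, C, H, W = 3, 4, 8, 8
nodes = rng.standard_normal((N, C, H, W))
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
w_msg = 0.1 * rng.standard_normal((C, 2 * C))
b_msg = np.zeros(C)
w_agg = 0.1 * rng.standard_normal((C, C))
b_agg = np.zeros(C)

out = message_passing_step(nodes, edges, w_msg, b_msg, w_agg, b_agg)
print(out.shape)  # (3, 4, 8, 8)
```

Because no step flattens the feature maps, the output has the same shape as the input, and layers of this kind can be stacked. The residual path also means a node with no incoming messages passes through unchanged when the aggregator bias is zero.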

Paper Structure

This paper contains 42 sections, 21 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Detailed flowchart of the TGraphX pipeline. Input Stage: A full image is divided into patches. Pre‑Encoder Stage: A decision determines whether to process patches with a PreEncoder (enriching features) or bypass it. CNN Encoder Stage: The selected patches are processed by a CNN Encoder, incorporating skip connections, dropout, batch normalization, and residual connections to generate spatial feature maps. Graph Construction Stage: These feature maps form graph nodes with edges based on patch proximity. GNN Layers Stage: A stack of ConvMessagePassing and DeepCNNAggregator layers (with dropout and residual skips) refines the node features. Pooling & Classification Stage: Spatial pooling reduces the refined features to vectors, which are then classified by a linear layer. An optional direct skip from the CNN output to the classifier is also included.
  • Figure 2: Detection graph schematic for a single car detection. When both detectors fire, nodes corresponding to YOLOv11 and RetinaNet detections are connected to a union node that aggregates spatial information. Each node holds a feature map of dimensions 3$\times$128$\times$128.
  • Figure 3: Training performance visualization across 50 epochs, composed of four panels arranged in a 1$\times$4 layout.
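
The input and graph-construction stages described in the Figure 1 caption can be sketched as follows. This is an illustration under stated assumptions: the caption says only that edges are "based on patch proximity", so the non-overlapping tiling, the 4-neighbour rule, and the helper names `split_into_patches` and `grid_edges` are hypothetical choices made for this example.

```python
import numpy as np


def split_into_patches(image, patch):
    """Split a (C, H, W) image into non-overlapping (C, patch, patch) tiles."""
    c, h, w = image.shape
    rows, cols = h // patch, w // patch
    # Crop any remainder so the image tiles evenly (an assumption of this sketch).
    image = image[:, : rows * patch, : cols * patch]
    # Separate the row/column tile indices from the within-tile pixel indices.
    tiles = image.reshape(c, rows, patch, cols, patch).transpose(1, 3, 0, 2, 4)
    return tiles.reshape(rows * cols, c, patch, patch), (rows, cols)


def grid_edges(rows, cols):
    """Directed edges between horizontally and vertically adjacent patches."""
    edges = []
    for r in range(rows):
        for col in range(cols):
            i = r * cols + col
            if col + 1 < cols:  # right-hand neighbour, both directions
                edges += [(i, i + 1), (i + 1, i)]
            if r + 1 < rows:  # neighbour below, both directions
                edges += [(i, i + cols), (i + cols, i)]
    return edges


image = np.arange(3 * 4 * 6).reshape(3, 4, 6)
patches, (rows, cols) = split_into_patches(image, patch=2)
edges = grid_edges(rows, cols)
print(patches.shape, len(edges))  # (6, 3, 2, 2) 14
```

Each row of `patches` is one node that still carries a full spatial map, and `edges` lists which nodes exchange messages. Other proximity rules, such as 8-neighbour connectivity or a distance threshold, would only change `grid_edges`.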