Image Segmentation: Inducing graph-based learning
Aryan Singh, Pepijn Van de Ven, Ciarán Eising, Patrick Denny
TL;DR
Addresses semantic segmentation across diverse image domains including natural scenes, fisheye driving imagery, and dermoscopic images, highlighting CNN limitations with geometric distortions. Proposes UNet-GNN, a hybrid encoder-decoder architecture with a Graph Neural Network bottleneck that builds a graph from the deepest encoder features, connects nodes to their $k$-nearest neighbors in a warped coordinate space using relative positional encoding, and updates features via graph convolution, e.g., $P'_{xy} = P_{xy} + R_{xy}$ and $h_i^{(t)} = abla\sigma\left(\sum_{j \in N(v_i)} W\left(h_j^{(t-1)}\right) + b\right)$. Evaluated on PascalVOC, WoodScape, and ISIC2016, the method shows robustness to distortions and improves over CNN baselines, with notable gains on WoodScape and better boundary delineation on ISIC2016. The work demonstrates the value of explicit relational modeling for segmentation in challenging real-world conditions and discusses future work with the Generalized Wasserstein Dice Loss to better encode inter-class relationships.
Abstract
This study explores the potential of graph neural networks (GNNs) to enhance semantic segmentation across diverse image modalities. We evaluate the effectiveness of a novel GNN-based U-Net architecture on three distinct datasets: PascalVOC, a standard benchmark for natural image segmentation, WoodScape, a challenging dataset of fisheye images commonly used in autonomous driving, introducing significant geometric distortions; and ISIC2016, a dataset of dermoscopic images for skin lesion segmentation. We compare our proposed UNet-GNN model against established convolutional neural networks (CNNs) based segmentation models, including U-Net and U-Net++, as well as the transformer-based SwinUNet. Unlike these methods, which primarily rely on local convolutional operations or global self-attention, GNNs explicitly model relationships between image regions by constructing and operating on a graph representation of the image features. This approach allows the model to capture long-range dependencies and complex spatial relationships, which we hypothesize will be particularly beneficial for handling geometric distortions present in fisheye imagery and capturing intricate boundaries in medical images. Our analysis demonstrates the versatility of GNNs in addressing diverse segmentation challenges and highlights their potential to improve segmentation accuracy in various applications, including autonomous driving and medical image analysis.
