SCENIR: Visual Semantic Clarity through Unsupervised Scene Graph Retrieval

Nikolaos Chaidos, Angeliki Dimitriou, Maria Lymperaiou, Giorgos Stamou

TL;DR

SCENIR addresses the semantic gap in image-to-image retrieval by moving from low-level visual cues to structured semantic representations via scene graphs. It introduces an unsupervised Graph Autoencoder with a split GNN encoder, two 2-layer MLP decoders (one for edges, one for features), and adversarial regularization, which together produce effective graph embeddings without ground-truth similarity labels. Evaluation uses Graph Edit Distance (GED) as a deterministic ground truth and shows that SCENIR achieves state-of-the-art retrieval performance while maintaining linear-time scalability. The work also demonstrates extensibility to unannotated data and to counterfactual retrieval, highlighting its practical impact for scalable, semantically grounded image retrieval and explainable retrieval workflows.

Abstract

Despite the dominance of convolutional and transformer-based architectures in image-to-image retrieval, these models are prone to biases arising from low-level visual features, such as color. Recognizing the lack of semantic understanding as a key limitation, we propose a novel scene graph-based retrieval framework that emphasizes semantic content over superficial image characteristics. Prior approaches to scene graph retrieval predominantly rely on supervised Graph Neural Networks (GNNs), which require ground truth graph pairs derived from image captions. However, the inconsistency of caption-based supervision, stemming from variable text encodings, undermines retrieval reliability. To address these issues, we present SCENIR, a Graph Autoencoder-based unsupervised retrieval framework, which eliminates the dependence on labeled training data. Our model demonstrates superior performance across metrics and runtime efficiency, outperforming existing vision-based, multimodal, and supervised GNN approaches. We further advocate for Graph Edit Distance (GED) as a deterministic and robust ground truth measure for scene graph similarity, replacing the inconsistent caption-based alternatives for the first time in image-to-image retrieval evaluation. Finally, we validate the generalizability of our method by applying it to unannotated datasets via automated scene graph generation, while substantially advancing the state of the art in counterfactual image retrieval.
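
To make the idea of GED as a ranking ground truth concrete, the following is a minimal sketch of how a retrieved ranking could be scored with NDCG@k when the only supervision is a precomputed GED between the query graph and each candidate. It is an illustration under stated assumptions, not the authors' evaluation code: the function name `ndcg_at_k`, the candidate ids, the GED values, and the relevance mapping `1 / (1 + GED)` are all hypothetical choices made for this example.

```python
import math


def ndcg_at_k(ranked_ids, ged_to_query, k=5):
    """Score a ranking with NDCG@k, using GED-derived relevance.

    ranked_ids:    candidate ids in the order the retrieval model returned them.
    ged_to_query:  dict mapping candidate id -> precomputed GED to the query graph.
    Assumption (not from the paper): relevance = 1 / (1 + GED), so a lower
    edit distance yields a higher relevance.
    """
    def relevance(candidate_id):
        return 1.0 / (1.0 + ged_to_query[candidate_id])

    def dcg(ids):
        # Standard discounted cumulative gain over the first k positions.
        return sum(
            relevance(cid) / math.log2(position + 2)
            for position, cid in enumerate(ids[:k])
        )

    # The ideal ranking sorts candidates by ascending GED.
    ideal_ids = sorted(ged_to_query, key=ged_to_query.get)
    ideal_dcg = dcg(ideal_ids)
    return dcg(ranked_ids) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical GED values between one query graph and four candidates.
example_ged = {"a": 0, "b": 2, "c": 5, "d": 1}
perfect_score = ndcg_at_k(["a", "d", "b", "c"], example_ged, k=3)
reversed_score = ndcg_at_k(["c", "b", "d", "a"], example_ged, k=3)
```

Because GED is computed deterministically from the two graphs, the ideal ranking is fixed for a given query, whereas a caption-based ground truth can change with the text encoder used. A ranking that matches the ascending-GED order scores 1, and any ranking that places more distant graphs first scores lower.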

Paper Structure

This paper contains 34 sections, 4 equations, 13 figures, 5 tables.

Figures (13)

  • Figure 1: Top: Example of visual biases (color bias) in image retrieval when using visual models vs ours. Bottom: Example of the retrieval variability of SBERT models (top-1 retrieved items).
  • Figure 2: Agreement between top-1 retrieved items with various SBERT models (MPNet: all-mpnet-base-v2, RoBERTa: all-distilroberta-v1, MiniLM: all-minilm-l6-v2) for caption retrieval.
  • Figure 3: Overall Scene Graph Retrieval pipeline: training (top) and inference (bottom), with scene graphs linked to images in the dataset. The architecture of the proposed SCENIR model is depicted. The only loss term that does not originate from the Discriminator or the Decoder modules is $\mathcal{L}_{KL}$, the variational regularization term, which is applied directly to the encoder output.
  • Figure 4: Maximum path lengths and mean values for graph metrics, for the preprocessed PSG graphs.
  • Figure 5: NDCG@5 score for different number of GNN layers.
  • ...and 8 more figures