Through the PRISm: Importance-Aware Scene Graphs for Image Retrieval

Dimitrios Georgoulopoulos, Nikolaos Chaidos, Angeliki Dimitriou, Giorgos Stamou

TL;DR

The paper tackles semantic image retrieval by using structured scene graphs to model objects and their relations. It introduces PRISm, which combines an Importance Prediction Module that prunes scene graphs with an Edge-Aware Contextual GNN that fuses relational structure with global visual features. Caption-grounded supervision and a weighted regression objective guide the model toward semantically meaningful embeddings. Experiments on PSG and Flickr30k show consistent improvements over graph-based, vision-only, and vision-language baselines, supporting the importance of relational reasoning.

Abstract

Accurately retrieving images that are semantically similar remains a fundamental challenge in computer vision, as traditional methods often fail to capture the relational and contextual nuances of a scene. We introduce PRISm (Pruning-based Image Retrieval via Importance Prediction on Semantic Graphs), a multimodal framework that advances image-to-image retrieval through two novel components. First, the Importance Prediction Module identifies and retains the most critical objects and relational triplets within an image while pruning irrelevant elements. Second, the Edge-Aware Graph Neural Network explicitly encodes relational structure and integrates global visual features to produce semantically informed image embeddings. PRISm achieves image retrieval that closely aligns with human perception by explicitly modeling the semantic importance of objects and their interactions, capabilities largely absent in prior approaches. Its architecture effectively combines relational reasoning with visual representation, enabling semantically grounded retrieval. Extensive experiments on benchmark and real-world datasets demonstrate consistently superior top-ranked performance, while qualitative analyses show that PRISm accurately captures key objects and interactions, producing interpretable and semantically meaningful results.

Paper Structure

This paper contains 19 sections, 9 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Example of top-2 image-to-image retrievals. PRISm (i) leverages scene graphs alongside visual features, (ii) prunes graphs to retain the most important objects and relations, and (iii) explicitly encodes object interactions, emphasizing relational structure.
  • Figure 2: Inference pipeline of the Importance Prediction Module. Image and scene graph embeddings $(z_I, z_G)$ are extracted along with visual embeddings of objects $\left(z^{vis}_{dog}, z^{vis}_{frisbee}, z^{vis}_{grass}\right)$ and textual embeddings of object and relation labels $\left(z^{text}_{dog}, z^{text}_{frisbee}, z^{text}_{grass}, z_{biting}\right)$. The embeddings are passed to the trained module, which predicts scores $\hat{s}(u)$ indicating which objects and triplets are retained.
  • Figure 3: Overall retrieval pipeline of PRISm. Similarity between images $I_1$ and $I_2$ is computed as the inner product of embeddings that combine projected global visual features with edge-aware multimodal scene graph representations.
  • Figure 4: Object and relation retention rates by graph size.
  • Figure 5: Retrieval examples from PRISm. (a) Top-3 retrieval comparisons against best-performing SotA methods (CLIP (VL), DINOv3 (Vision), and Hi-SIGIR (Scene-Graph-based)). (b) Additional PRISm retrievals illustrating detail-aware matching via the Edge-Aware GNN, as well as the effects of importance-based pruning and the synergy between graph semantics and visual cues.
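
The two steps described in the captions of Figures 2 and 3 (pruning by predicted scores $\hat{s}(u)$, then scoring image pairs by an inner product of combined embeddings) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the fixed threshold of 0.5, and the use of concatenation followed by L2 normalisation to combine the visual and graph embeddings are all assumptions made for illustration, since this summary does not specify how retention is decided or how the two representations are fused.

```python
import math


def prune_by_importance(scores, threshold=0.5):
    """Keep the objects/triplets whose predicted score meets the threshold.

    `scores` maps each graph element u to a predicted score s_hat(u).
    The threshold value is an assumption, not taken from the paper.
    """
    return [u for u, s in scores.items() if s >= threshold]


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def fuse(z_visual, z_graph):
    """Hypothetical fusion: concatenate both embeddings, then normalise."""
    return l2_normalize(list(z_visual) + list(z_graph))


def similarity(e1, e2):
    """Inner product of two image embeddings, as described for Figure 3."""
    return sum(a * b for a, b in zip(e1, e2))


# Toy scores for the dog/frisbee/grass example used in Figure 2.
scores = {
    "dog": 0.9,
    "frisbee": 0.8,
    "grass": 0.2,
    ("dog", "biting", "frisbee"): 0.95,
}
kept = prune_by_importance(scores)

# Toy embeddings for two images; the values are made up.
e1 = fuse([0.2, 0.7], [0.5, 0.1])
e2 = fuse([0.1, 0.8], [0.4, 0.2])
print(kept, similarity(e1, e2))
```

With normalised embeddings the inner product equals cosine similarity, so retrieval reduces to ranking gallery images by this score against the query.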