3SHNet: Boosting Image-Sentence Retrieval via Visual Semantic-Spatial Self-Highlighting

Xuri Ge; Songpei Xu; Fuhai Chen; Jie Wang; Guoxin Wang; Shan An; Joemon M. Jose

TL;DR

This work tackles image-sentence retrieval by bridging the semantic gap with a modality-independent framework that preserves efficient cross-modal alignment. It introduces 3SHNet, which employs segmentation-guided Visual-Semantic Modelling (VSeM) and Visual-Spatial Modelling (VSpM) to self-highlight semantic and spatial saliencies, fused through a Generalized Pooling Operator into a $1024$-dimensional joint embedding. Training uses a bidirectional triplet ranking loss with hard negative mining and a margin of $\gamma=0.2$, achieving state-of-the-art results on MS-COCO and Flickr30K across region-, grid-, and hybrid-level features, while demonstrating strong cross-dataset generalization and improved inference speed via modality independence. The approach reduces reliance on text-driven visual guidance and highlights human-like attention mechanisms through segmentation, offering practical benefits for real-world multi-modal retrieval systems and scalable deployment.
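The bidirectional triplet ranking loss with hard negative mining mentioned above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names, the use of cosine similarity, the choice of the hardest in-batch negative, and the sum reduction are all assumptions; only the margin of 0.2 and the 1024-dimensional embedding size come from the summary.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def hard_negative_triplet_loss(img_emb, txt_emb, margin=0.2):
    """Bidirectional hinge loss using the hardest in-batch negative.

    img_emb, txt_emb: arrays of shape (B, D); row i of each is a matching pair.
    Assumption: cosine similarity and sum reduction over the batch.
    """
    sim = l2_normalize(img_emb) @ l2_normalize(txt_emb).T  # (B, B) similarities
    pos = np.diag(sim)                                     # matching pairs
    neg = sim.copy()
    np.fill_diagonal(neg, -np.inf)                         # exclude positives
    hardest_txt = neg.max(axis=1)  # hardest negative sentence per image
    hardest_img = neg.max(axis=0)  # hardest negative image per sentence
    loss_i2t = np.maximum(0.0, margin - pos + hardest_txt)
    loss_t2i = np.maximum(0.0, margin - pos + hardest_img)
    return float((loss_i2t + loss_t2i).sum())

# Usage: four perfectly separated pairs in a 1024-dimensional space.
emb = np.eye(4, 1024)
print(hard_negative_triplet_loss(emb, emb))
```

With perfectly separated pairs every hinge term is inactive, so the loss is zero; when all embeddings collapse to the same vector, each of the 2B terms contributes the full margin.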

Abstract

In this paper, we propose a novel visual Semantic-Spatial Self-Highlighting Network (termed 3SHNet) for high-precision, high-efficiency and high-generalization image-sentence retrieval. 3SHNet highlights the salient identification of prominent objects and their spatial locations within the visual modality, thus allowing the integration of visual semantic-spatial interactions while maintaining independence between the two modalities. This integration effectively combines object regions with the corresponding semantic and position layouts derived from segmentation to enhance the visual representation, and the modality independence guarantees efficiency and generalization. Additionally, 3SHNet utilizes the structured contextual visual scene information from segmentation to provide local (region-based) or global (grid-based) guidance and achieve accurate hybrid-level retrieval. Extensive experiments conducted on the MS-COCO and Flickr30K benchmarks substantiate the superior performance, inference efficiency and generalization of the proposed 3SHNet compared with contemporary state-of-the-art methods. Specifically, on the larger MS-COCO 5K test set, we achieve 16.3%, 24.8%, and 18.3% improvements in terms of rSum score, respectively, compared with the state-of-the-art methods using different image representations, while maintaining optimal retrieval efficiency. Moreover, our performance on cross-dataset generalization improves by 18.6%. Data and code are available at https://github.com/XuriGe1995/3SHNet.
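For readers unfamiliar with the rSum score cited above, the sketch below computes Recall@K and rSum from a similarity matrix. It assumes the definition common in the retrieval literature (the sum of R@1, R@5 and R@10 over both retrieval directions) and a simplified one-to-one pairing in which query i matches candidate i; the paper's exact evaluation protocol is not given here, and the function names are hypothetical.

```python
import numpy as np

def recall_at_k(sim, ks=(1, 5, 10)):
    """Recall@K in percent for a (N, N) query-by-candidate similarity matrix.

    Assumption: the correct candidate for query i is candidate i.
    Ties are resolved optimistically (only strictly higher scores outrank).
    """
    gt = np.diag(sim)[:, None]            # score of the correct candidate
    ranks = (sim > gt).sum(axis=1)        # 0 means the match is ranked first
    return {k: 100.0 * float((ranks < k).mean()) for k in ks}

def rsum(sim):
    """Sum of R@1, R@5, R@10 over both retrieval directions."""
    i2t = recall_at_k(sim)      # rows as queries
    t2i = recall_at_k(sim.T)    # columns as queries
    return sum(i2t.values()) + sum(t2i.values())

# Usage: a perfect retrieval system reaches the maximum of 600.
print(rsum(np.eye(20)))
```

Under this definition rSum ranges from 0 to 600, so it summarizes all six recall values in a single number.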

Paper Structure (30 sections, 8 equations, 9 figures, 8 tables)

Introduction
Related Work
Approach
Visual-Textual Feature Extractors
Visual Semantic-Spatial Multimodal Modelling
Visual-semantic multimodal modelling.
Visual-spatial multimodal modelling.
Feature Aggregation and Objective Function
Experiments
Experiment Setup
Dataset.
Evaluation metrics.
Implementation details.
Quantitative Comparison
Quantitative comparison on MS-COCO.
...and 15 more sections

Figures (9)

Figure 1: Segmentation is combined with the mass object regions to highlight the prominent objects and their locations.
Figure 2: Illustration of the proposed 3SHNet. It mainly consists of visual-semantic modelling module (VSeM) and visual-spatial modelling module (VSpM), where the semantic feature and the position map of the segmentation are respectively imposed to guide the local- and global-level visual features in visual multimodal interactions.
Figure 3: Inference speed (Kpps [wang2022coder] means the number of image/sentence queries completed per second) and performance on MS-COCO 5K test set for image-text retrieval on a single GPU (upper right is better).
Figure 4: Comparisons of our proposed 3SHNet with different activation functions (Sigmoid [mcculloch1943logical] vs. Softmax [chorowski2015attention]) in visual-semantic modelling. $R@$ is the abbreviation of $Recall@$.
Figure 5: Salient region-level (on the left) and grid-level (on the right) object visualizations from visual-semantic multimodal modelling (VSeM) guided by segmentations on the MS-COCO dataset. Each visualization contains an image showing the original object outcome, its segmentation outcome and the corresponding VSeM outcome, and two random matching sentences with object highlights. The greater the salience of objects, the greater the transparency (best viewed in color).
...and 4 more figures
