Object-Aware Query Perturbation for Cross-Modal Image-Text Retrieval

Naoya Sogi; Takashi Shibata; Makoto Terao

TL;DR

This work proposes a cross-modal image-text retrieval framework based on "object-aware query perturbation," which generates a key feature subspace of the detected objects and perturbs the corresponding queries using this subspace to improve object awareness in the image.

Abstract

Pre-trained vision and language (V&L) models have substantially improved the performance of cross-modal image-text retrieval. In general, however, V&L models retrieve poorly when the relevant objects are small, because the alignment between words and small objects in the image is coarse. Human cognition, in contrast, is known to be object-centric: we pay more attention to important objects, even if they are small. To bridge this gap between human cognition and the capability of V&L models, we propose a cross-modal image-text retrieval framework based on "object-aware query perturbation." The proposed method generates a key feature subspace of the detected objects and perturbs the corresponding queries using this subspace to improve object awareness in the image. The method enables object-aware cross-modal image-text retrieval while keeping the rich expressive power and retrieval performance of existing V&L models, without additional fine-tuning. Comprehensive experiments on four public datasets show that our method outperforms conventional algorithms. Our code is publicly available at https://github.com/NEC-N-SOGI/query-perturbation.
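
The abstract's core step (build a subspace from the key features of a detected object, then perturb the queries with it) can be illustrated with a small NumPy sketch. This is only one plausible reading under stated assumptions, not the paper's formulation: the function names `object_subspace` and `perturb_queries`, the use of an uncentered SVD to obtain the basis, the subspace dimension `dim`, and the scaling factor `alpha` are all hypothetical choices made for illustration.

```python
import numpy as np


def object_subspace(keys, box_mask, dim=2):
    """Return an orthonormal basis (d x dim) for the keys inside a detected box.

    Assumption: the subspace is spanned by the top right-singular vectors
    of the (uncentered) object key features.
    """
    obj_keys = keys[box_mask]  # (n_obj, d) keys of tokens inside the box
    _, _, vt = np.linalg.svd(obj_keys, full_matrices=False)
    return vt[:dim].T  # (d, dim), columns are orthonormal


def perturb_queries(queries, basis, alpha=1.0):
    """Add to each query its projection onto the object subspace, scaled by alpha.

    Assumption: the perturbation is additive; components orthogonal to the
    subspace are left unchanged.
    """
    projection = queries @ basis @ basis.T  # (n_queries, d)
    return queries + alpha * projection


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    keys = rng.normal(size=(16, 8))      # 16 image tokens, feature dim 8
    box_mask = np.zeros(16, dtype=bool)
    box_mask[:4] = True                  # pretend tokens 0..3 fall inside the box
    queries = rng.normal(size=(3, 8))    # 3 queries

    basis = object_subspace(keys, box_mask, dim=2)
    perturbed = perturb_queries(queries, basis, alpha=0.5)
    print(perturbed.shape)
```

In this sketch the perturbed queries are amplified only along directions spanned by the object's keys, so their dot products with those keys grow while the remaining query components stay as they were. That matches the abstract's stated goal of raising object awareness without fine-tuning the underlying model, though the actual module may differ in detail.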

Paper Structure (36 sections, 5 equations, 13 figures, 15 tables)

Introduction
Related works
Performance Degradation Induced by Small Objects
Method
Overview
Basic Idea: Object-Aware Query Perturbation
Q-Perturbation Module for Single Objects
Extension to Multiple Objects
Beyond the Q-Perturbation Module for Q-Former
Extension to other pre-trained V&L models.
Other Tasks with Our Q-Perturbation.
Experiments
Settings
Datasets and Experimental Protocols.
Evaluation Metrics.
...and 21 more sections

Figures (13)

Figure 1: Example results of our method with BLIP2. In BLIP2, the matching between the target objects and the input text is weak because the objects are small, resulting in incorrect retrieval results.
Figure 2: Performance degradation induced by small objects.
Figure 3: Overview of the proposed framework. The proposed framework constructs an object-aware cross-modal projector by incorporating localization cues from object detection into the existing cross-modal projector.
Figure 4: Q-Former.
Figure 5: Q-Former with the proposed Q-Perturbation.
...and 8 more figures
