Object-Aware Query Perturbation for Cross-Modal Image-Text Retrieval

Naoya Sogi, Takashi Shibata, Makoto Terao

TL;DR

This work proposes a cross-modal image-text retrieval framework based on "object-aware query perturbation," which generates a key feature subspace of the detected objects and perturbs the corresponding queries using this subspace to improve the object awareness in the image.

Abstract

Pre-trained vision and language (V&L) models have substantially improved the performance of cross-modal image-text retrieval. In general, however, V&L models have limited retrieval performance for small objects because the alignment between words and small objects in the image is coarse. In contrast, human cognition is known to be object-centric: we pay more attention to important objects, even if they are small. To bridge this gap between human cognition and the capability of V&L models, we propose a cross-modal image-text retrieval framework based on "object-aware query perturbation." The proposed method generates a key feature subspace of the detected objects and perturbs the corresponding queries using this subspace to improve the object awareness in the image. The proposed method enables object-aware cross-modal image-text retrieval while keeping the rich expressive power and retrieval performance of existing V&L models, without additional fine-tuning. Comprehensive experiments on four public datasets show that our method outperforms conventional algorithms. Our code is publicly available at https://github.com/NEC-N-SOGI/query-perturbation.
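To make the idea of "building a subspace from object key features and perturbing queries with it" more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names `object_key_subspace` and `perturb_queries`, the parameters `rank` and `alpha`, the use of an SVD to obtain the subspace, and the additive update rule are all hypothetical choices made for this example. The linked repository is the place to look for the actual method.

```python
import numpy as np


def object_key_subspace(keys: np.ndarray, rank: int) -> np.ndarray:
    """Return an orthonormal basis of shape (d, rank) for the object keys.

    `keys` has shape (n, d): one key feature per image token that falls
    inside a detected object region (assumed to be given).
    Assumption: the subspace is taken as the top right-singular vectors.
    """
    _, _, vt = np.linalg.svd(keys, full_matrices=False)
    return vt[:rank].T


def perturb_queries(queries: np.ndarray, basis: np.ndarray, alpha: float) -> np.ndarray:
    """Amplify the part of each query that lies in the object subspace.

    `queries` has shape (m, d). The component orthogonal to the subspace
    is left unchanged; the in-subspace component is scaled by (1 + alpha).
    Assumption: an additive update of the form q + alpha * P q.
    """
    projection = queries @ basis @ basis.T  # P q for every query row
    return queries + alpha * projection


# Toy example: object keys live in the plane spanned by the first two axes.
keys = np.array([[1.0, 0.0, 0.0],
                 [2.0, 0.0, 0.0],
                 [0.0, 3.0, 0.0]])
basis = object_key_subspace(keys, rank=2)
queries = np.array([[1.0, 1.0, 1.0]])
perturbed = perturb_queries(queries, basis, alpha=0.5)
```

In this toy case the first two coordinates of the query are amplified while the third, which is orthogonal to the object subspace, is left as it was. Setting `alpha` to zero returns the original queries, which mirrors the abstract's point that the underlying V&L model is kept as-is and needs no fine-tuning.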

Paper Structure (36 sections, 5 equations, 13 figures, 15 tables)

This paper contains 36 sections, 5 equations, 13 figures, 15 tables.

Figures (13)

  • Figure 1: Example results by our method with BLIP2. In BLIP2, the matching between the target objects and the input text is weak because the objects are small, resulting in incorrect retrieval results.
  • Figure 2: Performance degradation induced by small objects.
  • Figure 3: Overview of the proposed framework. The proposed framework constructs an object-aware cross-modal projector by incorporating localization cues from object detection into the existing cross-modal projector.
  • Figure 4: Q-Former.
  • Figure 5: Q-Former with the proposed Q-Perturbation.
  • ...and 8 more figures