Knowledge-aware Text-Image Retrieval for Remote Sensing Images

Li Mi, Xianjie Dai, Javiera Castillo-Navarro, Devis Tuia

TL;DR

Experimental results on three commonly used remote sensing text–image retrieval benchmarks show that the proposed knowledge-aware method leads to varied and consistent retrievals, outperforming state-of-the-art retrieval methods.

Abstract

Image-based retrieval in large Earth observation archives is challenging because one needs to navigate across thousands of candidate matches with only the query image as a guide. By using text as information supporting the visual query, the retrieval system gains in usability, but at the same time faces difficulties due to the diversity of visual signals that cannot be summarized by a short caption alone. For this reason, as a matching-based task, cross-modal text-image retrieval often suffers from information asymmetry between texts and images. To address this challenge, we propose a Knowledge-aware Text-Image Retrieval (KTIR) method for remote sensing images. By mining relevant information from an external knowledge graph, KTIR enriches the text scope available in the search query and alleviates the information gaps between texts and images for better matching. Moreover, by integrating domain-specific knowledge, KTIR also enhances the adaptation of pre-trained vision-language models to remote sensing applications. Experimental results on three commonly used remote sensing text-image retrieval benchmarks show that the proposed knowledge-aware method leads to varied and consistent retrievals, outperforming state-of-the-art retrieval methods.

Paper Structure (39 sections, 11 equations, 9 figures, 10 tables)

Figures (9)

  • Figure 1: Intuition behind the proposed KTIR system: in a standard text-image retrieval approach (a), text and images are matched directly, while in KTIR (b), commonsense knowledge is added from external sources (a knowledge base) to make the retrieval more varied, robust to ambiguities and consistent with general knowledge.
  • Figure 2: (a) The KTIR pipeline. The proposed text-image retrieval system comprises three main components: an image encoder, a knowledge-aware text encoder and a similarity measurement module. The image feature ($\mathbf{f}_{img}$) is obtained by a ViT (Dosovitskiy et al., 2020) ($\operatorname{ViT}$). A BERT model (Devlin et al., 2019) is used in text-only mode ($\operatorname{BERT}_{text}$, green mode) and in multimodal mode ($\operatorname{BERT}_{multi}$, yellow mode) to encode the knowledge-aware text feature ($\mathbf{f}_{txt}$) and the multimodal feature ($\mathbf{f}_{multi}$), respectively. The text-image contrastive loss ($\mathcal{L}_{\mathrm{con}}$) and the text-image matching loss ($\mathcal{L}_{\mathrm{mat}}$) are then used as the training objectives for cross-modal retrieval. (b) The knowledge extraction process. In the knowledge-aware text encoder, knowledge extraction consists of keyword extraction, knowledge retrieval and knowledge sentence construction. After knowledge extraction, the knowledge sentences $\mathbf{s_k}$ are collected for each caption $\mathbf{s}$. Numbers in the feature vectors denote their dimensions.
  • Figure 3: Examples of images and text sentences from the three datasets.
  • Figure 4: Examples of knowledge sentences retrieved from different knowledge sources. Keywords (nouns) in the text and knowledge sentences are in bold. External concepts that are highly related, somewhat related, or unrelated to the image content are marked in green, orange, and red, respectively. If there are more than 3 knowledge sentences, we randomly select 3 sentences from the knowledge extraction results. The colors are manually annotated.
  • Figure 5: The cosine similarity scores for image-text retrieval of 21 image and text pairs from the UCM-Caption dataset sampled from different scene categories. The horizontal axis represents the text index, and the vertical axis represents the image index.
  • ...and 4 more figures
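
The three knowledge extraction steps named in the Figure 2(b) caption (keyword extraction, knowledge retrieval, knowledge sentence construction) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration: the toy triplets in `TOY_KNOWLEDGE`, the fixed noun vocabulary `TOY_NOUNS` (standing in for a real noun extractor), the `max_per_keyword` limit, and all function names are hypothetical and not taken from the paper. The sketch only shows how a caption $\mathbf{s}$ could end up paired with a list of knowledge sentences $\mathbf{s_k}$.

```python
# Toy stand-in for an external knowledge graph: (head, relation, tail) triplets.
# These triplets are invented for illustration, not taken from the paper.
TOY_KNOWLEDGE = [
    ("airport", "has", "runway"),
    ("airport", "used for", "travel"),
    ("river", "is a", "body of water"),
    ("bridge", "used for", "crossing a river"),
]

# Hypothetical noun vocabulary; a fixed set replaces a real part-of-speech tagger.
TOY_NOUNS = {"airport", "river", "bridge", "plane"}


def extract_keywords(caption):
    """Step 1: keep caption tokens that are known nouns, in order, without duplicates."""
    tokens = [tok.strip(".,;:!?").lower() for tok in caption.split()]
    keywords = []
    for tok in tokens:
        if tok in TOY_NOUNS and tok not in keywords:
            keywords.append(tok)
    return keywords


def retrieve_knowledge(keywords, max_per_keyword=2):
    """Step 2: look up triplets whose head entity matches a keyword."""
    found = []
    for kw in keywords:
        matches = [trip for trip in TOY_KNOWLEDGE if trip[0] == kw]
        found.extend(matches[:max_per_keyword])  # cap is an assumed design choice
    return found


def build_knowledge_sentences(triplets):
    """Step 3: turn each (head, relation, tail) triplet into a short sentence."""
    return [f"{head} {relation} {tail}" for head, relation, tail in triplets]


def knowledge_aware_text(caption):
    """Pair a caption s with its knowledge sentences s_k."""
    keywords = extract_keywords(caption)
    sentences = build_knowledge_sentences(retrieve_knowledge(keywords))
    return caption, sentences


if __name__ == "__main__":
    s, s_k = knowledge_aware_text("A plane is parked at the airport near a river.")
    print(s)
    for sentence in s_k:
        print(" -", sentence)
```

Keywords with no entry in the knowledge source (here `plane`) simply contribute no sentences, so the caption is never lost even when retrieval comes back empty.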