Knowledge-aware Text-Image Retrieval for Remote Sensing Images

Li Mi; Xianjie Dai; Javiera Castillo-Navarro; Devis Tuia

TL;DR

Experimental results on three commonly used remote sensing text–image retrieval benchmarks show that the proposed knowledge-aware method leads to varied and consistent retrievals, outperforming state-of-the-art retrieval methods.

Abstract

Image-based retrieval in large Earth observation archives is challenging because one needs to navigate across thousands of candidate matches with only the query image as a guide. Using text to support the visual query makes the retrieval system more usable, but it also introduces difficulties, because the diversity of visual signals cannot be summarized by a short caption alone. For this reason, as a matching-based task, cross-modal text-image retrieval often suffers from information asymmetry between texts and images. To address this challenge, we propose a Knowledge-aware Text-Image Retrieval (KTIR) method for remote sensing images. By mining relevant information from an external knowledge graph, KTIR enriches the text scope available in the search query and narrows the information gap between texts and images for better matching. Moreover, by integrating domain-specific knowledge, KTIR also improves the adaptation of pre-trained vision-language models to remote sensing applications. Experimental results on three commonly used remote sensing text-image retrieval benchmarks show that the proposed knowledge-aware method leads to varied and consistent retrievals, outperforming state-of-the-art retrieval methods.
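To make the idea of enriching a caption with knowledge-graph content more concrete, here is a minimal Python sketch of the three steps named in the paper outline (keyword extraction, knowledge triplet retrieval, knowledge sentence construction). Everything in it is an assumption made for illustration: the toy triplets, the stop-word list standing in for a real noun tagger, and the function names are hypothetical and are not taken from the authors' implementation.

```python
import re

# Hypothetical toy knowledge graph: (head, relation, tail) triplets.
TOY_TRIPLETS = [
    ("airport", "has a", "runway"),
    ("airport", "is used for", "landing planes"),
    ("harbor", "is at location", "coast"),
    ("boat", "is at location", "harbor"),
]

# Hypothetical stop list standing in for a real part-of-speech tagger.
STOPWORDS = {"a", "an", "the", "is", "are", "in", "at", "of",
             "with", "many", "some", "there"}


def extract_keywords(caption):
    """Return the non-stopword caption words, lowercased, in order, deduplicated."""
    words = re.findall(r"[a-z]+", caption.lower())
    keywords = []
    for word in words:
        if word not in STOPWORDS and word not in keywords:
            keywords.append(word)
    return keywords


def retrieve_triplets(keywords, triplets, max_per_keyword=2):
    """Collect triplets whose head equals a keyword, capped per keyword."""
    retrieved = []
    for keyword in keywords:
        matches = [t for t in triplets if t[0] == keyword]
        retrieved.extend(matches[:max_per_keyword])
    return retrieved


def build_knowledge_sentences(triplets):
    """Turn each (head, relation, tail) triplet into a plain sentence."""
    return [f"{head} {relation} {tail}" for head, relation, tail in triplets]


def enrich_caption(caption, triplets=TOY_TRIPLETS):
    """Return the caption together with its knowledge sentences."""
    keywords = extract_keywords(caption)
    sentences = build_knowledge_sentences(retrieve_triplets(keywords, triplets))
    return caption, sentences


if __name__ == "__main__":
    caption, knowledge = enrich_caption("There is an airport with many planes.")
    print(caption)
    for sentence in knowledge:
        print("  +", sentence)
```

In this sketch the caption and its knowledge sentences are simply returned side by side; how they are fused inside the text encoder is a separate design question that the toy code does not attempt to reproduce.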

Paper Structure (39 sections, 11 equations, 9 figures, 10 tables)

This paper contains 39 sections, 11 equations, 9 figures, 10 tables.

Introduction
Related Work
Text-Image Retrieval for Remote Sensing Images
External Knowledge Sources
Knowledge-aware Vision-Language Research
Knowledge-aware Text-Image Retrieval Method
Image encoder
Knowledge-aware text encoder
Knowledge extraction
Keyword extraction
Knowledge triplet retrieval
Knowledge sentence construction
Fusion in the text encoder
Knowledge-aware text feature
Multi-modal feature
...and 24 more sections

Figures (9)

Figure 1: Intuition behind the proposed KTIR system: in a standard text-image retrieval approach (a), text and images are matched directly, while in KTIR (b), commonsense knowledge is added from external sources (a knowledge base) to make the retrieval more varied, robust to ambiguities and consistent with general knowledge.
Figure 2: (a) The KTIR pipeline. The proposed text-image retrieval system comprises three main components: an image encoder, a knowledge-aware text encoder and a similarity measurement module. The image feature ($\mathbf{f}_{img}$) is obtained with a ViT (Dosovitskiy et al., 2020) ($\operatorname{ViT}$). A BERT (Devlin et al., 2019) is used in text-only mode ($\operatorname{BERT}_{text}$, green) and in multimodal mode ($\operatorname{BERT}_{multi}$, yellow) to encode the knowledge-aware text feature ($\mathbf{f}_{txt}$) and the multimodal feature ($\mathbf{f}_{multi}$), respectively. The text-image contrastive loss ($\mathcal{L}_{\mathrm{con}}$) and the text-image matching loss ($\mathcal{L}_{\mathrm{mat}}$) are then used as the training objectives for cross-modality retrieval. (b) The knowledge extraction process. In the knowledge-aware text encoder, knowledge extraction consists of keyword extraction, knowledge retrieval and knowledge sentence construction. After knowledge extraction, the knowledge sentences $\mathbf{s_k}$ are collected for each caption $\mathbf{s}$. Numbers in the feature vectors denote their dimension.
Figure 3: Examples of images and text sentences from the three datasets.
Figure 4: Examples of retrieved knowledge sentences from different knowledge sources. Keywords (nouns) in the text and knowledge sentences are in bold. External concepts that are highly related, somewhat related, or unrelated to the image content are marked in green, orange, and red, respectively. If there are more than 3 knowledge sentences, we randomly select 3 sentences from the knowledge extraction results. The colors are manually annotated.
Figure 5: The cosine similarity scores for image-text retrieval of 21 image and text pairs from the UCM-Caption dataset sampled from different scene categories. The horizontal axis represents the text index, and the vertical axis represents the image index.
...and 4 more figures
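
The captions of Figures 2 and 5 refer to cosine similarity scores between image and text features and to a text-image contrastive loss. The following is a minimal pure-Python sketch of both, assuming a generic symmetric InfoNCE-style formulation; it is not claimed to match the paper's exact $\mathcal{L}_{\mathrm{con}}$. The function names, the toy feature vectors and the temperature value of 0.07 are hypothetical.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (assumes it is not the zero vector)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def similarity_matrix(img_feats, txt_feats):
    """Cosine similarities: rows index images, columns index texts."""
    imgs = [l2_normalize(v) for v in img_feats]
    txts = [l2_normalize(v) for v in txt_feats]
    return [[sum(a * b for a, b in zip(img, txt)) for txt in txts] for img in imgs]


def rank_images_for_text(sim, text_idx):
    """Return image indices sorted from most to least similar to one text."""
    scores = [row[text_idx] for row in sim]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def _cross_entropy(logits, target):
    """Numerically stable -log softmax(logits)[target]."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_z - logits[target]


def contrastive_loss(sim, temperature=0.07):
    """Symmetric image-to-text and text-to-image loss; pair i matches pair i."""
    n = len(sim)
    i2t = sum(_cross_entropy([s / temperature for s in sim[i]], i)
              for i in range(n)) / n
    t2i = sum(_cross_entropy([sim[i][j] / temperature for i in range(n)], j)
              for j in range(n)) / n
    return 0.5 * (i2t + t2i)


if __name__ == "__main__":
    # Hypothetical 2-D features for two image-text pairs.
    sim = similarity_matrix([[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 3.0]])
    print(sim)
    print(rank_images_for_text(sim, 1))
    print(contrastive_loss(sim))
```

With well-aligned features the similarity matrix is close to diagonal, which is the pattern a plot like Figure 5 is meant to reveal, and the loss is near zero; misaligned pairs push the loss up.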
