Composed Image Retrieval for Remote Sensing

Bill Psomas; Ioannis Kakogeorgiou; Nikos Efthymiadis; Giorgos Tolias; Ondrej Chum; Yannis Avrithis; Konstantinos Karantzalos

TL;DR

This work tackles the limitation of single-modality queries in remote sensing image retrieval by introducing composed image retrieval (CIR) that combines an image query with a textual modification. It presents WeiCom, a training-free method that fuses image- and text-based similarities through a similarity normalization step and a modality-control parameter $\lambda$, along with PatternCom as a dedicated RS-CIR benchmark. Experiments with CLIP and RemoteCLIP demonstrate state-of-the-art performance and highlight that the optimal balance between modalities depends on the encoder. The work enables more expressive, zero-shot retrieval in earth observation and establishes a foundation for multimodal RSIR research.

Abstract

This work introduces composed image retrieval to remote sensing. It enables querying a large image archive with image examples altered by a textual description, which is more expressive than a unimodal query, whether visual or textual. The textual part can modify various attributes, such as shape, color, or context. A novel method fusing image-to-image and text-to-image similarity is introduced. We demonstrate that a vision-language model possesses sufficient descriptive power and that no further learning step or training data are necessary. We present a new evaluation benchmark focused on color, context, density, existence, quantity, and shape modifications. Our work not only sets the state of the art for this task, but also serves as a foundational step in addressing a gap in the field of remote sensing image retrieval. Code at: https://github.com/billpsomas/rscir

Paper Structure (11 sections, 2 equations, 2 figures, 3 tables)

This paper contains 11 sections, 2 equations, 2 figures, 3 tables.

Introduction
Related Work
Method
Problem formulation
Baselines
WeiCom
Experiments
Datasets, networks and evaluation protocol
Experimental results
Ablation study
Conclusions

Figures (2)

Figure 1: WeiCom: A Weighted Composed Image Retrieval Method. It uses a dual-encoder approach to process both the query image $y$ and the query text $t$. First, the query image is passed through a visual encoder $f$ and the query text through a text encoder $g$, each producing a $d$-dimensional representation. Similarity scores between each query representation and the representations of the images in the dataset are then computed. These scores are normalized and combined in a convex combination controlled by a parameter $\lambda \in [0,1]$. Finally, an argmax (or argsort) operation identifies the most relevant retrieved image(s) $x$.
Figure 2: Demonstrating remote sensing composed image retrieval. Subfigures (a) to (h) depict the key attributes: color, context, density, existence, quantity, shape, size, and texture. Each one illustrates various utilizations of composed image retrieval in remote sensing. Subfigures (b), (d) are examples that extend the task to multiple classes and attributes.
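
The pipeline described in the Figure 1 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the embeddings are assumed to be precomputed by some visual encoder $f$ and text encoder $g$, the function names are hypothetical, min-max scaling is used only as a stand-in for the paper's normalization step (whose exact form is not given here), and the choice that $\lambda$ weights the text similarity while $1-\lambda$ weights the image similarity is an assumption.

```python
import numpy as np


def l2_normalize(v, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + eps)


def min_max(s, eps=1e-12):
    """Stand-in normalization (assumption): rescale scores to roughly [0, 1]."""
    return (s - s.min()) / (s.max() - s.min() + eps)


def weighted_composed_retrieval(query_img_emb, query_txt_emb, db_embs, lam=0.5, k=5):
    """Rank database images by a convex combination of two similarities.

    query_img_emb: (d,) embedding of the query image (from a visual encoder).
    query_txt_emb: (d,) embedding of the query text (from a text encoder).
    db_embs:       (n, d) embeddings of the database images.
    lam:           weight in [0, 1]; here lam=0 is image-only and lam=1 is
                   text-only (an assumed convention).
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    db = l2_normalize(np.asarray(db_embs, dtype=float))
    s_img = db @ l2_normalize(np.asarray(query_img_emb, dtype=float))
    s_txt = db @ l2_normalize(np.asarray(query_txt_emb, dtype=float))
    # Normalize each score vector so the two modalities are on a comparable scale.
    scores = lam * min_max(s_txt) + (1.0 - lam) * min_max(s_img)
    ranking = np.argsort(-scores)  # descending order of combined score
    return ranking[:k], scores


if __name__ == "__main__":
    # Toy 2-D embeddings: item 2 resembles both the image and the text query.
    db = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
    top, scores = weighted_composed_retrieval(
        np.array([1.0, 0.0]), np.array([0.0, 1.0]), db, lam=0.5, k=3
    )
    print(top, scores)
```

In the toy example, the image-only setting favors item 0, the text-only setting favors item 1, and an intermediate $\lambda$ favors item 2, which is similar to both queries. This mirrors the role of $\lambda$ as a modality-control parameter.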
