Composed Image Retrieval for Remote Sensing

Bill Psomas, Ioannis Kakogeorgiou, Nikos Efthymiadis, Giorgos Tolias, Ondrej Chum, Yannis Avrithis, Konstantinos Karantzalos

TL;DR

This work tackles the limitation of single-modality queries in remote sensing image retrieval (RSIR) by introducing composed image retrieval (CIR), which combines an image query with a textual modification. It presents WeiCom, a training-free method that fuses image- and text-based similarities through a similarity normalization step and a modality-control parameter $\lambda$, along with PatternCom, a dedicated benchmark for remote sensing CIR (RS-CIR). Experiments with CLIP and RemoteCLIP demonstrate state-of-the-art performance and show that the optimal balance between modalities depends on the encoder. The work enables more expressive, zero-shot retrieval in Earth observation and establishes a foundation for multimodal RSIR research.

Abstract

This work introduces composed image retrieval to remote sensing. It allows querying a large image archive with image examples modified by a textual description, which enriches the descriptive power over unimodal queries, whether visual or textual. The textual part can modify various attributes, such as shape, color, or context. A novel method fusing image-to-image and text-to-image similarity is introduced. We demonstrate that a vision-language model possesses sufficient descriptive power and that no further learning step or training data is necessary. We present a new evaluation benchmark focused on color, context, density, existence, quantity, and shape modifications. Our work not only sets the state of the art for this task, but also serves as a foundational step in addressing a gap in the field of remote sensing image retrieval. Code at: https://github.com/billpsomas/rscir

Paper Structure (11 sections, 2 equations, 2 figures, 3 tables)

This paper contains 11 sections, 2 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: WeiCom: A Weighted Composed Image Retrieval Method. It uses a dual-encoder approach to process both the query image $y$ and the query text $t$. First, the query image is passed through a visual encoder $f$ and the query text through a text encoder $g$, each producing a $d$-dimensional representation. Next, similarity scores between each query representation and the representations of the images in the dataset are computed. These scores are then normalized and merged by a convex combination controlled by a parameter $\lambda \in [0,1]$. Finally, an argmax operation identifies the single most relevant image $x$, or an argsort operation returns the full ranked list of retrieved images.
  • Figure 2: Demonstrating remote sensing composed image retrieval. Subfigures (a) to (h) depict the key attributes: color, context, density, existence, quantity, shape, size, and texture. Each one illustrates various utilizations of composed image retrieval in remote sensing. Subfigures (b), (d) are examples that extend the task to multiple classes and attributes.
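The pipeline described in the Figure 1 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`l2_normalize`, `min_max`, `weighted_fusion_rank`) are hypothetical, the embeddings are assumed to be precomputed by some dual encoder such as CLIP, min-max scaling stands in for the paper's similarity normalization step (whose exact form is not given in this summary), and the choice that $\lambda$ weights the text term while $1-\lambda$ weights the image term is also an assumption.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def min_max(scores, eps=1e-12):
    """Map scores to [0, 1]. ASSUMPTION: a stand-in for the paper's
    similarity normalization, which this summary does not specify."""
    span = max(float(scores.max() - scores.min()), eps)
    return (scores - scores.min()) / span


def weighted_fusion_rank(query_img_emb, query_txt_emb, db_embs, lam=0.5):
    """Rank database images by a convex combination of normalized
    image-to-image and text-to-image similarities.

    query_img_emb: (d,) embedding of the query image (from a visual encoder).
    query_txt_emb: (d,) embedding of the query text (from a text encoder).
    db_embs:       (n, d) embeddings of the database images.
    lam:           weight in [0, 1]. ASSUMPTION: lam weights the text term.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")

    db = l2_normalize(np.asarray(db_embs, dtype=float))
    s_img = db @ l2_normalize(np.asarray(query_img_emb, dtype=float))
    s_txt = db @ l2_normalize(np.asarray(query_txt_emb, dtype=float))

    # Normalize each score vector so the two modalities are comparable,
    # then fuse them with a convex combination.
    fused = (1.0 - lam) * min_max(s_img) + lam * min_max(s_txt)

    # argsort on the negated scores gives indices from best to worst;
    # the first index is the argmax.
    ranking = np.argsort(-fused)
    return ranking, fused
```

With `lam=0` the ranking under these assumptions depends only on the image query, with `lam=1` only on the text query, and intermediate values trade the two off, which is the role the caption assigns to $\lambda$.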