Large Language Models for Captioning and Retrieving Remote Sensing Images

João Daniel Silva, João Magalhães, Devis Tuia, Bruno Martins

TL;DR

This work tackles captioning and cross-modal retrieval for remote sensing (RS) images under limited data by freezing a large language model and a remote-sensing-tuned visual encoder, and aligning them via linear projections and a [RET] token. It introduces RS-CapRet, which jointly trains image-captioning and contrastive retrieval objectives while keeping most parameters frozen, yielding competitive or state-of-the-art results on NWPU-Captions and RSICD and enabling dialogue-style multi-modal interactions. A key design choice is finetuning CLIP on the Cap-4 data to form a strong RS-capable vision backbone (CLIP-Cap-4), with LLamaV2-7B as the decoder and a lightweight projection network bridging the two modalities. The results, analyses, and qualitative examples underscore the model’s potential for practical RS data exploration and multi-turn querying, while highlighting the importance of dataset scale and domain-specific embeddings for generalization and performance gains.

Abstract

Image captioning and cross-modal retrieval are examples of tasks that involve the joint analysis of visual and linguistic information. In connection to remote sensing imagery, these tasks can help non-expert users extract relevant Earth observation information for a variety of applications. Still, despite some previous efforts, the development and application of vision and language models to the remote sensing domain have been hindered by the relatively small size of the available datasets and models used in previous studies. In this work, we propose RS-CapRet, a Vision and Language method for remote sensing tasks, in particular image captioning and text-image retrieval. We specifically propose to use a highly capable large decoder language model together with image encoders adapted to remote sensing imagery through contrastive language-image pre-training. To bridge the image encoder and language decoder, we propose training simple linear layers on examples from a combination of different remote sensing image captioning datasets, keeping the other parameters frozen. RS-CapRet can then generate descriptions for remote sensing images and retrieve images from textual descriptions, achieving SOTA or competitive performance with existing methods. Qualitative results illustrate that RS-CapRet can effectively leverage the pre-trained large language model to describe remote sensing images and retrieve them based on different types of queries, and that it can process interleaved sequences of images and text in a dialogue manner.
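
The bridging step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the dimensions `D_IMG`, `D_LLM`, the number of visual prefix tokens `N_VIS`, and the helper names `project_image` and `build_llm_input` are all hypothetical, and random arrays stand in for the outputs of the frozen image encoder and the frozen language model's embedding table. The only point shown is that a single trainable linear layer maps an image embedding into the language model's input space, after which it is concatenated with the text embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, kept small for illustration (not the paper's values).
D_IMG = 16   # width of the frozen image encoder's output embedding
D_LLM = 32   # width of the frozen language model's input embeddings
N_VIS = 4    # assumed number of visual prefix tokens produced per image

# The only trainable parameters in this sketch: one linear layer.
W = rng.normal(scale=0.02, size=(D_IMG, N_VIS * D_LLM))
b = np.zeros(N_VIS * D_LLM)


def project_image(image_emb):
    """Map (batch, D_IMG) image embeddings to (batch, N_VIS, D_LLM) tokens."""
    out = image_emb @ W + b
    return out.reshape(image_emb.shape[0], N_VIS, D_LLM)


def build_llm_input(image_emb, text_token_emb):
    """Prepend the projected visual tokens to the text token embeddings."""
    visual = project_image(image_emb)
    return np.concatenate([visual, text_token_emb], axis=1)


# Random stand-ins for the frozen components' outputs.
image_emb = rng.normal(size=(2, D_IMG))       # frozen image encoder output
text_emb = rng.normal(size=(2, 5, D_LLM))     # frozen LLM token embeddings
llm_input = build_llm_input(image_emb, text_emb)
print(llm_input.shape)  # (2, 9, 32): 4 visual tokens + 5 text tokens
```

In a real training loop only `W` and `b` would receive gradients from the captioning loss, while the encoder and the language model stay fixed.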

Paper Structure (28 sections, 4 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Overview of the method. Left: CLIP is finetuned to the remote sensing domain with image-text pairs from image captioning datasets. Middle: Image captioning task, where image embeddings are obtained from a frozen image encoder and projected by a trainable linear layer into the input embedding space of the frozen large language model, then concatenated with the input text. Right: Trainable linear layers are used to apply contrastive learning between image representations and a special [RET] token to address text-image retrieval.
  • Figure 2: Qualitative examples of generated captions for images of different classes from the test set of the NWPU-Captions dataset (Cheng et al., 2022).
  • Figure 3: Examples of image retrieval by RS-CapRet given different user requests, considering object features and related topics.
  • Figure 4: Examples of dialogue with RS-CapRet, showing a) the ability to handle multi-modal inputs with interleaved sequences of images and text as well as b) reasoning abilities given world knowledge.
  • Figure 5: In-context learning ability of RS-CapRet: Given one example of the correct class of the input image, RS-CapRet can generate an accurate description where it had previously failed.
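
The retrieval branch in the Figure 1 caption (right panel) can be sketched as a contrastive objective between projected [RET] hidden states and projected image embeddings. The following is a minimal NumPy illustration under stated assumptions, not the paper's code: the symmetric InfoNCE form, the temperature of 0.07, and the names `contrastive_loss`, `W_txt`, and `W_img` are all assumptions made for this example.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def contrastive_loss(ret_hidden, image_emb, W_txt, W_img, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched (caption, image) pairs.

    ret_hidden: (batch, d_llm) hidden state of the [RET] token per caption.
    image_emb:  (batch, d_img) embedding from the frozen image encoder.
    W_txt, W_img: the trainable linear maps into a shared retrieval space.
    """
    t = l2_normalize(ret_hidden @ W_txt)
    v = l2_normalize(image_emb @ W_img)
    logits = (t @ v.T) / temperature          # row i: caption i vs. all images
    idx = np.arange(logits.shape[0])
    loss_t2i = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_i2t = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_t2i + loss_i2t), logits


rng = np.random.default_rng(0)
ret_hidden = rng.normal(size=(4, 8))   # stand-in for [RET] hidden states
image_emb = rng.normal(size=(4, 6))    # stand-in for frozen encoder output
W_txt = rng.normal(size=(8, 5))
W_img = rng.normal(size=(6, 5))

loss, logits = contrastive_loss(ret_hidden, image_emb, W_txt, W_img)
ranked = np.argsort(-logits, axis=1)   # per caption, images sorted by score
```

At inference time, text-to-image retrieval in this sketch amounts to ranking images by the similarity scores in each row of `logits`, as `ranked` does.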