Multilingual Vision-Language Pre-training for the Remote Sensing Domain

João Daniel Silva, Joao Magalhaes, Devis Tuia, Bruno Martins

TL;DR

This work proposes a novel vision-and-language model for the remote sensing domain, exploring the fine-tuning of a multilingual CLIP model and testing the use of a self-supervised method based on aligning local and global representations from individual input images, together with the standard CLIP objective.

Abstract

Methods based on Contrastive Language-Image Pre-training (CLIP) are now widely used for vision-and-language tasks involving remote sensing data, such as cross-modal retrieval. The adaptation of CLIP to this specific domain has relied on model fine-tuning with the standard contrastive objective, using either existing human-labeled image-caption datasets or synthetic image-caption pairs derived from other annotations over remote sensing images (e.g., object classes). The use of different pre-training mechanisms has received less attention, and only a few studies have considered multilingual inputs. This work proposes a novel vision-and-language model for the remote sensing domain, exploring the fine-tuning of a multilingual CLIP model and testing the use of a self-supervised method based on aligning local and global representations from individual input images, together with the standard CLIP objective. Model training relied on assembling pre-existing datasets of remote sensing images paired with English captions, followed by automated machine translation into nine additional languages. We show that translated data is indeed helpful, e.g., it also improves performance on English. Our resulting model, which we named Remote Sensing Multilingual CLIP (RS-M-CLIP), obtains state-of-the-art results in a variety of vision-and-language tasks, including cross-modal and multilingual image-text retrieval and zero-shot image classification.

Paper Structure

This paper contains 26 sections, 7 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Overview of our method, where self-distillation is combined with contrastive learning to fine-tune a CLIP model using image-caption pairs from remote sensing datasets. Following DINO [naeem2023silc, caron2021emerging], local and global image crops are fed into student and teacher networks. Each network produces a probability distribution of size $K$, and the student learns to match the teacher's distribution using a cross-entropy loss, aligning local and global views. The student is updated via gradient updates, while the teacher is updated with an Exponential Moving Average (EMA). The standard CLIP loss for matching image-text pairs is also applied.
  • Figure 2: Zero-shot classification performance on the EuroSAT dataset, across different languages.
  • Figure 3: Examples for image retrieval, ordered by similarity, given different captions in English, Portuguese, and French.
  • Figure 4: Semantic masks obtained with the NACLIP [hajimiri2024pay] method from RS-M-CLIP and CLIP, across different scenes.
  • Figure 5: Examples of captions generated with an approach based on LMCap [ramos2023lmcap], using an LLM for generating captions conditioned on retrieved examples obtained by RS-M-CLIP.
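
The training objective described in the Figure 1 caption (a student matching a teacher's distribution over $K$ outputs with a cross-entropy loss, an EMA-updated teacher, and the standard CLIP loss) can be sketched in plain Python as below. This is only an illustrative sketch, not the authors' implementation: the function names, the temperature and momentum values, and the `distill_weight` used to combine the two losses are all assumptions, and the sketch leaves out details that a DINO-style setup typically includes, such as centering of the teacher outputs.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def distillation_loss(student_logits, teacher_logits,
                      student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy H(teacher, student) between two K-way distributions.

    The student logits would come from a local crop and the teacher logits
    from a global crop. Temperatures are illustrative assumptions.
    """
    teacher_probs = [math.exp(v) for v in log_softmax(teacher_logits, teacher_temp)]
    student_log_probs = log_softmax(student_logits, student_temp)
    return -sum(t * s for t, s in zip(teacher_probs, student_log_probs))


def ema_update(teacher_params, student_params, momentum=0.996):
    """Teacher weights follow an exponential moving average of the student's."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]


def clip_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; row i of each input is a matching pair."""
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # Cosine similarity between every image and every caption.
    sims = [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]
    cols = [[sims[r][c] for r in range(n)] for c in range(n)]
    image_to_text = -sum(log_softmax(sims[k], temperature)[k] for k in range(n)) / n
    text_to_image = -sum(log_softmax(cols[k], temperature)[k] for k in range(n)) / n
    return 0.5 * (image_to_text + text_to_image)


def total_loss(image_embs, text_embs, student_logits, teacher_logits,
               distill_weight=1.0):
    """Hypothetical combination: CLIP loss plus weighted self-distillation."""
    return (clip_loss(image_embs, text_embs)
            + distill_weight * distillation_loss(student_logits, teacher_logits))
```

In a real training loop the gradient would flow only through the student branch, and `ema_update` would be applied to the teacher weights after each student step.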