
FLAVARS: A Multimodal Foundational Language and Vision Alignment Model for Remote Sensing

Isaac Corley, Simone Fobi Nsutezo, Anthony Ortiz, Caleb Robinson, Rahul Dodhia, Juan M. Lavista Ferres, Peyman Najafirad

TL;DR

FLAVARS tackles the trade-off in multimodal remote-sensing pretraining by fusing FLAVA-style masked modeling and contrastive learning with an explicit geospatial alignment objective. It introduces the SkyScript-Grounded dataset and a SatCLIP-informed location encoder to jointly align images, text, and coordinates, yielding improved vision-only representations as evidenced by KNN and SpaceNet1 segmentation gains, while retaining zero-shot and retrieval capabilities. Although CLIP-based pretraining achieves stronger zero-shot alignment, FLAVARS offers a balanced alternative that enhances dense-vision performance without sacrificing cross-modal utility. The work highlights the practical impact of incorporating geospatial awareness into multimodal RS pretraining and points to future work on mitigating the remaining trade-offs between dense-vision tasks and multimodal alignment.

Abstract

Remote sensing imagery is dense with objects and contextual visual information. There is a recent trend to combine paired satellite images and text captions for pretraining performant encoders for downstream tasks. However, while contrastive image-text methods like CLIP enable vision-language alignment and zero-shot classification ability, vision-only downstream performance tends to degrade compared to image-only pretraining, such as MAE. In this paper, we propose FLAVARS, a pretraining method that combines the best of both contrastive learning and masked modeling, along with geospatial alignment via contrastive location encoding. We find that FLAVARS significantly outperforms a baseline of SkyCLIP for vision-only tasks such as KNN classification and semantic segmentation (+6% mIoU on SpaceNet1), while retaining the ability to perform zero-shot classification, unlike MAE-pretrained methods.
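
To make "geospatial alignment via contrastive location encoding" concrete, the following is a minimal sketch of a symmetric, CLIP-style contrastive loss between image embeddings and location embeddings. It is an illustration under stated assumptions, not the authors' implementation: the function names, the temperature of 0.07, and the toy two-dimensional embeddings are all hypothetical, and the real model would produce the embeddings with learned image and location encoders.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(image_emb, location_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of paired embeddings.

    image_emb[i] and location_emb[i] are assumed to describe the same sample.
    The temperature value is an assumption, not taken from the paper.
    """
    img = [l2_normalize(v) for v in image_emb]
    loc = [l2_normalize(v) for v in location_emb]
    # Pairwise cosine similarities, scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(i_vec, l_vec)) / temperature for l_vec in loc]
        for i_vec in img
    ]
    # Transpose to get the location-to-image direction.
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


if __name__ == "__main__":
    # Toy batch: two samples with hand-picked embeddings.
    images = [[1.0, 0.0], [0.0, 1.0]]
    matched_locations = [[1.0, 0.0], [0.0, 1.0]]
    swapped_locations = [[0.0, 1.0], [1.0, 0.0]]
    print("matched pairs:", symmetric_contrastive_loss(images, matched_locations))
    print("swapped pairs:", symmetric_contrastive_loss(images, swapped_locations))
```

In this toy batch the loss is close to zero when each image is paired with its own location and large when the pairs are swapped, which is the behavior a contrastive alignment objective is meant to reward.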

Paper Structure (15 sections, 2 figures, 3 tables)

This paper contains 15 sections, 2 figures, 3 tables.

Figures (2)

  • Figure 1: The architecture of our proposed FLAVARS vision-language-location pretraining framework. The components consist of the original FLAVA masked image modeling, masked language modeling, multimodal image-text matching, and global image-text contrastive losses. In addition, we combine these with a global location-image contrastive loss on geospatial coordinates, which we use to align images, text, and their geospatial coordinates.
  • Figure 2: A sample from our SkyScript-Grounded dataset. We improve the original captions in the SkyScript dataset using GPT-4V by prompting with a caption improvement and localization instruction along with the image of interest. The grounded captions contain bounding-box pixel coordinates encompassing the objects in the image with corresponding OSM tags. As an example, we plot the resulting boxes on the sample image above.
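
The loss terms named in the Figure 1 caption can be summarized as one combined pretraining objective. The form below is only a sketch: it assumes the terms are combined as a weighted sum, and the weights $\lambda$ are hypothetical symbols whose values are not given in this summary.

```latex
\mathcal{L}_{\text{FLAVARS}}
  = \lambda_{\text{MIM}}\,\mathcal{L}_{\text{MIM}}
  + \lambda_{\text{MLM}}\,\mathcal{L}_{\text{MLM}}
  + \lambda_{\text{ITM}}\,\mathcal{L}_{\text{ITM}}
  + \lambda_{\text{IT}}\,\mathcal{L}_{\text{image-text}}
  + \lambda_{\text{IL}}\,\mathcal{L}_{\text{image-location}}
```

Here $\mathcal{L}_{\text{MIM}}$ and $\mathcal{L}_{\text{MLM}}$ denote the masked image and masked language modeling losses, $\mathcal{L}_{\text{ITM}}$ the multimodal image-text matching loss, and the last two terms the global image-text and location-image contrastive losses.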