GEOBIND: Binding Text, Image, and Audio through Satellite Images

Aayush Dhakal, Subash Khanal, Srikumar Sastry, Adeel Ahmad, Nathan Jacobs

TL;DR

GeoBind binds satellite imagery to three other modalities (ground-level imagery, text, and audio) by learning a unified embedding space through two-stage contrastive training. Stage 1 aligns overhead satellite images with ground-level imagery and captions in CLIP space; Stage 2 aligns audio with the resulting satellite embeddings, producing a joint multimodal space. The approach supports cross-modal retrieval for satellite-ground, satellite-text, and satellite-audio pairs, with performance close to modality-specific baselines, and it can be extended to additional modalities. Because every modality is aligned to the same satellite embeddings, relationships also emerge between modalities that were never directly paired, which can simplify multimodal geospatial analysis and multimedia tagging tasks.

Abstract

In remote sensing, we are interested in modeling various modalities for a given geographic location. Several works have focused on learning the relationship between a location and its type of landscape, habitability, audio, textual descriptions, etc. Recently, a common way to approach these problems has been to train a deep-learning model that uses satellite images to infer some unique characteristics of the location. In this work, we present a deep-learning model, GeoBind, that can reason about multiple modalities, specifically text, image, and audio, from satellite imagery of a location. To do this, we use satellite images as the binding element and contrastively align all other modalities to the satellite image data. Our training results in a joint embedding space with multiple types of data: satellite image, ground-level image, audio, and text. Furthermore, our approach does not require a single complex dataset that contains all the modalities mentioned above. Rather, it only requires multiple datasets in which satellite images are paired with another modality. While we only align three modalities in this paper, we present a general framework that can be used to create an embedding space with any number of modalities by using satellite images as the binding element. Our results show that, unlike traditional unimodal models, GeoBind is versatile and can reason about multiple modalities for a given satellite image input.
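
The contrastive alignment described above can be illustrated with a small NumPy sketch. It assumes a symmetric InfoNCE-style objective over a batch of paired embeddings, which is a common choice for CLIP-like training; the exact loss used by GeoBind is not given in this section, so the function names, the temperature value, and the random toy data below are all illustrative assumptions rather than the authors' implementation.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(sat_emb, other_emb, temperature=0.07):
    """Symmetric InfoNCE loss for a batch where row i of both inputs is a pair.

    sat_emb:   (batch, dim) satellite-image embeddings (the binding modality).
    other_emb: (batch, dim) embeddings of the paired modality.
    temperature is an assumed value, not one taken from the paper.
    """
    sat = l2_normalize(sat_emb)
    other = l2_normalize(other_emb)
    logits = sat @ other.T / temperature          # (batch, batch) similarities
    idx = np.arange(logits.shape[0])              # matching pairs lie on the diagonal
    loss_sat_to_other = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_other_to_sat = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_sat_to_other + loss_other_to_sat)


# Toy data standing in for encoder outputs (hypothetical, not real embeddings).
rng = np.random.default_rng(0)
sat = rng.normal(size=(8, 16))
aligned = sat + 0.01 * rng.normal(size=(8, 16))   # nearly identical to their pairs
unrelated = rng.normal(size=(8, 16))              # no relation to the satellite rows

aligned_loss = symmetric_info_nce(sat, aligned)
unrelated_loss = symmetric_info_nce(sat, unrelated)
print(f"aligned: {aligned_loss:.4f}  unrelated: {unrelated_loss:.4f}")
```

Minimizing a loss of this shape pulls each satellite embedding toward its paired sample and away from the other samples in the batch, which is why separate satellite-paired datasets can be enough to place several modalities in one space.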

Paper Structure

This paper contains 8 sections, 4 equations, 2 figures, and 2 tables.

Figures (2)

  • Figure 1: Using CLIP space, our framework creates a joint embedding space where semantically related satellite images, ground-level images, audio, and text data are pushed close together.
  • Figure 2: We employ a two-stage training framework. In the first stage, we contrastively update the Satellite Encoder while keeping the CLIP Image Encoder frozen. In the second stage, we contrastively update the Audio Encoder while keeping the Satellite Encoder frozen. It is important to note that more stages can be trivially added to this framework.
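
Once the modalities share one embedding space, as in Figure 1, cross-modal retrieval reduces to a nearest-neighbor search by cosine similarity. The sketch below is a minimal illustration under assumed conditions: the embeddings are synthetic stand-ins for encoder outputs, and `retrieve_top_k` and the toy "location" anchors are hypothetical names introduced here, not part of GeoBind. It queries text with audio, a pair that is never directly aligned in the two-stage training but ends up comparable because both are tied to the satellite embeddings.

```python
import numpy as np


def cosine_similarity_matrix(queries, gallery):
    """Return a (num_queries, num_gallery) matrix of cosine similarities."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return q @ g.T


def retrieve_top_k(queries, gallery, k=1):
    """For each query row, return indices of the k most similar gallery rows."""
    sims = cosine_similarity_matrix(queries, gallery)
    return np.argsort(-sims, axis=1)[:, :k]


# Three hypothetical locations, each with its own direction in an 8-d space.
# Audio and text embeddings are modeled as noisy copies of the same anchor,
# mimicking two modalities that were both aligned to the satellite embedding.
rng = np.random.default_rng(0)
anchors = np.eye(3, 8)
audio_emb = anchors + 0.05 * rng.normal(size=(3, 8))
text_emb = anchors + 0.05 * rng.normal(size=(3, 8))

# Audio-to-text retrieval: which caption best matches each audio clip?
top1 = retrieve_top_k(audio_emb, text_emb, k=1)[:, 0]
print(top1)
```

The same function covers satellite-to-ground, satellite-to-text, and satellite-to-audio retrieval by swapping which embeddings serve as queries and which as the gallery.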