Foundation Models for Remote Sensing: An Analysis of MLLMs for Object Localization

Darryl Hannan, John Cooper, Dylan White, Timothy Doster, Henry Kvinge, Yijing Watkins

TL;DR

This work assesses whether recent multimodal large language models (MLLMs) with explicit localization capabilities can perform object localization in earth observation (EO) imagery, in both zero-shot and few-shot regimes. The study benchmarks Molmo, Qwen-2.5-VL, and Llama-3.2 on RarePlanes, Aerial Animal Population, and xBD, reports centerpoint mAP, and compares against Faster RCNN to estimate how much labeled data the detector needs to reach parity. Localization-capable MLLMs can excel when targets are large and distinct, with Molmo-72B often leading, but performance degrades in out-of-domain settings and on fine-grained tasks. Prompt choice and ground sample distance (GSD) selection both substantially influence performance. The paper also shows that DoRA-based PEFT can yield modest gains with limited labeled data but may harm generalization, and it offers practical guidance on prompt design, GSD tiling strategies, and failure modes for EO practitioners deploying these models. Overall, the results suggest promising but task-dependent utility of localization-enabled MLLMs for broad-area EO search and rare-object detection, motivating further research into robust fine-tuning and prompt engineering for remote sensing tasks.

Abstract

Multimodal large language models (MLLMs) have altered the landscape of computer vision, obtaining impressive results across a wide range of tasks, especially in zero-shot settings. Unfortunately, their strong performance does not always transfer to out-of-distribution domains, such as earth observation (EO) imagery. Prior work has demonstrated that MLLMs excel at some EO tasks, such as image captioning and scene understanding, while failing at tasks that require more fine-grained spatial reasoning, such as object localization. However, MLLMs are advancing rapidly and insights quickly become outdated. In this work, we analyze more recent MLLMs that have been explicitly trained to include fine-grained spatial reasoning capabilities, benchmarking them on EO object localization tasks. We demonstrate that these models are performant in certain settings, making them well suited for zero-shot scenarios. Additionally, we provide a detailed discussion focused on prompt selection, ground sample distance (GSD) optimization, and failure case analysis. We hope that this work will prove valuable as others evaluate whether an MLLM is well suited for a given EO localization task and how to optimize it.

Paper Structure

This paper contains 24 sections, 5 equations, 21 figures, 3 tables.

Figures (21)

  • Figure 1: Sample outputs using various MLLMs (green dots=ground truth and red Xs=predictions) across three tasks: building detection (left), animal detection (middle), and plane detection (right).
  • Figure 2: Few-shot Faster RCNN performance with varying amounts of training images (blue lines) vs. the top-performing MLLM's performance (alt-color lines) for each task.
  • Figure 3: RarePlanes mAP scores across 50 prompts using Molmo 72B and evaluating on 50 random examples.
  • Figure 4: RarePlanes mAP scores for various prompts using Molmo 7B O and evaluating on 200 random examples. {cat} marks the location where each category on the y-axis is inserted.
  • Figure 5: Tiles from the RarePlanes dataset created at various GSDs. Each image is scaled and cropped to 1120 pixels by 1120 pixels, but the spatial extent increases and objects become less resolved at higher GSDs.
  • ...and 16 more figures
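
The Figure 5 caption notes that every tile is 1120 by 1120 pixels, while the spatial extent grows and objects become less resolved at higher GSDs. The arithmetic behind that trade-off can be sketched as follows. This is only an illustration: the 1120-pixel tile size comes from the caption, but the function name `plan_tile`, the `TilePlan` fields, and the example GSD values are hypothetical and are not taken from the paper's code or datasets.

```python
from dataclasses import dataclass

TILE_PX = 1120  # tile side length in pixels, as stated in the Figure 5 caption


@dataclass(frozen=True)
class TilePlan:
    target_gsd_m: float   # metres of ground per pixel after resampling
    scale_factor: float   # factor applied to the source image dimensions
    footprint_m: float    # ground distance spanned by one tile side
    source_crop_px: int   # side of the source window that becomes one tile


def plan_tile(native_gsd_m: float, target_gsd_m: float,
              tile_px: int = TILE_PX) -> TilePlan:
    """Hypothetical helper: relate a target GSD to tile footprint and resampling."""
    if native_gsd_m <= 0 or target_gsd_m <= 0:
        raise ValueError("GSD values must be positive")
    # A coarser (larger) target GSD means the source image is shrunk.
    scale = native_gsd_m / target_gsd_m
    # Each pixel covers target_gsd_m metres, so the tile side covers this much ground.
    footprint = tile_px * target_gsd_m
    # Number of source pixels per side that are squeezed into one tile.
    source_crop = round(tile_px / scale)
    return TilePlan(target_gsd_m, scale, footprint, source_crop)


if __name__ == "__main__":
    # Assumed example values: imagery captured at 0.5 m/pixel.
    for target in (0.5, 1.0, 2.0):
        print(plan_tile(native_gsd_m=0.5, target_gsd_m=target))
```

Under these assumed values, doubling the target GSD doubles the ground distance covered by a tile side and halves the number of pixels that span any given object, which is the loss of resolution the caption describes.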