Foundation Models for Remote Sensing: An Analysis of MLLMs for Object Localization
Darryl Hannan, John Cooper, Dylan White, Timothy Doster, Henry Kvinge, Yijing Watkins
TL;DR
This work assesses whether recent multimodal large language models (MLLMs) with explicit localization capabilities can perform object localization in earth observation (EO) imagery, in both zero-shot and few-shot regimes. By benchmarking Molmo, Qwen-2.5-VL, and Llama-3.2 on RarePlanes, Aerial Animal Population, and xBD, the study quantifies centerpoint mAP and compares against Faster R-CNN to determine the data requirements for parity. It finds that localization-enabled MLLMs can excel when targets are large and distinct, with Molmo-72B often leading, while performance degrades in out-of-domain settings and on fine-grained tasks; prompt and ground sample distance (GSD) selection substantially influence performance. The paper also demonstrates that DoRA-based PEFT can yield modest gains with limited labeled data but may harm generalization, and it offers practical guidance on prompt design, GSD tiling strategies, and failure modes to inform EO practitioners deploying these models. Overall, the results suggest promising but task-dependent utility of localization-enabled MLLMs for broad-area EO search and rare-object detection, motivating further research into robust fine-tuning and prompt-engineering strategies for remote sensing tasks.
Abstract
Multimodal large language models (MLLMs) have altered the landscape of computer vision, obtaining impressive results across a wide range of tasks, especially in zero-shot settings. Unfortunately, their strong performance does not always transfer to out-of-distribution domains, such as earth observation (EO) imagery. Prior work has demonstrated that MLLMs excel at some EO tasks, such as image captioning and scene understanding, while failing at tasks that require more fine-grained spatial reasoning, such as object localization. However, MLLMs are advancing rapidly, and insights quickly become outdated. In this work, we analyze more recent MLLMs that have been explicitly trained to include fine-grained spatial reasoning capabilities, benchmarking them on EO object localization tasks. We demonstrate that these models are performant in certain settings, making them well suited for zero-shot scenarios. Additionally, we provide a detailed discussion of prompt selection, ground sample distance (GSD) optimization, and failure cases. We hope that this work will prove valuable as others evaluate whether an MLLM is well suited for a given EO localization task and how to optimize it.
