GeoLLaVA-8K: Scaling Remote-Sensing Multimodal Large Language Models to 8K Resolution

Fengxiang Wang, Mingshuo Chen, Yueying Li, Di Wang, Haotian Wang, Zonghao Guo, Zefan Wang, Boqi Shan, Long Lan, Yulin Wang, Hongzhen Wang, Wenjing Yang, Bo Du, Jing Zhang

TL;DR

The paper tackles scaling multimodal LLMs to ultra-high-resolution (UHR) remote sensing (RS) imagery by addressing two bottlenecks: data scarcity and token explosion. It introduces two UHR RS datasets, SuperRS-VQA and HighRS-VQA, and two token-efficient strategies, Background Token Pruning and Anchored Token Selection, alongside GeoLLaVA-8K, a model capable of processing $8{,}000\times8{,}000$ inputs. With 7B parameters, GeoLLaVA-8K achieves state-of-the-art results on XLRS-Bench, outperforming larger models while demonstrating strong generalization and token efficiency on high-resolution data. The work provides a scalable RS vision-language framework and dataset suite for future UHR RS question answering and analysis, and it highlights considerations for model scale and sensor modality expansion.

Abstract

Ultra-high-resolution (UHR) remote sensing (RS) imagery offers valuable data for Earth observation but poses challenges for existing multimodal foundation models due to two key bottlenecks: (1) limited availability of UHR training data, and (2) token explosion caused by the large image size. To address data scarcity, we introduce SuperRS-VQA (avg. 8,376$\times$8,376) and HighRS-VQA (avg. 2,000$\times$1,912), the highest-resolution vision-language datasets in RS to date, covering 22 real-world dialogue tasks. To mitigate token explosion, our pilot studies reveal significant redundancy in RS images: crucial information is concentrated in a small subset of object-centric tokens, while pruning background tokens (e.g., ocean or forest) can even improve performance. Motivated by these findings, we propose two strategies, Background Token Pruning and Anchored Token Selection, to reduce the memory footprint while preserving key semantics. Integrating these techniques, we introduce GeoLLaVA-8K, the first RS-focused multimodal large language model capable of handling inputs up to 8K$\times$8K resolution, built on the LLaVA framework. Trained on SuperRS-VQA and HighRS-VQA, GeoLLaVA-8K sets a new state of the art on XLRS-Bench.
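
To make the "token explosion" concrete, the following is a minimal illustrative sketch, not the paper's implementation. It assumes a ViT-style encoder with non-overlapping 14-pixel patches (the patch size is an assumption; the abstract does not state it) and counts the visual tokens produced at the resolutions mentioned above. The second function, `prune_redundant_tokens`, is a hypothetical stand-in heuristic that drops a token when it is nearly identical to the previously kept one; it only illustrates the idea that homogeneous background regions yield redundant tokens and should not be read as the authors' Background Token Pruning algorithm.

```python
import math


def visual_token_count(height, width, patch_size=14):
    """Count non-overlapping patch tokens for an image.

    patch_size=14 is an assumed value for illustration only.
    Partial patches at the border are counted as full patches.
    """
    rows = math.ceil(height / patch_size)
    cols = math.ceil(width / patch_size)
    return rows * cols


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def prune_redundant_tokens(tokens, threshold=0.99):
    """Hypothetical toy heuristic, not the paper's method.

    Scans tokens in order and keeps a token only if its cosine similarity
    to the most recently kept token is below `threshold`.
    Returns the indices of the kept tokens.
    """
    kept = []
    for index, token in enumerate(tokens):
        if kept and cosine_similarity(tokens[kept[-1]], token) >= threshold:
            continue  # near-duplicate of the last kept token, e.g. uniform ocean
        kept.append(index)
    return kept


if __name__ == "__main__":
    # Token counts under the assumed 14-pixel patch size.
    print(visual_token_count(8000, 8000))   # 327184
    print(visual_token_count(8376, 8376))   # 358801
    print(visual_token_count(2000, 1912))   # 19591

    # Toy feature vectors: a run of near-identical "background" tokens,
    # then a distinct "object" token, then background again.
    toy_tokens = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.001],
                  [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
    print(prune_redundant_tokens(toy_tokens))  # [0, 3, 5]
```

Under these assumptions, a single 8,000$\times$8,000 image yields over 300,000 visual tokens before any reduction, which is why some form of token pruning or selection is needed to fit such inputs into a language model's context.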

Paper Structure

This paper contains 33 sections, 3 equations, 14 figures, 13 tables.

Figures (14)

  • Figure 1: Comparison of image resolutions and annotation types
  • Figure 2: Results on XLRS-Bench
  • Figure 3: Example of our dataset
  • Figure 5: SuperRS-VQA
  • Figure 6: HighRS-VQA
  • ...and 9 more figures