REO-VLM: Transforming VLM to Meet Regression Challenges in Earth Observation
Xizhe Xue, Guoting Wei, Hao Chen, Haokui Zhang, Feng Lin, Chunhua Shen, Xiao Xiang Zhu
TL;DR
This work investigates Vision Language Models (VLMs) for Earth Observation (EO) regression by introducing the REO-Instruct benchmark and the unified REO-VLM model. REO-Instruct combines 1.6 million multimodal EO image–text pairs across RGB, multispectral, and SAR data with domain-text annotations to support both regression (e.g., Above Ground Biomass, AGB) and generation tasks, enabling knowledge-driven reasoning. REO-VLM extends LLaVA-1.5 with spectral recombination, a reverse projection module, and a regression head, employing a two-stage training regime to align language-driven reasoning with numeric outputs. Experiments across land cover classification, VQA-based human activity monitoring, ecological patch counting, and AGB regression show that multimodal inputs and domain-informed training improve performance, though numeric regression remains challenging and benefits from balanced multi-layer visual features. The work highlights the potential of integrating domain knowledge in multimodal EO models and points to future directions in higher-resolution data, additional modalities, and uncertainty quantification to enhance reliability and interpretability.
Abstract
The rapid evolution of Vision Language Models (VLMs) has driven significant advances in artificial intelligence and expanded research across many disciplines, including Earth Observation (EO). While VLMs have improved image understanding and data processing within EO, their applications have predominantly focused on image content description. This narrow focus overlooks their potential in geographic and scientific regression tasks, which are essential for diverse EO applications. To bridge this gap, this paper introduces a novel benchmark dataset, called \textbf{REO-Instruct}, to unify regression and generation tasks specifically for the EO domain. Comprising 1.6 million multimodal EO image-language pairs, this dataset is designed to support both biomass regression and image content interpretation tasks. Leveraging this dataset, we develop \textbf{REO-VLM}, a model that integrates regression capabilities with traditional generative functions. By using language-driven reasoning to incorporate scientific domain knowledge, REO-VLM goes beyond relying solely on EO imagery, enabling comprehensive interpretation of complex scientific attributes from EO data. This approach establishes new performance benchmarks and significantly enhances the capabilities of environmental monitoring and resource management.
