ChatENV: An Interactive Vision-Language Model for Sensor-Guided Environmental Monitoring and Scenario Simulation

Hosam Elgendy, Ahmed Sharshar, Ahmed Aboeitta, Mohsen Guizani

TL;DR

ChatENV presents an interactive vision–language framework that fuses satellite image pairs with real-world environmental sensor data to enable grounded, scenario-based environmental reasoning. By building a large sensor-aware temporal satellite dataset and employing a dual-model annotation pipeline (GPT-4o and Gemini 2.0), the authors fine-tune a Qwen-2.5-VL backbone with LoRA adapters to support single-turn descriptions, what-if analyses, and three-turn difference queries. The approach achieves strong temporal and interactive reasoning performance, outperforming several baselines and demonstrating the value of sensor data in explaining environmental changes. This work advances practical environmental monitoring by providing a grounded, sensor-aware, interactive tool with potential for real-time deployment and expanded multimodal fusion.

Abstract

Understanding environmental changes from remote sensing imagery is vital for climate resilience, urban planning, and ecosystem monitoring. Yet, current vision-language models (VLMs) overlook causal signals from environmental sensors, rely on single-source captions prone to stylistic bias, and lack interactive scenario-based reasoning. We present ChatENV, the first interactive VLM that jointly reasons over satellite image pairs and real-world sensor data. Our framework: (i) creates a 177k-image dataset forming 152k temporal pairs across 62 land-use classes in 197 countries with rich sensor metadata (e.g., temperature, PM10, CO); (ii) annotates data using GPT-4o and Gemini 2.0 for stylistic and semantic diversity; and (iii) fine-tunes Qwen-2.5-VL using efficient Low-Rank Adaptation (LoRA) adapters for chat purposes. ChatENV achieves strong performance in temporal and "what-if" reasoning (e.g., BERT-F1 0.902) and rivals or outperforms state-of-the-art temporal models, while supporting interactive scenario-based analysis. This positions ChatENV as a powerful tool for grounded, sensor-aware environmental monitoring.
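
To make the LoRA step in (iii) concrete, the following is a minimal sketch of the general low-rank adaptation idea, not the authors' training code. It assumes the standard formulation in which a frozen weight matrix `W0` is augmented by a trainable product `B A` scaled by `alpha / r`, with `B` initialized to zero. All names, dimensions, and values here are hypothetical and chosen only for illustration.

```python
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def lora_forward(w0, a, b, x, alpha, r):
    """Compute W0 x + (alpha / r) * B (A x).

    w0 is the frozen base weight; only a and b would receive gradients.
    """
    base = matvec(w0, x)
    update = matvec(b, matvec(a, x))
    scale = alpha / r
    return [h + scale * u for h, u in zip(base, update)]


def lora_trainable_params(d_out, d_in, r):
    """Number of trainable values in A (r x d_in) plus B (d_out x r)."""
    return r * (d_in + d_out)


# Hypothetical toy dimensions; real adapters act on much larger layers.
rng = random.Random(0)
d_in, d_out, r, alpha = 8, 6, 2, 4

w0 = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen
a = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]    # trainable
b = [[0.0] * r for _ in range(d_out)]                                # trainable, zero init
x = [rng.gauss(0, 1) for _ in range(d_in)]

y = lora_forward(w0, a, b, x, alpha, r)
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and training only has to learn `r * (d_in + d_out)` values instead of `d_in * d_out`. The saving grows with layer size, which is why the approach is described as efficient.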

Paper Structure

This paper contains 12 sections, 5 figures, and 11 tables.

Figures (5)

  • Figure 1: Pipeline overview for ChatENV. Aerial RGB tiles and sensor-tagged prompts (e.g., temperature, humidity, CO2) are encoded via frozen Qwen 2.5 ViT and text encoders, respectively. Their embeddings are projected into a shared space to condition a Qwen 2.5 decoder, with only LoRA adapters and an optional linear probe trained. Token-level cross-entropy on enriched captions trains the model to (i) describe scenes, (ii) reason over current environmental data, and (iii) answer “what-if” queries.
  • Figure 2: Visual overview of the preprocessing pipeline for environmental change analysis. The process starts with satellite imagery sourcing (Data Source), followed by temporal pairing (Image Pairing), integration of Weather Data, enrichment with emissions (Emission Data), and annotation via GPT-4o and Gemini 2.0 (Annotation Generation).
  • Figure 3: Distribution of samples by total score from manual evaluation of the testing set. Each sample was rated 1-5 on each of three criteria. Samples with a total score over 9 points were kept as the testing set.
  • Figure 4: Treemap visualization showing the distribution of satellite image counts by country and category. Larger rectangles indicate more frequent object classes in the dataset, such as swimming pools, stadiums, and roads. The spatial diversity across countries, like the prevalence of crop fields in France and Italy, or urban structures in the United States and the Russian Federation, highlights the broad geographic and semantic coverage, which is critical for robust change analysis.
  • Figure 5: The figure illustrates a what-if interaction with ChatENV. Given the initial image and environmental metadata, the user poses a scenario question: “What will happen if construction is done, and many buildings are built over this patch of land?” The model generates a detailed answer that closely matches the second (ground-truth) image’s description, such as increased PM10 due to dust and traffic, decreased $\mathrm{NO_2}$ following construction, altered wind patterns, and heat retention from concrete.
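
The test-set filtering rule described in the Figure 3 caption can be sketched as follows. This is an illustrative reading of the caption rather than the authors' evaluation script. It assumes that "over 9" means a strictly greater-than comparison on the sum of the three ratings. The `RatedSample` type, the sample identifiers, and the scores are hypothetical, and the three criteria are left unnamed because the caption does not name them.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RatedSample:
    sample_id: str                 # hypothetical identifier
    scores: Tuple[int, int, int]   # one 1-5 rating per (unnamed) criterion


def total_score(sample: RatedSample) -> int:
    """Sum the three ratings after checking they lie in the 1-5 range."""
    if len(sample.scores) != 3:
        raise ValueError("expected exactly three ratings")
    if not all(1 <= s <= 5 for s in sample.scores):
        raise ValueError("each rating must be between 1 and 5")
    return sum(sample.scores)


def filter_test_set(samples: List[RatedSample], threshold: int = 9) -> List[RatedSample]:
    """Keep samples whose total is strictly above the threshold (assumed reading)."""
    return [s for s in samples if total_score(s) > threshold]


# Hypothetical ratings, for illustration only.
candidates = [
    RatedSample("pair_a", (5, 4, 4)),  # total 13, kept
    RatedSample("pair_b", (3, 3, 3)),  # total 9, dropped under the strict reading
    RatedSample("pair_c", (2, 5, 3)),  # total 10, kept
]
kept = filter_test_set(candidates)
```

With three ratings per sample the total ranges from 3 to 15, so a threshold of 9 keeps samples that score above the midpoint.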