DynamicVL: Benchmarking Multimodal Large Language Models for Dynamic City Understanding

Weihao Xuan, Junjue Wang, Heli Qi, Zihang Chen, Zhuo Zheng, Yanfei Zhong, Junshi Xia, Naoto Yokoya

TL;DR

This work introduces DVL-Suite, a large-scale framework for long-term urban dynamics analysis using remote sensing, comprising DVL-Bench for rigorous multi-temporal evaluation and DVL-Instruct for instruction-tuning. DVL-Bench provides 3,469 multi-temporal images across six tasks (pixel-level change detection, regional analysis, and narrative captioning) over 42 U.S. cities from 2005–2023, with detailed annotations and a consistent taxonomy tailored to urban dynamics. A broad evaluation of 18 MLLMs reveals that current models struggle with long-term temporal reasoning and quantitative analysis, motivating domain-specific data collection and methods. The authors introduce DVLChat, a baseline model built on DVL-Instruct that achieves competitive performance and demonstrates the value of instruction-tuning for dynamic city understanding, while also identifying gaps relative to specialized and commercial baselines. Overall, the work highlights the importance of long-horizon, quantitatively grounded benchmarks and domain-focused datasets for advancing multimodal models in remote-sensing applications with practical urban planning implications.

Abstract

Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in visual understanding, but their application to long-term Earth observation analysis remains limited, with existing work focusing primarily on single-temporal or bi-temporal imagery. To address this gap, we introduce DVL-Suite, a comprehensive framework for analyzing long-term urban dynamics through remote sensing imagery. Our suite comprises 14,871 high-resolution (1.0 m) multi-temporal images spanning 42 major U.S. cities from 2005 to 2023, organized into two components: DVL-Bench and DVL-Instruct. DVL-Bench includes six urban understanding tasks, from fundamental change detection (pixel-level) to quantitative analyses (regional-level) and comprehensive urban narratives (scene-level), capturing diverse urban dynamics including expansion/transformation patterns, disaster assessment, and environmental challenges. We evaluate 18 state-of-the-art MLLMs and reveal their limitations in long-term temporal understanding and quantitative analysis. These challenges motivate the creation of DVL-Instruct, a specialized instruction-tuning dataset designed to enhance models' capabilities in multi-temporal Earth observation. Building upon this dataset, we develop DVLChat, a baseline model capable of both image-level question-answering and pixel-level segmentation, facilitating a comprehensive understanding of city dynamics through language interactions.

Paper Structure

This paper contains 11 sections, 10 figures, 3 tables.

Figures (10)

  • Figure 1: Diverse tasks in the DVL-Bench. Our framework encompasses multiple levels of temporal understanding: from pixel-precise change detection and quantification to regional evolution analysis and dense temporal captioning. This hierarchical task design enables systematic evaluation of MLLMs' capabilities in multi-temporal Earth observation understanding.
  • Figure 2: The annotation pipeline of the proposed DVL-Suite. Four common urban dynamics are depicted from top to bottom: partial urban reconstruction, natural disasters, farmland conversion, and homeless encampments. In our semi-automatic pipeline, urban experts perform the basic annotations, while GPT-4.1 integrates this information to generate the paired instructions.
  • Figure 3: Task taxonomy and sample distribution in DVL-Bench. The multi-level task design evaluates MLLMs comprehensively.
  • Figure 4: Data distribution across the 42 rapidly growing cities. The number of temporal frames per sequence ranges from five to ten.
  • Figure 5: The basic change flow in DVL-Bench.
  • ...and 5 more figures