UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence with Spatial Reasoning and Understanding

Jie Feng, Shengyuan Wang, Tianhui Liu, Yanxin Xi, Yong Li

TL;DR

UrbanLLaVA addresses the fragmentation of urban data by introducing a unified multi-modal framework that jointly processes four urban data types. It builds UData to synthesize diverse, city-scale instruction data, employs a three-stage UTrain to decouple reasoning from domain knowledge, and uses an enhanced UBench to benchmark cross-modal urban tasks. The method achieves consistent, substantial improvements over strong baselines across three cities and demonstrates transferable gains to different base models, underscoring its potential for general urban intelligence. The work provides a practical path toward scalable, spatially aware urban cognition with open-source data and pipelines for the research community.

Abstract

Urban research involves a wide range of scenarios and tasks that require the understanding of multi-modal data. Current methods often focus on specific data types and lack a unified framework in the urban field for processing them comprehensively. The recent success of multi-modal large language models (MLLMs) presents a promising opportunity to overcome this limitation. In this paper, we introduce $\textit{UrbanLLaVA}$, a multi-modal large language model designed to process four types of urban data simultaneously and achieve strong performance across diverse urban tasks compared with general MLLMs. In $\textit{UrbanLLaVA}$, we first curate a diverse urban instruction dataset encompassing both single-modal and cross-modal urban data, spanning from the location view to the global view of the urban environment. Additionally, we propose a multi-stage training framework that decouples spatial reasoning enhancement from domain knowledge learning, thereby improving the compatibility and downstream performance of $\textit{UrbanLLaVA}$ across diverse urban tasks. Finally, we extend an existing benchmark for urban research to assess the performance of MLLMs across a wide range of urban tasks. Experimental results from three cities demonstrate that $\textit{UrbanLLaVA}$ outperforms open-source and proprietary MLLMs in both single-modal tasks and complex cross-modal tasks and shows robust generalization abilities across cities. Source code and data are openly accessible to the research community via https://github.com/tsinghua-fib-lab/UrbanLLaVA.

Paper Structure

This paper contains 29 sections, 36 figures, 11 tables.

Figures (36)

  • Figure 1: Existing works vs. our UrbanLLaVA in urban research.
  • Figure 2: The framework of UrbanLLaVA, including UData, UTrain and UBench.
  • Figure 3: The thorough composition of UData in Beijing.
  • Figure 4: UTrain: three-stage training pipeline.
  • Figure 5: Performance of different training strategies. ‘K’ refers to knowledge learning, ‘TA’ refers to task alignment, and ‘Mix’ refers to mixture learning. ‘One stage: K + TA’ means knowledge learning and task alignment are merged in the same stage. ‘Two stage: TA$\rightarrow$K’ means task alignment first then knowledge learning in the second stage. ‘Three stage: TA$\rightarrow$K$\rightarrow$Mix’ adds a step in the third stage: mixture learning. The tasks detailed in the table are those with significant differences across different training strategies, while ‘Others’ refers to other tasks in UBench with smaller differences.
  • ...and 31 more figures
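
The three training strategies spelled out in the Figure 5 caption differ only in how the stages are ordered and merged. They can be written down as ordered stage lists, as in the minimal sketch below. The dictionary layout and the `describe` helper are hypothetical names introduced for illustration. They are not taken from the UrbanLLaVA codebase, and the sketch records only the stage order, not any training logic.

```python
# Hypothetical encoding of the strategies named in the Figure 5 caption.
# Abbreviations follow the caption: K, TA, Mix.
STAGES = {
    "K": "knowledge learning",
    "TA": "task alignment",
    "Mix": "mixture learning",
}

# Each strategy is a list of stages; a stage is a tuple of objectives
# trained together (so "K + TA" is one stage with two objectives).
STRATEGIES = {
    "One stage: K + TA": [("K", "TA")],
    "Two stage: TA->K": [("TA",), ("K",)],
    "Three stage: TA->K->Mix": [("TA",), ("K",), ("Mix",)],
}


def describe(strategy: str) -> list[str]:
    """Return one readable string per stage, in training order."""
    return [
        " + ".join(STAGES[key] for key in stage)
        for stage in STRATEGIES[strategy]
    ]


if __name__ == "__main__":
    for name in STRATEGIES:
        print(name, "=>", describe(name))
```

Written this way, the one-stage variant has a single entry containing both objectives, while the three-stage variant keeps task alignment, knowledge learning, and mixture learning as separate consecutive entries.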