Improving Region Representation Learning from Urban Imagery with Noisy Long-Caption Supervision

Yimei Zhang, Guojiang Shen, Kaili Ning, Tongwei Ren, Xuebo Qiu, Mengmeng Wang, Xiangjie Kong

TL;DR

UrbanLN tackles the challenge of learning urban region representations from noisy, long-caption supervision by introducing Information-Preserved Stretching Interpolation (IPSI) to preserve long-text semantics during cross-modal pre-training, and a dual-level noise suppression pipeline combining multi-model captioning, divide-and-conquer refinement, consensus-based evaluation, and momentum-based self-distillation. The framework leverages a CLIP backbone to fuse long textual descriptions with urban imagery, achieving robust, scalable representations for diverse downstream tasks across multiple cities. Empirical results show UrbanLN consistently outperforms state-of-the-art baselines in city-indicator prediction, with improved transferability and efficiency. The work offers a practical, noise-robust pathway to integrate rich textual knowledge into urban region analysis, enabling more accurate and scalable city analytics.

Abstract

Region representation learning plays a pivotal role in urban computing by extracting meaningful features from unlabeled urban data. Analogous to how perceived facial age reflects an individual's health, the visual appearance of a city serves as its "portrait", encapsulating latent socio-economic and environmental characteristics. Recent studies have explored leveraging Large Language Models (LLMs) to incorporate textual knowledge into imagery-based urban region representation learning. However, two major challenges remain: i) difficulty in aligning fine-grained visual features with long captions, and ii) suboptimal knowledge incorporation due to noise in LLM-generated captions. To address these issues, we propose a novel pre-training framework called UrbanLN that improves Urban region representation learning through Long-text awareness and Noise suppression. Specifically, we introduce an information-preserved stretching interpolation strategy that aligns long captions with fine-grained visual semantics in complex urban scenes. To effectively mine knowledge from LLM-generated captions and filter out noise, we propose a dual-level optimization strategy. At the data level, a multi-model collaboration pipeline automatically generates diverse and reliable captions without human intervention. At the model level, we employ a momentum-based self-distillation mechanism to generate stable pseudo-targets, facilitating robust cross-modal learning under noisy conditions. Extensive experiments across four real-world cities and various downstream tasks demonstrate the superior performance of our UrbanLN.
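The abstract does not spell out how the momentum-based self-distillation is implemented. As a rough illustration only, the sketch below follows a common pattern for this kind of mechanism: a slowly updated "teacher" copy of the model (an exponential moving average of the student weights) produces soft similarity scores, which are blended with the one-hot image-caption pairing label to form pseudo-targets. The function names, the momentum value, and the blending weight `alpha` are assumptions made for this example and are not taken from the paper.

```python
import math


def ema_update(teacher, student, momentum=0.995):
    """Move each teacher weight a small step toward the matching student weight."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_targets(teacher_logits, true_index, alpha=0.4):
    """Blend the one-hot pairing label with the teacher's softmax distribution.

    alpha is a hypothetical blending weight: 0 keeps the hard label,
    1 trusts the teacher completely.
    """
    probs = softmax(teacher_logits)
    return [
        alpha * p + (1.0 - alpha) * (1.0 if i == true_index else 0.0)
        for i, p in enumerate(probs)
    ]


def cross_entropy(student_logits, targets):
    """Cross-entropy between the soft targets and the student's prediction."""
    log_probs = [math.log(p) for p in softmax(student_logits)]
    return -sum(t * lp for t, lp in zip(targets, log_probs))


# Toy weights: the teacher drifts toward the student after one update.
teacher = ema_update([0.0, 1.0], [1.0, 3.0], momentum=0.9)

# Toy image-to-caption similarities for one image against three captions;
# caption 0 is the (possibly noisy) paired caption.
targets = soft_targets([2.0, 0.5, 0.1], true_index=0)
loss = cross_entropy([1.5, 0.2, 0.3], targets)
```

Because the targets keep some probability mass on captions the teacher considers plausible, a single noisy caption pulls the student less strongly than a hard one-hot label would, which is the intuition behind using stable pseudo-targets under noisy supervision.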

Paper Structure

This paper contains 35 sections, 6 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Overview of UrbanLN. The urban imagery input can be either satellite imagery or street-view imagery, with the methodology section primarily focusing on street-view imagery as a representative case.
  • Figure 2: Prediction versus the ground truth on the BJ dataset using satellite imagery. The dotted line is at 45$^\circ$. $R^2_{test}$ and $R^2_{all}$ correspond to the results of testing regions (purple dots) and all regions (blue crosses), respectively.
  • Figure 3: Results of ablation study on $R^2$ metric.
  • Figure 4: Comparison of parameters and inference speed.
  • Figure 5: The $R^2$ for the transferability test on street-view and satellite imagery-based population prediction.
  • ...and 3 more figures