Unsupervised Urban Land Use Mapping with Street View Contrastive Clustering and a Geographical Prior

Lin Che, Yizi Chen, Tanhua Jin, Martin Raubal, Konrad Schindler, Peter Kiefer

TL;DR

This work tackles the challenge of urban land use mapping without labeled data by leveraging ground-level street-view images. It introduces Contrastive Clustering with Geographical Priors (CCGP), which fuses visual similarity with spatial proximity to learn coherent representations and cluster assignments, guided by Tobler's law. A post-clustering visual assignment (PCVA) step then translates clusters into actionable land-use categories, enabling grid-based map generation. Experiments on Milan and San Francisco street-view datasets show that CCGP, especially with PCVA (CCGP-PCVA), achieves superior clustering quality and spatial coherence compared to baselines, providing a scalable solution for city-scale land-use mapping and updating without labeled data.

Abstract

Urban land use classification and mapping are critical for urban planning, resource management, and environmental monitoring. Existing remote sensing techniques often lack precision in complex urban environments due to the absence of ground-level details. Unlike aerial perspectives, street view images provide a ground-level view that captures more human and social activities relevant to land use in complex urban scenes. Existing street view-based methods primarily rely on supervised classification, which is challenged by the scarcity of high-quality labeled data and the difficulty of generalizing across diverse urban landscapes. This study introduces an unsupervised contrastive clustering model for street view images with a built-in geographical prior, to enhance clustering performance. When combined with a simple visual assignment of the clusters, our approach offers a flexible and customizable solution to land use mapping, tailored to the specific needs of urban planners. We experimentally show that our method can generate land use maps from geotagged street view image datasets of two cities. As our methodology relies on the universal spatial coherence of geospatial data ("Tobler's law"), it can be adapted to various settings where street view images are available, to enable scalable, unsupervised land use mapping and updating. The code will be available at https://github.com/lin102/CCGP.
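
One way to picture the geographical prior is as a sampling rule: an image taken close to another is treated as a positive sample for it, on the assumption (Tobler's law) that nearby locations tend to share a land use. The following is a minimal sketch of that idea only, not the authors' implementation. The function name `sample_spatial_positives`, the 50 m radius, and the use of planar coordinates in meters are all illustrative assumptions.

```python
import math
import random


def sample_spatial_positives(coords, radius_m=50.0, seed=0):
    """Pick one spatially close image as the positive sample for each image.

    coords: list of (x, y) positions in meters (assumed to be already projected).
    radius_m: hypothetical neighborhood radius; not a value taken from the paper.
    Returns a list where entry i is the index of the positive for image i.
    An image with no neighbor inside the radius is paired with itself.
    """
    rng = random.Random(seed)
    positives = []
    for i, (xi, yi) in enumerate(coords):
        # Brute-force neighbor search, O(N^2): fine for a sketch, too slow for a city.
        candidates = [
            j
            for j, (xj, yj) in enumerate(coords)
            if j != i and math.hypot(xi - xj, yi - yj) <= radius_m
        ]
        positives.append(rng.choice(candidates) if candidates else i)
    return positives


# Three images on the same street and one isolated image far away.
coords = [(0.0, 0.0), (10.0, 0.0), (15.0, 5.0), (1000.0, 1000.0)]
positives = sample_spatial_positives(coords)
```

In a contrastive setup, each pair `(i, positives[i])` would then be augmented and pulled together in feature space, while the isolated image falls back to a purely augmentation-based positive.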

Paper Structure

This paper contains 21 sections, 7 equations, 7 figures, 3 tables, 1 algorithm.

Figures (7)

  • Figure 1: A motivational example illustrating the importance of spatial consistency. Two SVIs were taken within 20 meters of each other. Image (a) shows numerous pedestrians and shops, suggesting a commercial area. However, the adjacent image (b) contains dynamic objects such as trams and construction activities that occlude many of those cues, leading to significant visual differences between SVIs of the same land use type.
  • Figure 2: The proposed framework consists of three components: CCGP network (orange), PCVA (blue), and Grid Map Generation (green). CCGP selects spatially close images as positive samples, enriches them with data augmentation, and learns instance and cluster representations. PCVA assigns land use labels to clusters by manual interpretation of a representative high-confidence SVI for each cluster. Grid Map Generation aggregates the land use labels assigned to SVIs into a dense raster map.
  • Figure 3: Land use mapping by clustering SVIs. Results at 100 m $\times$ 100 m grid resolution are shown for $k$-means, PICA, CC, CCGP, and CCGP-PCVA. Colors denote different land use categories as per the legend.
  • Figure 4: 2D t-SNE visualization of features learned by $k$-means, PICA, CC, and CCGP for the San Francisco dataset.
  • Figure 5: Mean accuracy for varying $K$ between 1 and 50. Shaded regions denote standard error bars.
  • ...and 2 more figures
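
The Grid Map Generation step named in the Figure 2 caption, which aggregates per-image land use labels into a raster, could be sketched as follows. This is an illustration under stated assumptions rather than the paper's procedure: the majority-vote rule, the function name `rasterize_labels`, and the planar coordinates in meters are assumptions, and only the 100 m cell size comes from the Figure 3 caption. Cells without any image are left empty here, so the result is not yet a dense map.

```python
from collections import Counter, defaultdict


def rasterize_labels(points, cell_size_m=100.0):
    """Aggregate labeled points into grid cells by majority vote.

    points: iterable of (x_m, y_m, label) with planar coordinates in meters (assumed).
    Returns a dict mapping (col, row) cell indices to the most frequent label.
    """
    votes = defaultdict(Counter)
    for x, y, label in points:
        # Floor division maps a coordinate to the index of the cell containing it.
        cell = (int(x // cell_size_m), int(y // cell_size_m))
        votes[cell][label] += 1
    # most_common(1) yields the top label; ties keep the label seen first.
    return {cell: counts.most_common(1)[0][0] for cell, counts in votes.items()}


# Hypothetical labels for four street view images.
points = [
    (12.0, 30.0, "commercial"),
    (80.0, 95.0, "commercial"),
    (55.0, 10.0, "residential"),
    (150.0, 20.0, "residential"),
]
grid = rasterize_labels(points)
```

Here the first three images fall into cell `(0, 0)`, where "commercial" wins two votes to one, and the fourth image alone determines cell `(1, 0)`.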