Learning Street View Representations with Spatiotemporal Contrast
Yong Li, Yingjing Huang, Gengchen Mai, Fan Zhang
TL;DR
This work tackles how to learn street-view representations that capture both dynamic urban elements and ambient cues by introducing a self-supervised spatiotemporal contrastive framework. It defines three objectives, trained with an InfoNCE-based loss, to tailor features for downstream urban tasks: temporal invariance at a fixed location, spatial invariance across nearby areas, and a global information representation learned through self-contrast. Pretraining on a massive, geographically diverse street-view corpus yields task-specific benefits: temporal contrast helps visual place recognition, spatial contrast improves socioeconomic prediction, and self-contrast helps safety perception, establishing a practical urban-vision benchmark. The results, complemented by attention and frequency-domain analyses, demonstrate the value of aligning contrastive objectives with downstream goals and provide a foundation for applying street-view data to urban science.
Abstract
Street view imagery is extensively utilized in representation learning for urban visual environments, supporting various sustainable development tasks such as environmental perception and socio-economic assessment. However, existing image representations struggle to specifically encode the dynamic urban environment (such as pedestrians, vehicles, and vegetation), the built environment (including buildings, roads, and urban infrastructure), and the environmental ambiance (such as the cultural and socioeconomic atmosphere) depicted in street view imagery, which limits their use in city-related downstream tasks. In this work, we propose a self-supervised learning framework that leverages the temporal and spatial attributes of street view imagery to learn image representations of the dynamic urban environment for diverse downstream tasks. Using street view images captured at the same location over time, and spatially nearby views captured at the same time, we construct contrastive learning tasks designed to learn the temporal-invariant characteristics of the built environment and the spatial-invariant neighborhood ambiance. Our approach significantly outperforms traditional supervised and unsupervised methods in tasks such as visual place recognition, socioeconomic estimation, and human-environment perception. Moreover, we demonstrate that image representations learned through different contrastive learning objectives behave differently across downstream tasks. This study systematically discusses representation learning strategies for urban studies based on street view images, providing a benchmark that enhances the applicability of visual data in urban science. The code is available at https://github.com/yonglleee/UrbanSTCL.
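
To make the contrastive setup concrete, the following is a minimal sketch of an InfoNCE-style loss over paired embeddings, not the authors' implementation. It assumes that row i of `anchors` and row i of `positives` form a positive pair (for example, the same location at two times for temporal contrast, or two nearby views for spatial contrast) and that all other rows in the batch act as negatives. The function name `info_nce`, the temperature value, and the use of in-batch negatives are illustrative assumptions; the released code may differ.

```python
import numpy as np


def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE-style loss with in-batch negatives (illustrative sketch).

    anchors, positives: float arrays of shape (N, D); row i of each array
    is assumed to be a positive pair. The temperature is an assumed value.
    """
    # L2-normalize so the dot product equals cosine similarity.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)

    # Pairwise similarities, shape (N, N); the diagonal holds positive pairs.
    logits = (a @ p.T) / temperature

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Average negative log-likelihood of the positive pair in each row.
    return float(-np.mean(np.diag(log_prob)))


# Toy embeddings: matched pairs give a small loss, mismatched pairs a large one.
matched_loss = info_nce(np.eye(4), np.eye(4))
mismatched_loss = info_nce(np.eye(4), np.roll(np.eye(4), 1, axis=0))
```

Under these assumptions, switching between temporal, spatial, and self-contrast only changes how the positive pairs are sampled; the loss itself stays the same.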
