Learning Street View Representations with Spatiotemporal Contrast

Yong Li, Yingjing Huang, Gengchen Mai, Fan Zhang

TL;DR

This work tackles how to learn street-view representations that capture both dynamic urban elements and ambient cues by introducing a self-supervised spatiotemporal contrastive framework. It defines temporal-invariance at a fixed location, spatial-invariance across nearby areas, and a global information representation to tailor features for downstream urban tasks, trained with an InfoNCE-based objective. Pretraining on a massive, geographically diverse street-view corpus yields task-specific benefits: temporal contrast helps visual place recognition, spatial contrast improves socioeconomic prediction, and self-contrast helps safety perception, establishing a practical urban-vision benchmark. The results, complemented by attention and frequency-domain analyses, demonstrate the value of carefully aligning contrastive objectives with downstream goals and provide a foundation for applying street-view data to urban science.

Abstract

Street view imagery is extensively utilized in representation learning for urban visual environments, supporting various sustainable development tasks such as environmental perception and socio-economic assessment. However, it is challenging for existing image representations to specifically encode the dynamic urban environment (such as pedestrians, vehicles, and vegetation), the built environment (including buildings, roads, and urban infrastructure), and the environmental ambiance (such as the cultural and socioeconomic atmosphere) depicted in street view imagery to address downstream tasks related to the city. In this work, we propose an innovative self-supervised learning framework that leverages temporal and spatial attributes of street view imagery to learn image representations of the dynamic urban environment for diverse downstream tasks. By employing street view images captured at the same location over time and spatially nearby views at the same time, we construct contrastive learning tasks designed to learn the temporal-invariant characteristics of the built environment and the spatial-invariant neighborhood ambiance. Our approach significantly outperforms traditional supervised and unsupervised methods in tasks such as visual place recognition, socioeconomic estimation, and human-environment perception. Moreover, we demonstrate the varying behaviors of image representations learned through different contrastive learning objectives across various downstream tasks. This study systematically discusses representation learning strategies for urban studies based on street view images, providing a benchmark that enhances the applicability of visual data in urban science. The code is available at https://github.com/yonglleee/UrbanSTCL.
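The contrastive tasks described above pair each image with a positive view: the same location at a different time (temporal contrast) or a nearby location at the same time (spatial contrast). The sketch below shows a generic InfoNCE loss over such pairs, assuming the other positives in the batch serve as negatives. The function names, the temperature value, and the in-batch negative scheme are illustrative assumptions, not the authors' implementation, which may differ.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss for a batch of (anchor, positive) embedding pairs.

    anchors[i] and positives[i] are embeddings of a positive pair, e.g. the
    same location at two times (temporal) or two nearby locations (spatial).
    Every positives[j] with j != i is treated as a negative for anchors[i]
    (an assumption of this sketch).
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    losses = []
    for i, anchor in enumerate(a):
        # Temperature-scaled cosine similarity to every candidate in the batch.
        logits = [sum(x * y for x, y in zip(anchor, cand)) / temperature for cand in p]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Negative log-probability of picking the true positive.
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


# Toy usage: correctly matched pairs should give a lower loss than swapped pairs.
matched = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Under this formulation, changing which images count as positives (temporal neighbors, spatial neighbors, or augmentations of the same image) changes what the representation is encouraged to ignore, which is the design choice the abstract ties to downstream task performance.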

Paper Structure

This paper contains 24 sections, 3 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Spatial and temporal contrastive learning with street view images. Using street view images captured at the same location over time, contrastive learning tasks are designed to learn the temporal-invariant characteristics of the built environment. Using spatially proximate street view images from the same period, learning tasks are designed to learn the spatial-invariant neighborhood ambiance, such as the socioeconomic atmosphere.
  • Figure 2: Performance comparison on different visual place recognition datasets (Recall@K in %).
  • Figure 3: Comparison of retrieval results using GSV-Self, GSV-Spatial, and GSV-Temporal methods for a given query image (Year: 2018, Heading: $90^\circ$, Location: Chicago). Each row shows the top-5 street view images retrieved by a different self-supervised pretrained model, ranked by image feature similarity to the query image. The GSV-Temporal results are all within a 10-meter radius and have identical heading angles, but correspond to different time periods, demonstrating temporal invariance of the learned image representations. The GSV-Spatial results cover a larger geographic area with nearby timeframes, maintaining a consistent overall ambiance.
  • Figure 4: Attention maps for two queries visualized across models and depths. Red boxes indicate regions of focus. GSV-Self (a, d) emphasizes objects like cars. GSV-Temporal (b, e) filters out dynamic objects, highlighting static elements. GSV-Spatial (c, f) shows consistent focus across queries, capturing overall spatial structures.
  • Figure 5: Visualization of attention distance and $\Delta$ Log Amplitude across depths for ImageNet and GSV models. Depth refers to the network layers in the ViT model, from shallow (Depth 1) to deep layers (Depth 12). (a) and (b) display the attention distance, which represents the average spatial range of the attention mechanism in each layer—a larger value indicates that the model attends to more globally distributed features, while smaller values suggest a focus on local patterns. (c) and (d) present the $\Delta$ Log Amplitude, where higher values (closer to 0) reflect stronger retention of high-frequency information (e.g., edges, textures), and lower values (more negative) indicate a focus on low-frequency components, representing global structures or smooth transitions.
  • ...and 1 more figures
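
Figure 2 reports Recall@K in percent. As a minimal sketch of how that metric is commonly computed in retrieval settings, the function below counts a query as a hit if any of its top-K retrieved items is a true match. The function name, the input layout, and the hit criterion are assumptions for illustration; the paper's evaluation protocol (for example, the distance threshold that defines a true match) is not given in this excerpt.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Percentage of queries with at least one true match in the top-k results.

    ranked_ids[i]   : database ids for query i, sorted by descending similarity.
    relevant_ids[i] : set of database ids that count as correct for query i.
    """
    hits = sum(
        1
        for ranked, relevant in zip(ranked_ids, relevant_ids)
        if any(item in relevant for item in ranked[:k])
    )
    return 100.0 * hits / len(ranked_ids)


# Toy usage with two queries and hypothetical database ids.
ranked = [[3, 1, 2], [5, 4, 6]]
relevant = [{1}, {6}]
r1 = recall_at_k(ranked, relevant, 1)
r2 = recall_at_k(ranked, relevant, 2)
r3 = recall_at_k(ranked, relevant, 3)
```

Recall@K can only stay the same or increase as K grows, since a larger K only adds candidates to each query's shortlist.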