Beyond AlphaEarth: Toward Human-Centered Spatial Representation via POI-Guided Contrastive Learning

Junyuan Liu, Quan Qin, Guangsheng Dong, Xinglei Wang, Jiazhuang Feng, Zichao Zeng, Tao Cheng

TL;DR

This work tackles the limitation of EO-driven urban representations in capturing human-centric city dynamics. It introduces AETHER, a lightweight POI-guided multimodal alignment framework that enriches AlphaEarth embeddings with POI-derived textual semantics via a two-branch architecture and contrastive learning. The approach yields consistent improvements on land-use classification and socioeconomic distribution mapping in Greater London, demonstrating the value of coupling physical morphology with functional semantics while remaining computationally efficient. By providing a portable, modular recipe for integrating POIs with EO backbones, AETHER advances geospatial foundation models toward general-purpose, human-aware urban representations.

Abstract

General-purpose spatial representations are essential for building transferable geospatial foundation models (GFMs). Among them, the AlphaEarth Foundation (AE) represents a major step toward a global, unified representation of the Earth's surface, learning 10-meter embeddings from multi-source Earth Observation (EO) data that capture rich physical and environmental patterns across diverse landscapes. However, such EO-driven representations remain limited in capturing the functional and socioeconomic dimensions of cities, as they primarily encode physical and spectral patterns rather than human activities or spatial functions. We propose AETHER (AlphaEarth-POI Enriched Representation Learning), a lightweight framework that adapts AlphaEarth to human-centered urban analysis through multimodal alignment guided by Points of Interest (POIs). AETHER aligns AE embeddings with textual representations of POIs, enriching physically grounded EO features with semantic cues about urban functions and socioeconomic contexts. In Greater London, AETHER achieves consistent gains over the AE baseline, with a 7.2% relative improvement in land-use classification F1 and a 23.6% relative reduction in Kullback-Leibler divergence for socioeconomic mapping. Built upon pretrained AE, AETHER leverages a lightweight multimodal alignment to enrich it with human-centered semantics while remaining computationally efficient and scalable for urban applications. By coupling EO with human-centered semantics, it advances geospatial foundation models toward general-purpose urban representations that integrate both physical form and functional meaning.
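The two headline numbers are relative changes in a metric, one of which is a Kullback-Leibler divergence between distributions. The sketch below shows how such quantities are commonly computed. The direction of the divergence (reference distribution against predicted distribution), the `eps` smoothing, and the helper names `kl_divergence` and `relative_change` are assumptions for illustration, not details taken from the paper, and the toy inputs are not the paper's data.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same categories.

    Assumption: p is the reference distribution and q the predicted one.
    Both inputs are renormalized so they sum to 1.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    # eps avoids log(0) and division by zero for empty categories.
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def relative_change(baseline, new):
    """Signed relative change of `new` with respect to `baseline`."""
    return (new - baseline) / baseline

# Toy, hypothetical distributions over three categories.
reference = [0.5, 0.3, 0.2]
prediction = [0.4, 0.4, 0.2]
kl = kl_divergence(reference, prediction)
```

A negative `relative_change` on a KL value corresponds to a relative reduction, which is the direction of improvement for a divergence; a positive value on F1 corresponds to a relative improvement.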

Paper Structure

This paper contains 39 sections, 12 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Overview of the proposed AETHER framework. During multimodal pretraining (left), AE embeddings within dual spatial buffers (base 50 m, augmented 100 m) are pooled and projected, while POI text embeddings are generated via a text encoder and projector. Two contrastive losses are applied: a cross-modal AE–POI alignment loss and an intra-modal multi-scale AE consistency loss (weighted by $\lambda$). During inference (right), only the frozen AE projector is used to generate city-wide embeddings, which are aggregated to region-level features and passed into task heads for downstream applications.
  • Figure 2: Comparison of AETHER, AlphaEarth (AE), and the strongest external baseline (SOTA). (a) Land-Use Classification (LUC, F1↑). (b) Socioeconomic Distribution Mapping (SDM, KL↓). Dashed gray lines denote SOTA performance. Numbers above bars indicate relative changes compared with SOTA.
  • Figure 3: Sensitivity to the loss balance coefficient $\lambda$.
  • Figure 4: Model performance under varying training data volumes.
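The training objective described in the Figure 1 caption combines a cross-modal AE-POI alignment loss with an intra-modal multi-scale AE consistency loss weighted by $\lambda$. A minimal sketch of one way to write such an objective follows. It assumes a symmetric InfoNCE form, a temperature of 0.07, and that the base-buffer AE view is the one aligned with the POI text; none of these choices is confirmed by the text above, and the function names are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE: row i of `a` and row i of `b` form the positive pair.

    Assumption: this loss form and the temperature value are illustrative.
    """
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature

    def cross_entropy_on_diagonal(l):
        # Numerically stable log-softmax over each row.
        l = l - l.max(axis=1, keepdims=True)
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the a->b and b->a directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))

def combined_loss(ae_base, ae_aug, poi_text, lam=0.5, temperature=0.07):
    """Cross-modal alignment plus lam-weighted multi-scale consistency.

    ae_base:  projected AE embeddings pooled over the base buffer (50 m)
    ae_aug:   projected AE embeddings pooled over the augmented buffer (100 m)
    poi_text: projected POI text embeddings for the same locations
    Assumption: the base view is the one paired with the POI text.
    """
    cross_modal = info_nce(ae_base, poi_text, temperature)
    multi_scale = info_nce(ae_base, ae_aug, temperature)
    return cross_modal + lam * multi_scale

# Toy batch of 8 locations with 16-dimensional projected embeddings.
rng = np.random.default_rng(0)
ae_base = rng.normal(size=(8, 16))
ae_aug = ae_base + 0.1 * rng.normal(size=(8, 16))
poi_text = rng.normal(size=(8, 16))
loss = combined_loss(ae_base, ae_aug, poi_text, lam=0.5)
```

Setting `lam=0` reduces the objective to the cross-modal term alone, which is the kind of variation a sensitivity study over $\lambda$ would sweep.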