Less is More: Multimodal Region Representation via Pairwise Inter-view Learning

Min Namgung, Yijun Lin, JangHyeon Lee, Yao-Yi Chiang

TL;DR

CooKIE introduces a scalable information-factorization framework for multimodal region representation learning, addressing the loss of modality-specific information in traditional inter-view contrastive learning. By combining intra-view pretraining with pairwise inter-view learning, it captures both unique and pairwise shared information across multiple geospatial modalities without explicit high-order dependency modeling. Empirical results on NYC and Delhi show state-of-the-art performance across population density, crime rate, greenness, and land use tasks, with substantial parameter and FLOP reductions compared to GFactorCL. The work demonstrates the practical benefit of pairwise inter-view learning for complex multimodal geospatial data and provides an efficient path to incorporating more modalities in RRL.

Abstract

With the increasing availability of geospatial datasets, researchers have explored region representation learning (RRL) to analyze complex region characteristics. Recent RRL methods use contrastive learning (CL) to capture shared information between two modalities but often overlook task-relevant unique information specific to each modality. Such modality-specific details can explain region characteristics that shared information alone cannot capture. Bringing information factorization to RRL can address this by factorizing multimodal data into shared and unique information. However, existing factorization approaches focus on two modalities, whereas RRL can benefit from various geospatial data. Extending factorization beyond two modalities is non-trivial because modeling high-order relationships introduces a combinatorial number of learning objectives, increasing model complexity. We introduce Cross modal Knowledge Injected Embedding (CooKIE), an information factorization approach for RRL that captures both shared and unique representations. CooKIE uses a pairwise inter-view learning approach that captures high-order information without modeling high-order dependency, avoiding exhaustive combinations. We evaluate CooKIE on three regression tasks and a land use classification task in New York City and Delhi, India. Results show that CooKIE outperforms existing RRL methods and a factorized RRL model, capturing multimodal information with fewer training parameters and floating-point operations (FLOPs). We release the code: https://github.com/MinNamgung/CooKIE.
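
The combinatorial growth described above can be made concrete with a small counting sketch. It assumes, purely for illustration, that a full high-order factorization needs one shared-information term per subset of two or more modalities, while a pairwise scheme needs one term per pair. The function names and the modality labels `p`, `a`, `s` (borrowed from the subscripts in Figure 1) are hypothetical, and this is not the authors' code.

```python
from itertools import combinations


def pairwise_terms(modalities):
    """One shared-information term per unordered pair of modalities."""
    return list(combinations(modalities, 2))


def all_subset_terms(modalities):
    """One term per subset of size >= 2 (assumed proxy for full high-order modeling)."""
    terms = []
    for size in range(2, len(modalities) + 1):
        terms.extend(combinations(modalities, size))
    return terms


if __name__ == "__main__":
    for mods in (["p", "a", "s"], ["p", "a", "s", "m4", "m5"]):
        print(len(mods), "modalities:",
              len(pairwise_terms(mods)), "pairwise terms vs",
              len(all_subset_terms(mods)), "subset terms")
```

For three modalities this gives 3 pairwise terms versus 4 subset terms, which matches the counts in the Figure 1 caption; for five modalities it gives 10 versus 26, so the gap widens quickly as modalities are added.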

Paper Structure

This paper contains 27 sections, 14 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Comparison of shared information captured during inter-view learning between CooKIE and GFactorCL, the direct extension of FactorCL, for three modalities. CooKIE captures three shared information terms ($S_{z_{pa}}, S_{z_{ps}}, S_{z_{as}}$) without considering conditional mutual information (CMI). GFactorCL captures three conditional pairwise information terms ($S_{z_{pa|s}}, S_{z_{ps|a}}, S_{z_{as|p}}$) and one high-order information term ($S_{z_{pas}}$). Notably, the combined pairwise embeddings in CooKIE are larger than those in GFactorCL, as each shared information term captures high-order dependencies. Finally, the machine learning (ML)-based predictor utilizes all shared embeddings while handling multiple representations in predicting socioeconomic indicators.
  • Figure 2: CooKIE consists of two goals: (a) intra-view learning, which captures each modality's representation by comparing against other regions, and (b) pairwise inter-view learning, which extracts task-relevant unique and shared information across modalities. refers to an augmentation and represents pretraining each modality's backbone.
  • Figure 3: Error distribution in greenness score in NYC and Delhi between Urban2Vec and CooKIE ($\mathcal{S}+\mathcal{P}+\mathcal{A}$). The lighter the color (close to white), the smaller the error.
  • Figure 4: Effect of Modalities. We report $R^2$ for population density, greenness score, and crime rate, and $L1$ on land use classification to show the effect of modalities.
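
As a rough illustration of the pairwise inter-view idea in the Figure 1 and Figure 2 captions, the sketch below sums a standard symmetric InfoNCE contrastive loss over every pair of modality embeddings. This is a generic contrastive objective swapped in for illustration; the source does not state CooKIE's actual loss terms, so the function names, the temperature value, and the embedding shapes are all assumptions. The keys `p`, `a`, `s` simply mirror the subscripts in Figure 1.

```python
from itertools import combinations

import numpy as np


def info_nce(za, zb, temperature=0.1):
    """Symmetric InfoNCE between two embedding batches (row i = region i)."""
    za = za / np.linalg.norm(za, axis=1, keepdims=True)
    zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    logits = za @ zb.T / temperature  # cosine similarity of every region pair

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # stabilize the softmax
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))  # positives sit on the diagonal

    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))


def pairwise_inter_view_loss(embeddings, temperature=0.1):
    """Sum InfoNCE over all unordered modality pairs (assumed formulation)."""
    names = sorted(embeddings)
    return sum(info_nce(embeddings[a], embeddings[b], temperature)
               for a, b in combinations(names, 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z = rng.normal(size=(8, 16))  # 8 regions, 16-dim embeddings (hypothetical)
    aligned = {"p": z, "a": z, "s": z}
    unrelated = {"p": z,
                 "a": rng.normal(size=(8, 16)),
                 "s": rng.normal(size=(8, 16))}
    print("aligned:", pairwise_inter_view_loss(aligned))
    print("unrelated:", pairwise_inter_view_loss(unrelated))
```

When the modality embeddings of the same region agree, the summed loss is small; when they are unrelated, it is much larger. Adding a modality adds one loss term per new pair rather than one per new subset.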