Less is More: Multimodal Region Representation via Pairwise Inter-view Learning
Min Namgung, Yijun Lin, JangHyeon Lee, Yao-Yi Chiang
TL;DR
CooKIE introduces a scalable information-factorization framework for multimodal region representation learning (RRL), addressing the loss of modality-specific information in traditional inter-view contrastive learning. By combining intra-view pretraining with pairwise inter-view learning, it captures both unique and pairwise shared information across multiple geospatial modalities without explicit high-order dependency modeling. Empirical results on NYC and Delhi show state-of-the-art performance across population density, crime rate, greenness, and land use tasks, with substantial parameter and FLOP reductions compared to GFactorCL. The work demonstrates the practical benefit of pairwise inter-view learning for complex multimodal geospatial data and provides an efficient path to incorporating more modalities in RRL.
Abstract
With the increasing availability of geospatial datasets, researchers have explored region representation learning (RRL) to analyze complex region characteristics. Recent RRL methods use contrastive learning (CL) to capture shared information between two modalities but often overlook task-relevant unique information specific to each modality. Such modality-specific details can explain region characteristics that shared information alone cannot capture. Bringing information factorization to RRL can address this by factorizing multimodal data into shared and unique information. However, existing factorization approaches focus on two modalities, whereas RRL can benefit from various geospatial data. Extending factorization beyond two modalities is non-trivial because modeling high-order relationships introduces a combinatorial number of learning objectives, increasing model complexity. We introduce Cross modal Knowledge Injected Embedding (CooKIE), an information factorization approach for RRL that captures both shared and unique representations. CooKIE uses a pairwise inter-view learning approach that captures high-order information without modeling high-order dependency, avoiding exhaustive combinations. We evaluate CooKIE on three regression tasks and a land use classification task in New York City and Delhi, India. Results show that CooKIE outperforms existing RRL methods and a factorized RRL model, capturing multimodal information with fewer training parameters and floating-point operations (FLOPs). We release the code: https://github.com/MinNamgung/CooKIE.
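
To make the scaling argument in the abstract concrete, the following minimal sketch counts learning objectives under a simplifying assumption: exhaustive high-order modeling needs one objective per subset of two or more modalities, while pairwise inter-view learning needs one per unordered pair. The function names and the modality names are hypothetical, chosen only for illustration; they are not taken from the CooKIE code, and the paper's actual objective count may differ.

```python
from itertools import combinations


def pairwise_objectives(modalities):
    """Return every unordered pair of modalities (one objective per pair)."""
    return list(combinations(modalities, 2))


def high_order_objectives(modalities):
    """Return every subset of size >= 2 (assumed: one objective per subset)."""
    m = len(modalities)
    return [
        subset
        for k in range(2, m + 1)
        for subset in combinations(modalities, k)
    ]


if __name__ == "__main__":
    # Hypothetical modality names, used only to illustrate the counting.
    modalities = ["satellite", "poi", "mobility", "street_view"]
    pairs = pairwise_objectives(modalities)
    subsets = high_order_objectives(modalities)
    print(len(pairs), len(subsets))  # 6 11
```

With M modalities the pairwise count is M(M-1)/2, which grows quadratically, whereas the subset count is 2^M - M - 1, which grows exponentially. This gap is what the abstract refers to as avoiding exhaustive combinations.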
