GeoReg: Weight-Constrained Few-Shot Regression for Socio-Economic Estimation using LLM

Kyeongjin Ahn, Sungwon Han, Seungeon Lee, Donghyun Ahn, Hyoshin Kim, Jungwon Kim, Jihee Kim, Sangyoon Park, Meeyoung Cha

Abstract

Socio-economic indicators such as regional GDP, population, and education levels are crucial to shaping policy decisions and fostering sustainable development. This research introduces GeoReg, a regression model that integrates diverse data sources, including satellite imagery and web-based geospatial information, to estimate these indicators even for data-scarce regions such as developing countries. Our approach leverages the prior knowledge of a large language model (LLM) to address the scarcity of labeled data, with the language model functioning as a data engineer that extracts informative features to enable effective estimation in few-shot settings. Specifically, our model obtains contextual relationships between data features and the target indicator, categorizing their correlations as positive, negative, mixed, or irrelevant. These features are then fed into a linear estimator with tailored weight constraints for each category. To capture nonlinear patterns, the model also identifies meaningful feature interactions and integrates them along with nonlinear transformations. Experiments across three countries at different stages of development demonstrate that our model outperforms baselines in estimating socio-economic indicators, even for low-income countries with limited data availability.

Paper Structure

This paper contains 18 sections, 3 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Challenges of estimating socio-economic indicators. In few-shot settings, limited samples disrupt finding correct patterns in data. Few-shot samples in Feature-A align with its distribution, while those in Feature-B and Feature-C do not.
  • Figure 2: Module design to extract region-aware and neighbor-aware features from heterogeneous data sources for socio-economic indicator estimation.
  • Figure 3: Overview of GeoReg. In Stage 1, an LLM extracts the underlying relationships between modules and the target indicator by categorizing the module set $\mathcal{X}$, based on relevant meta-information, into four groups: Positive ($\mathcal{P}$), Negative ($\mathcal{N}$), Mixed ($\mathcal{M}$), or Irrelevant ($\mathcal{IR}$). It also discovers hidden interactions within the categorized subsets. The newly discovered modules in each group are added to the corresponding original group, and the resulting sets are denoted as $\tilde{\mathcal{P}}$, $\tilde{\mathcal{N}}$, and $\tilde{\mathcal{M}}$, respectively. In Stage 2, a linear regression model is trained to estimate the target indicator $\hat{y}$ using the outputs from Stage 1, along with additional augmented sets that include nonlinear transformations (i.e., $\mathcal{P'}$, $\mathcal{N'}$, and $\mathcal{M'}$), guided by distinct weight constraints that reflect their correlations.
  • Figure 4: Template prompt for module categorization in GeoReg. Key elements are highlighted in blue, with their corresponding meta-information in orange.
  • Figure 5: Template prompt for feature discovery in GeoReg.
  • ...and 4 more figures
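
The Stage 2 idea in Figure 3, a linear estimator whose weights are constrained by the category assigned to each feature, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes one plausible reading of the constraints (positive features get non-negative weights, negative features get non-positive weights, mixed features are unconstrained, irrelevant features are fixed at zero) and fits them with plain projected gradient descent. The feature names, the `BOUNDS` table, the `fit_constrained` helper, and the toy data are all hypothetical.

```python
from typing import Dict, List, Tuple

# Assumed mapping from category to (lower, upper) weight bounds.
# The source only says the constraints are "tailored" per category.
BOUNDS: Dict[str, Tuple[float, float]] = {
    "positive": (0.0, float("inf")),
    "negative": (float("-inf"), 0.0),
    "mixed": (float("-inf"), float("inf")),
    "irrelevant": (0.0, 0.0),
}


def fit_constrained(
    X: List[List[float]],
    y: List[float],
    categories: List[str],
    lr: float = 0.05,
    steps: int = 5000,
) -> Tuple[List[float], float]:
    """Least squares via projected gradient descent with per-feature bounds."""
    n, d = len(X), len(categories)
    w = [0.0] * d
    b = 0.0
    for _ in range(steps):
        # Residuals of the current linear prediction.
        residuals = [
            sum(wj * xj for wj, xj in zip(w, row)) + b - yi
            for row, yi in zip(X, y)
        ]
        # Gradients of the mean squared error.
        grad_w = [
            2.0 / n * sum(r * row[j] for r, row in zip(residuals, X))
            for j in range(d)
        ]
        grad_b = 2.0 / n * sum(residuals)
        b -= lr * grad_b
        for j, cat in enumerate(categories):
            lo, hi = BOUNDS[cat]
            # Gradient step, then projection back into the allowed interval.
            w[j] = min(max(w[j] - lr * grad_w[j], lo), hi)
    return w, b


# Synthetic few-shot data (six regions, four hypothetical features).
feature_names = ["nightlight", "distance_to_road", "elevation", "noise"]
categories = ["positive", "negative", "mixed", "irrelevant"]
X = [
    [0.9, 0.2, 0.3, 0.5],
    [0.2, 0.7, 0.6, 0.1],
    [0.5, 0.1, 0.1, 0.9],
    [0.7, 0.6, 0.8, 0.4],
    [0.1, 0.9, 0.4, 0.7],
    [0.4, 0.3, 0.9, 0.2],
]
# Toy target generated from a known linear rule; the last feature is unused.
y = [2.0 * r[0] - 1.0 * r[1] + 0.5 * r[2] + 1.0 for r in X]

weights, bias = fit_constrained(X, y, categories)
for name, cat, wj in zip(feature_names, categories, weights):
    print(f"{name:>18} ({cat}): {wj:.3f}")
print(f"{'bias':>18}: {bias:.3f}")
```

Because the projection runs after every update, the learned weights cannot take a sign that contradicts the assigned category, which is how a categorical prior could keep a few-shot fit from following spurious patterns like those shown in Figure 1.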