Platonic Representations for Poverty Mapping: Unified Vision-Language Codes or Agent-Induced Novelty?

Satiyabooshan Murugaboopathy, Connor T. Jerzak, Adel Daoud

TL;DR

This work investigates how socio-economic indicators like the International Wealth Index can be recovered from both satellite imagery and text using a five-pipeline multimodal framework. It jointly analyzes vision, LLM-generated narratives, and AI-search agent text, testing the Platonic Representation Hypothesis of cross-modal convergence and the Agent-Induced Novelty Hypothesis in Africa using a DHS-derived, large-scale multimodal dataset. The results show that fused vision-language representations outperform single modalities, with moderate cross-modal convergence (median cosine similarity ~0.60) and limited evidence for agent-driven novelty, while LLM internal knowledge often rivals or surpasses agent-sourced information. The study provides a scalable, open dataset and insights into how multi-source signals can support policy-relevant poverty monitoring, while highlighting remaining limitations such as geographic sampling biases and computational costs.

Abstract

We investigate whether socio-economic indicators like household wealth leave recoverable imprints in satellite imagery (capturing physical features) and Internet-sourced text (reflecting historical/economic narratives). Using Demographic and Health Survey (DHS) data from African neighborhoods, we pair Landsat images with LLM-generated textual descriptions conditioned on location/year and text retrieved by an AI search agent from web sources. We develop a multimodal framework predicting household wealth (International Wealth Index) through five pipelines: (i) vision model on satellite images, (ii) LLM using only location/year, (iii) AI agent searching/synthesizing web text, (iv) joint image-text encoder, (v) ensemble of all signals. Our framework yields three contributions. First, fusing vision and agent/LLM text outperforms vision-only baselines in wealth prediction (e.g., R-squared of 0.77 vs. 0.63 on out-of-sample splits), with LLM-internal knowledge proving more effective than agent-retrieved text, improving robustness to out-of-country and out-of-time generalization. Second, we find partial representational convergence: fused embeddings from vision/language modalities correlate moderately (median cosine similarity of 0.60 after alignment), suggesting a shared latent code of material well-being while retaining complementary details, consistent with the Platonic Representation Hypothesis. Although LLM-only text outperforms agent-retrieved data, challenging our Agent-Induced Novelty Hypothesis, modest gains from combining agent data in some splits weakly support the notion that agent-gathered information introduces unique representational structures not fully captured by static LLM knowledge. Third, we release a large-scale multimodal dataset comprising more than 60,000 DHS clusters linked to satellite images, LLM-generated descriptions, and agent-retrieved texts.
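
As an illustration of how a "median cosine similarity after alignment" statistic could be computed, the sketch below aligns one embedding space to another and then compares matched rows. It rests on several assumptions that are not stated in the abstract: the alignment is taken to be an orthogonal Procrustes rotation, the rotation is fit on one half of the clusters and scored on the other half, and the embeddings are synthetic stand-ins rather than the paper's data. The names `procrustes_align`, `median_cosine_similarity`, and `demo` are hypothetical.

```python
import numpy as np


def procrustes_align(source, target):
    """Return the orthogonal matrix R minimizing ||source @ R - target||_F."""
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt


def median_cosine_similarity(a, b):
    """Median over rows of the cosine similarity between a[i] and b[i]."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return float(np.median(np.sum(a_unit * b_unit, axis=1)))


def demo(seed=0, n=500, d=16, noise=0.5):
    """Synthetic example: two noisy views of one shared latent code."""
    rng = np.random.default_rng(seed)
    shared = rng.normal(size=(n, d))  # hypothetical shared latent code
    rotation, _ = np.linalg.qr(rng.normal(size=(d, d)))  # unknown basis change

    # Stand-ins for vision and text embeddings (assumption, not real data).
    vision = shared + noise * rng.normal(size=(n, d))
    text = shared @ rotation + noise * rng.normal(size=(n, d))

    # Fit the alignment on the first half, evaluate on the held-out half.
    half = n // 2
    r = procrustes_align(vision[:half], text[:half])
    before = median_cosine_similarity(vision[half:], text[half:])
    after = median_cosine_similarity(vision[half:] @ r, text[half:])
    return before, after


if __name__ == "__main__":
    before, after = demo()
    print(f"median cosine before alignment: {before:.2f}")
    print(f"median cosine after alignment:  {after:.2f}")
```

In this toy setting the similarity is low before alignment because the two spaces use different bases, and it rises once the rotation is applied. A value well below 1 after alignment would be read, as in the abstract, as partial convergence with modality-specific detail left over.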

Paper Structure

This paper contains 22 sections, 10 figures, 6 tables.

Figures (10)

  • Figure 1: DAG representing the imprint of the poverty/wealth index $Y_i$ on satellite imagery $\mathbf{M}_i$ and textual data $\mathbf{W}_i$, processed by LLM's neural weights and queried by AI Search Agent. $\mathbf{W}_i$ is challenging to observe directly, hence colored in red. Instead, the goal is to reconstruct a faithful representation via the LLM's neural memory or AI agent search capabilities. These texts are denoted $\mathbf{W}_i^{\text{Desc}}$ for the LLM, and $\mathbf{W}_i^{\text{Trace}}$ (raw search result) and $\mathbf{W}_i^{\text{Summary}}$ (summary of search result) for the search agent. $\mathbf{B}_i$ represents background factors.
  • Figure 2: Overview of the quintuple prediction framework for estimating household wealth (IWI) from DHS clusters. The Agent-Induced Novelty Hypothesis questions whether the red arrow is actually present to the same degree as others.
  • Figure 3: Test performance ($R^2$) across evaluation splits (Random, OOC, OOT), grouped by model procedure and data source. Each dot represents the $R^2$ for a model under a specific split strategy. The dot shape encodes the split type, and the dot color encodes the embedding model. Rows are ordered by highest $R^2$ for readability. This visualization highlights the performance of various model+embedding combinations across generalization scenarios.
  • Figure 4: Spatial distribution of the difference in prediction residuals between CV-only and NMR+CV. Blue indicates regions where the baseline CV model outperforms NMR+CV. Red indicates regions where NMR+CV outperforms CV. Hexagons summarize values across locations.
  • Figure 5: Absolute mean residual difference across years for the best system: NMR+ASA+CV vs. CV-only baseline.
  • ...and 5 more figures