SemaPop: Semantic-Persona Conditioned Population Synthesis

Zhenlin Qin, Yancheng Ling, Leizhen Wang, Francisco Câmara Pereira, Zhenliang Ma

TL;DR

Results demonstrate that SemaPop-GAN enables controllable and interpretable population synthesis through effective semantic-statistical information fusion, and provides a promising modular foundation for developing generative population projection systems that integrate individual-level behavioral semantics with population-level statistical constraints.

Abstract

Population synthesis is a critical component of individual-level socio-economic simulation, yet it remains challenging because statistical structure and latent behavioral semantics must be represented jointly. Existing population synthesis approaches rely predominantly on structured attributes and statistical constraints, leaving a gap in semantic-conditioned population generation that can capture abstract behavioral patterns implicit in survey data. This study proposes SemaPop, a semantic-statistical population synthesis model that integrates large language models (LLMs) with generative population modeling. SemaPop derives high-level persona representations from individual survey records and incorporates them as semantic conditioning signals for population generation, while marginal regularization is introduced to enforce alignment with target population marginals. In this study, the framework is instantiated with a Wasserstein GAN with gradient penalty (WGAN-GP) backbone; this instantiation is referred to as SemaPop-GAN. Extensive experiments demonstrate that SemaPop-GAN achieves improved generative performance, yielding closer alignment with target marginal and joint distributions while maintaining sample-level feasibility and diversity under semantic conditioning. Ablation studies further confirm the contribution of semantic persona conditioning and architectural design choices to balancing marginal consistency and structural realism. These results demonstrate that SemaPop-GAN enables controllable and interpretable population synthesis through effective semantic-statistical information fusion. SemaPop-GAN also provides a promising modular foundation for developing generative population projection systems that integrate individual-level behavioral semantics with population-level statistical constraints.

Paper Structure (48 sections, 37 equations, 12 figures, 8 tables)

Figures (12)

  • Figure 1: Overview of the proposed SemaPop framework, divided into training and generation phases. In training, population agents $x$ extracted from individual survey data are used to prompt a frozen LLM to generate persona text. The persona text is then encoded into a persona embedding and injected into the generator via a conditioning module. The generator produces $\hat{x}$ conditioned on persona semantics and is optimized against real data $x$ under a specified training objective. In generation, persona text is encoded in the same way and combined with Gaussian noise to synthesize new population agents $\hat{x}$.
  • Figure 2: Prompt template for persona generation. The template specifies structured demographic, household, and behavioral inputs, together with explicit guidance and constraints, for prompting an LLM to generate persona descriptions. Content shown in brackets corresponds to an example for a specific population agent.
  • Figure 3: Persona Embedding Module. Persona text is first tokenized and processed by the LLM to obtain hidden states from the last $L$ Transformer layers. These layer-wise representations are averaged, followed by masked mean pooling over tokens and LayerNorm, to produce a fixed-dimensional persona embedding.
  • Figure 4: Semantic conditioning via FiLM modulation. Persona embeddings are adapted into a conditioning vector and fused into the generator through feature-wise FiLM modulation at intermediate layers; the $\oplus$ symbol denotes conditional affine modulation rather than additive or residual operations.
  • Figure 5: Overview of the SemaPop-GAN model. Persona embeddings condition a WGAN-GP-based generator through FiLM, while marginal regularization enforces alignment with target population statistics under both factual and counterfactual scenarios.
  • ...and 7 more figures
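
The pooling steps named in the Figure 3 caption and the feature-wise modulation named in the Figure 4 caption can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the function names, the nested-list layout of `hidden_states` (layers, then tokens, then features), and the LayerNorm without learned scale and shift are all assumptions made for illustration. In particular, `gamma` and `beta` are passed in directly here, whereas the captions indicate they would come from a learned adapter applied to the persona embedding.

```python
import math


def average_last_layers(hidden_states, num_layers):
    """Average the last `num_layers` layer outputs, token by token.

    `hidden_states` is assumed to be a list of layers, each a list of
    token vectors (plain lists of floats).
    """
    layers = hidden_states[-num_layers:]
    n_tokens, dim = len(layers[0]), len(layers[0][0])
    return [
        [sum(layer[t][d] for layer in layers) / len(layers) for d in range(dim)]
        for t in range(n_tokens)
    ]


def masked_mean_pool(token_vectors, attention_mask):
    """Mean over tokens whose mask is 1, so padding tokens are ignored."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


def layer_norm(vector, eps=1e-5):
    """Normalize to zero mean and unit variance (no learned parameters)."""
    mean = sum(vector) / len(vector)
    var = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / math.sqrt(var + eps) for v in vector]


def persona_embedding(hidden_states, attention_mask, num_layers):
    """Layer averaging, then masked mean pooling, then LayerNorm."""
    averaged = average_last_layers(hidden_states, num_layers)
    pooled = masked_mean_pool(averaged, attention_mask)
    return layer_norm(pooled)


def film(features, gamma, beta):
    """Feature-wise affine modulation: gamma * h + beta per feature."""
    return [g * h + b for h, g, b in zip(features, gamma, beta)]


# Toy input: 3 layers, 3 tokens (the last one is padding), 2 features.
hidden_states = [
    [[0.0, 0.0], [0.0, 0.0], [9.0, 9.0]],      # skipped when num_layers=2
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
    [[3.0, 4.0], [5.0, 6.0], [100.0, 100.0]],
]
attention_mask = [1, 1, 0]

averaged = average_last_layers(hidden_states, num_layers=2)
pooled = masked_mean_pool(averaged, attention_mask)
embedding = persona_embedding(hidden_states, attention_mask, num_layers=2)
modulated = film([1.0, 2.0], gamma=[2.0, 0.5], beta=[0.0, 1.0])
```

In this toy input the padding token carries large values, so an unmasked mean would be dominated by it; the mask keeps the pooled vector tied to real tokens only. The `film` call scales and shifts each feature independently, which matches the caption's note that the fusion is a conditional affine modulation rather than an additive or residual operation.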