Exact Synthetic Populations for Scalable Societal and Market Modeling

Thierry Petit, Arnault Pachot

TL;DR

The paper presents a constraint-programming (CP) framework that generates synthetic populations matching target distributions while guaranteeing individual-level coherence, without requiring microdata. It uses a batched solving approach with distribution constraints, optional diversity constraints, and support for interdependent features, enabling scalable generation for applications such as virtual polling, territorial intelligence, and AI-driven content evaluation. The authors demonstrate near-exact distribution matches on real statistics, explore trade-offs between constraint density and accuracy, and highlight practical uses through the Pollitics platform, which couples CP-generated agents with large language models. The work emphasizes data privacy, reproducibility, and data-validation capabilities, proposing a principled path to simulating complex populations for policy, market, and communications analytics.

Abstract

We introduce a constraint-programming framework for generating synthetic populations that reproduce target statistics with high precision while enforcing full individual consistency. Unlike data-driven approaches that infer distributions from samples, our method directly encodes aggregated statistics and structural relations, enabling exact control of demographic profiles without requiring any microdata. We validate the approach on official demographic sources and study the impact of distributional deviations on downstream analyses. This work is conducted within the Pollitics project developed by Emotia, where synthetic populations can be queried through large language models to model societal behaviors, explore market and policy scenarios, and provide reproducible decision-grade insights without personal data.

Paper Structure

This paper contains 20 sections, 4 theorems, 6 equations, 5 figures, 4 tables, 1 algorithm.

Key Result

Proposition (Largest Remainder Rounding)

Let $p_1,\dots,p_q \in [0,100]$ be target percentages with $\sum_{i=1}^q p_i = 100$, and let $N \in \mathbb{N}$ be the total number of individuals. Define the (real-valued) ideal allocations $f_i = p_i N / 100$, their integer parts $t_i^{(0)}=\lfloor f_i \rfloor$, and their fractional parts $r_i = f_i - t_i^{(0)}$. Let $R = N - \sum_{i=1}^q t_i^{(0)}$ be the number of individuals left to assign, let $S \subseteq \{1,\dots,q\}$ be the indices of the $R$ largest $r_i$ (break ties arbitrarily), and set
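The statement is cut off at this point. As an illustration only, the construction it names matches the standard largest remainder method, sketched below in Python. The sketch assumes the final step gives one extra individual to each index in $S$ and leaves the other counts at their integer parts. The function name `largest_remainder_round` and the tie-breaking rule (lowest index first) are hypothetical choices, not taken from the paper.

```python
from fractions import Fraction
from math import floor


def largest_remainder_round(percentages, n):
    """Round ideal allocations p_i * n / 100 to integers that sum to n.

    Assumption: indices with the largest fractional parts each receive
    one extra unit (standard largest remainder method).
    """
    # Exact rational arithmetic avoids floating-point noise in the remainders.
    ideal = [Fraction(str(p)) * n / 100 for p in percentages]
    base = [floor(f) for f in ideal]                    # t_i^(0)
    remainders = [f - b for f, b in zip(ideal, base)]   # r_i
    leftover = n - sum(base)                            # R

    # Indices sorted by decreasing remainder; ties broken by lowest index
    # (an arbitrary choice, as the statement allows).
    order = sorted(range(len(ideal)), key=lambda i: (-remainders[i], i))
    chosen = set(order[:leftover])                      # S

    return [b + 1 if i in chosen else b for i, b in enumerate(base)]


if __name__ == "__main__":
    print(largest_remainder_round([50, 30, 20], 7))
```

With percentages 50, 30, 20 and $N = 7$, the ideal allocations are 3.5, 2.1, 1.4, so $R = 1$ and the first category receives the extra individual, giving counts 4, 2, 1.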

Figures (5)

  • Figure 1: Absolute error between estimated and true vote proportions (A, B, DK), as a function of sample size. Each group of three bars represents the error for vote A, B, and DK in a specific scenario, compared with the true results over $N = 100{,}000$ individuals.
  • Figure 2: Average MAPE by batch size, number of constraints and tightness.
  • Figure 3: Comparison between human IPSOS polling results and virtual polling results obtained with synthetic populations queried via LLMs (“LLM + Power”) on the question: “Should people be able to take refuge in other countries to escape war or persecution?”.
  • Figure 4: French territorial economic intelligence tool: synthetic populations of Lyon–Saint-Étienne–Roanne districts mapped to local economic indicators.
  • Figure 5: Illustrative example of AI-driven text evaluation: multi-criteria scoring and optimisation suggestions for a target audience.

Theorems & Definitions (11)

  • Definition: Abstract Constraints
  • Proposition: Largest Remainder Rounding
  • Proof
  • Definition: Distribution Constraint
  • Definition: Extension-Preserving Optimality
  • Proposition
  • Proposition
  • Proof
  • Definition: Diversity Constraint
  • Proposition
  • ...and 1 more