SynthPop++: A Hybrid Framework for Generating a Country-Scale Synthetic Population
Bhavesh Neekhra, Kshitij Kapoor, Debayan Gupta
TL;DR
SynthPop++ addresses the need for privacy-preserving, real-scale population data for agent-based modelling by fusing multiple surveys with partially overlapping attributes. Its hybrid pipeline combines marginal-distribution matching via Iterative Proportional Updating (IPU) with joint-attribute modelling via CTGAN, and adds density-aware geolocation sampling and distance-based external-location assignment to create realistic family structures and networks. The approach is validated on Indian data and demonstrated in BharatSim, showing fidelity in both marginal and joint distributions and applicability to district-, state-, and country-scale simulations. The work supports policy analysis and disease modelling under privacy constraints by enabling transparent, reproducible synthetic population generation with rich geographical and social structure.
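To make the idea of marginal-distribution matching concrete, here is a minimal sketch of classical iterative proportional fitting (IPF) on a two-way table. This is not the SynthPop++ implementation: IPU extends the same rescaling idea to household weights so that household-level and person-level margins are matched together, and that extension is not shown here. The function name `ipf`, the age-by-sex table, and all counts and targets are made-up assumptions for illustration.

```python
def ipf(seed, row_targets, col_targets, max_iter=1000, tol=1e-9):
    """Rescale a positive seed table until its row and column sums
    match the target margins (classical IPF, not IPU)."""
    table = [row[:] for row in seed]  # work on a copy of the seed
    for _ in range(max_iter):
        # Row step: scale each row to its target total.
        for i, target in enumerate(row_targets):
            total = sum(table[i])
            if total > 0:
                table[i] = [v * target / total for v in table[i]]
        # Column step: scale each column to its target total.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in table)
            if total > 0:
                for row in table:
                    row[j] *= target / total
        # Columns are exact after the column step, so check the rows.
        row_err = max(abs(sum(row) - t) for row, t in zip(table, row_targets))
        if row_err < tol:
            break
    return table


# Hypothetical survey sample: rows = age group (child, adult, senior),
# columns = sex (male, female). Values are illustrative only.
seed = [[12.0, 10.0], [30.0, 34.0], [6.0, 8.0]]
row_targets = [300.0, 550.0, 150.0]  # assumed census totals by age group
col_targets = [510.0, 490.0]         # assumed census totals by sex

fitted = ipf(seed, row_targets, col_targets)
row_sums = [sum(row) for row in fitted]
col_sums = [sum(row[j] for row in fitted) for j in range(len(col_targets))]
```

The fitted table keeps the association pattern of the seed sample while reproducing the target margins, which is the property a marginal-matching step relies on. Both margin vectors must sum to the same total for the procedure to converge.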
Abstract
Population censuses are vital to public policy decision-making. They provide insight into human resources, demography, culture, and economic structure at local, regional, and national levels. However, such surveys are expensive (especially for low- and middle-income countries with large populations, such as India), time-consuming, and may raise privacy concerns, depending on the kinds of data collected. In light of these issues, we introduce SynthPop++, a novel hybrid framework that combines data from multiple real-world surveys (with different, partially overlapping sets of attributes) to produce a real-scale synthetic population of humans. Critically, our population maintains family structures comprising individuals with demographic, socioeconomic, health, and geolocation attributes: our "fake" people live in realistic locations, have realistic families, and so on. Such data can be used for a variety of purposes; we explore one such use case, agent-based modelling of infectious disease in India. To gauge the quality of our synthetic population, we use both machine learning and statistical metrics. Our experimental results show that our synthetic population can realistically represent the population of various administrative units of India, producing real-scale, detailed data at the desired level of zoom: from cities, to districts, to states, eventually combining to form a country-scale synthetic population.
