Data-Driven Strategies for Detecting and Sampling Misrepresented Subgroups
G. Lancia, F. Mecatti, E. Riccomagno
TL;DR
The paper addresses the under-representation of rare or hard-to-reach subgroups in EU-SILC data by reframing their detection as an outlier problem and applying unsupervised methods. It combines an entropy-based univariate score, kernel PCA, and auto-encoders to identify misrepresented groups, validates the results through internal and stability checks, and interprets them through variable inspection and spectral clustering. An application to the 2019 EU-SILC data for Liguria reveals misrepresentation patterns linked to citizenship, deprivation, and household structure, which motivates targeted sampling. The authors then explore integrative sampling strategies and show when stratified or multi-frame designs, paired with appropriate estimators, yield efficiency gains, enabling data enrichment and region-specific, policy-relevant inclusiveness. Overall, the approach offers a practical, data-driven framework for improving equity of representation in large-scale surveys and can be generalized to similar social-research contexts.
Abstract
Economic policy research frequently examines population well-being, with a particular focus on the relationships between unequal living conditions, low educational attainment, and social exclusion. Sample surveys, such as EU-SILC, are widely used for this purpose and inform public policy; yet, their sampling designs may fail to adequately represent rare, hard-to-sample, or under-covered subgroups. This limitation can hinder socio-demographic analyses and evidence-based policy design. We propose a generalisable approach based on univariate and multivariate unsupervised learning techniques to detect outliers in survey data that may signal under-represented subgroups. Identified groups can then be characterised to inform targeted resampling strategies that improve survey inclusiveness. An empirical application using the 2019 EU-SILC data for the Italian region of Liguria shows that citizenship, material deprivation, large household size, and economic vulnerability are key indicators of under-representation.
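To make the idea of a univariate outlier score concrete, here is a minimal sketch in Python. It is not the authors' entropy-based score, whose definition is not given in this summary. It assumes a simple stand-in: each observation is scored by the surprisal of its category, -log2 of the category's relative frequency, and observations above a chosen threshold are flagged as candidates for an under-represented subgroup. The function names, the threshold, and the toy citizenship labels are all hypothetical.

```python
import math
from collections import Counter


def surprisal_scores(values):
    """Score each observation by -log2 of its category's relative frequency.

    Rare categories receive high scores. This is an illustrative stand-in,
    not the score defined in the paper.
    """
    counts = Counter(values)
    n = len(values)
    return [-math.log2(counts[v] / n) for v in values]


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical category distribution."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def flag_rare(values, threshold_bits):
    """Return the indices of observations whose surprisal exceeds the threshold."""
    scores = surprisal_scores(values)
    return [i for i, s in enumerate(scores) if s > threshold_bits]


# Toy, invented data: a categorical variable with one dominant category.
citizenship = ["national"] * 6 + ["other_EU", "non_EU"]
flagged = flag_rare(citizenship, threshold_bits=2.0)
```

In this toy example the two singleton categories each have relative frequency 1/8 and therefore a surprisal of 3 bits, so only those two observations are flagged. The Shannon entropy of the variable is the mean surprisal over observations, which makes it one natural reference point when choosing a threshold.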
