Data-Driven Strategies for Detecting and Sampling Misrepresented Subgroups

G. Lancia; F. Mecatti; E. Riccomagno

TL;DR

The paper tackles the under-representation of rare or hard-to-reach subgroups in EU-SILC data by reframing detection as an outlier problem and applying unsupervised methods. It combines an entropy-based univariate score, kernel PCA, and auto-encoders to identify misrepresented groups, validates the results through internal and stability checks, and supports interpretation through variable inspection and spectral clustering. An empirical application to the 2019 EU-SILC data for Liguria reveals misrepresentation patterns linked to citizenship, deprivation, and household structure, motivating targeted sampling. The authors then explore integrative sampling strategies, showing when stratified or multi-frame designs with appropriate estimators yield efficiency gains, which supports data enrichment and region-specific, policy-relevant inclusiveness. Overall, the approach offers a practical, data-driven framework for improving representation in large-scale surveys, and the authors present it as generalisable to similar social-research contexts.

Abstract

Economic policy research frequently examines population well-being, with a particular focus on the relationships between unequal living conditions, low educational attainment, and social exclusion. Sample surveys, such as EU-SILC, are widely used for this purpose and inform public policy; yet, their sampling designs may fail to adequately represent rare, hard-to-sample, or under-covered subgroups. This limitation can hinder socio-demographic analyses and evidence-based policy design. We propose a generalisable approach based on univariate and multivariate unsupervised learning techniques to detect outliers in survey data that may signal under-represented subgroups. Identified groups can then be characterised to inform targeted resampling strategies that improve survey inclusiveness. An empirical application using the 2019 EU-SILC data for the Italian region of Liguria shows that citizenship, material deprivation, large household size, and economic vulnerability are key indicators of under-representation.

Paper Structure (36 sections, 32 equations, 4 figures, 6 tables)

Introduction
Background, Context and Data
The Liguria Project
2019 EU-SILC: Sample Design and Data
Liguria Complete Tax Administrative Data
2019 Tax Declaration Dataset
Comparative Analysis of EU-SILC and Tax Declaration Dataset
Detecting misrepresented subgroups: the proposed methodology
Anomaly detection models
Entropy Score
Kernel Principal Component Analysis
Auto-Encoder
Outlier Detection
Stability Validation
Internal Validation
...and 21 more sections

Figures (4)

Figure 1: Comparison of household-level distributions for children aged 0-17 and women aged 15-55, in the Ligurian subsample of the 2019 EU-SILC survey and the administrative Tax Declaration Dataset for Liguria (2019). The graphs illustrate discrepancies between the two data sources and potential coverage issues in EU-SILC sample data.
Figure 2: Estimated population means with increasing numbers of Monte Carlo replications under three sampling scenarios: (a) stratified sampling design, (b) MF sampling design with the SM estimator, and (c) MF sampling design with the PML estimator. The trajectories illustrate the convergence of the estimators toward the theoretical population mean (dotted line). For each scenario, results are reported under both proportional and optimal cost allocation strategies. Error bars denote 95% confidence intervals obtained via bootstrap resampling.
Figure 3: Box plots of the RB estimates for the STS and MF estimators across Monte Carlo iterations, by allocation strategy.
Figure 4: Scatter plot of the synthetic bivariate Gaussian population, stratified by sampling frame. Marginal distributions of each variable are shown along the axes for each stratum.
