Distributionally balanced sampling designs

Anton Grafström, Wilmer Prentius

Abstract

We propose Distributionally Balanced Designs (DBD), a new class of probability sampling designs that target representativeness at the level of the full auxiliary distribution rather than selected moments. In disciplines such as ecology, forestry, and environmental sciences, where field data collection is expensive, maximizing the information extracted from a limited sample is critical. More precisely, DBD can be viewed as minimum discrepancy designs that minimize the expected discrepancy between the sample and population auxiliary distributions. The key idea is to construct samples whose empirical auxiliary distribution closely matches that of the population. We present a first implementation of DBD based on an optimized circular ordering of the population, combined with random selection of a contiguous block of units. The ordering is chosen to minimize the design-expected energy distance, a discrepancy measure that captures differences between distributions beyond low-order moments. This criterion promotes strong spatial spread, and yields low variance for Horvitz-Thompson estimators of totals of functions that vary smoothly with respect to auxiliaries. Simulation results show that approximate DBD achieves better distributional fit than state-of-the-art methods such as the local pivotal and local cube designs. Hence, DBD can improve the reliability of estimates from costly field data, making distributional balancing effective for constructing representative surveys in resource-constrained applications.
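The two ingredients named in the abstract, a contiguous block drawn from a circular ordering and the energy distance between the sample and population auxiliary distributions, can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the function names `energy_distance` and `circular_block` are hypothetical, the energy distance uses the standard plug-in form $2\,E\|X-Y\| - E\|X-X'\| - E\|Y-Y'\|$ (the paper's normalization may differ), and the ordering is a random permutation rather than an optimized one.

```python
import numpy as np

def energy_distance(a, b):
    """Plug-in energy distance between two point sets.

    a, b: arrays of shape (m, p) and (N, p); rows are units, columns are
    auxiliary variables. Assumed form: 2 E|X-Y| - E|X-X'| - E|Y-Y'|.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    d_ab = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    d_aa = np.linalg.norm(a[:, None, :] - a[None, :, :], axis=-1)
    d_bb = np.linalg.norm(b[:, None, :] - b[None, :, :], axis=-1)
    return 2.0 * d_ab.mean() - d_aa.mean() - d_bb.mean()

def circular_block(order, start, n):
    """Units at positions start, ..., start+n-1 (mod N) of the circular ordering."""
    order = np.asarray(order)
    positions = (start + np.arange(n)) % len(order)
    return order[positions]

rng = np.random.default_rng(0)
N, n, p = 200, 10, 2
x = rng.normal(size=(N, p))      # synthetic auxiliary variables (assumption)
order = rng.permutation(N)       # placeholder ordering, NOT optimized

start = rng.integers(N)          # uniform random start on the circle
sample = circular_block(order, start, n)
d = energy_distance(x[sample], x)
print(f"sample size {len(sample)}, energy distance to population {d:.4f}")
```

Because the start position is uniform and every unit lies in exactly $n$ of the $N$ possible blocks, each unit has inclusion probability $n/N$ in this sketch, regardless of the ordering.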

Paper Structure

This paper contains 9 sections, 1 theorem, 30 equations, 3 figures, 3 tables, 1 algorithm.

Key Result

Proposition 1

Let $k$ be the reproducing kernel associated with the energy distance (Sejdinovic et al., 2013) so that, for any probability measures $P,Q$ with finite first moment, where $\mu_P=E_{\bm{X}\sim P}[k(\bm{X},\cdot)]\in\mathcal{H}_k$ denotes the kernel mean embedding. Let $f\in\mathcal{H}_k$, define $y_i=f(\bm{x}_i)$, and consider the equal-probability fixed-size estimator Then for any sampling design with a r

Figures (3)

  • Figure 1: Illustration of the circular sequence design for a population of size $N=20$ and sample size $n=4$, before and after optimization. The nodes represent population units, with grayscale intensity indicating the values of their auxiliary variables. The inner circle of numbers provides the indices of the units in $U$. The samples $s_1$ and $s_{11}$ are shown as shaded sectors. Left: Initial sequence. Many samples selected as a contiguous block are unrepresentative of the population. Right: The optimized design where the sequence has been reordered. All contiguous blocks of size $n=4$ now provide a good representation of the population distribution.
  • Figure 2: Left panel: expected energy distance of a circular DBD at iterations from 50k-2000k, $\pm 2$ standard deviations. Right panel: box-plot of energy distances of 10000 samples of size $n=50$ selected by the local pivotal method with the same $p=5$ auxiliary variables.
  • Figure 3: Distributions of the different metrics under three designs with sample size $n=50$. Colors represent the designs: gray is Circular DBD ($10^7$ iterations), orange is LCube, blue is LPM. First row: energy distance. Second row: the local balance measure. Third row: spatial balance. Fourth row: balance deviation. Columns: number of auxiliary variables.
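The contrast in Figure 1 between an initial and a reordered sequence can be mimicked on a toy population with $N=20$ and $n=4$. The sketch below averages the energy distance over all $N$ contiguous blocks, which is the design expectation when the start position is uniform, and compares two orderings. Everything here is an assumption for illustration: the helper names are hypothetical, the single auxiliary variable is an evenly spaced grid, and the "interleaved" ordering is built by hand for this one-dimensional case. It is not the paper's optimization algorithm.

```python
import numpy as np

def energy_distance(a, b):
    """Plug-in energy distance: 2 E|X-Y| - E|X-X'| - E|Y-Y'| (assumed form)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    d_ab = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    d_aa = np.linalg.norm(a[:, None, :] - a[None, :, :], axis=-1)
    d_bb = np.linalg.norm(b[:, None, :] - b[None, :, :], axis=-1)
    return 2.0 * d_ab.mean() - d_aa.mean() - d_bb.mean()

def expected_energy_distance(order, x, n):
    """Average energy distance over all N equally likely contiguous blocks."""
    order = np.asarray(order)
    N = len(order)
    # Row i holds the circular positions i, i+1, ..., i+n-1 (mod N).
    positions = (np.arange(N)[:, None] + np.arange(n)[None, :]) % N
    return float(np.mean([energy_distance(x[order[row]], x) for row in positions]))

N, n = 20, 4
x = np.linspace(0.0, 1.0, N).reshape(N, 1)   # one auxiliary variable, sorted

# Ordering A: units placed on the circle in sorted order, so each block
# contains four neighbouring values.
sorted_order = np.arange(N)

# Ordering B: hand-made interleaving 0, 5, 10, 15, 1, 6, 11, 16, ...
# so each block picks up values from across the range.
interleaved = np.arange(N).reshape(n, N // n).T.ravel()

e_sorted = expected_energy_distance(sorted_order, x, n)
e_interleaved = expected_energy_distance(interleaved, x, n)
print(f"sorted ordering:      {e_sorted:.4f}")
print(f"interleaved ordering: {e_interleaved:.4f}")
```

The interleaved ordering gives the smaller expected energy distance, since every block then covers the whole range of the auxiliary variable. With several auxiliary variables no such ordering can be written down by hand, which is where a numerical search over orderings becomes necessary.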

Theorems & Definitions (6)

  • Definition 1: Distributionally balanced design
  • Proposition 1: Upper bound on the mean square error via energy distance
  • Example 1: Decay of the expected energy distance and variability across optimizations
  • Example 2: Comparisons with some existing designs
  • Example 3: Evaluation using the Meuse dataset
  • proof