Proxy-Guided Measurement Calibration

Saketh Vishnubhatla, Shu Wan, Andre Harrison, Adrienne Raglin, Huan Liu

Abstract

Aggregate outcome variables collected through surveys and administrative records are often subject to systematic measurement error. For instance, in disaster loss databases, reported county-level losses may differ from the true damages because of variation in on-the-ground data collection capacity, reporting practices, and event characteristics. Such miscalibration complicates downstream analysis and decision-making. We study the problem of outcome miscalibration and propose a framework, guided by proxy variables, for estimating and correcting these systematic errors. We model the data-generating process with a causal graph that separates the latent content variables driving the true outcome from the latent bias variables that induce systematic errors. The key insight is that proxy variables that depend on the true outcome but are independent of the bias mechanism provide identifying information for quantifying the bias. Leveraging this structure, we introduce a two-stage approach that uses variational autoencoders to disentangle content and bias latents, enabling us to estimate the effect of bias on the outcome of interest. We analyze the assumptions underlying our approach and evaluate it on synthetic data, semi-synthetic datasets derived from randomized trials, and a real-world case study of disaster loss reporting.
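
To make the data-generating process concrete, the following is a minimal simulation sketch of the causal structure described above and drawn in Figure 1. It is not the authors' synthetic benchmark: the scalar variables, the linear and logistic functional forms, every coefficient, and the names `simulate`, `mean`, and `naive_gap` are assumptions chosen purely for illustration. In the paper's setting $A$ and $Y_{\mathrm{true}}$ are latent; the simulation exposes them only so the size of the bias can be checked.

```python
import math
import random


def simulate(n=5000, m=3, alpha=-2.0, seed=0):
    """Draw n samples from a toy version of the Figure 1 structure.

    All functional forms and coefficients are illustrative assumptions.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        e = rng.gauss(0.0, 1.0)                    # environment E
        z = 0.8 * e + rng.gauss(0.0, 1.0)          # latent content Z depends on E
        p_bias = 1.0 / (1.0 + math.exp(-1.5 * e))  # P(A = 1 | E), logistic in E
        a = 1 if rng.random() < p_bias else 0      # binary reporting bias A
        y_true = 2.0 * z + rng.gauss(0.0, 0.5)     # true outcome depends on Z only
        y_obs = y_true + alpha * a                 # additive bias of magnitude alpha
        # Proxies depend on Z but not on A.
        proxies = [(j + 1) * z + rng.gauss(0.0, 0.5) for j in range(m)]
        rows.append({"e": e, "z": z, "a": a, "y_true": y_true,
                     "y_obs": y_obs, "proxies": proxies})
    return rows


def mean(values):
    return sum(values) / len(values)


rows = simulate()

# Naive contrast of observed outcomes between biased and unbiased units.
# Because E drives both Z and A, this gap mixes alpha with a content difference.
naive_gap = (mean([r["y_obs"] for r in rows if r["a"] == 1])
             - mean([r["y_obs"] for r in rows if r["a"] == 0]))
print(f"alpha = -2.00, naive gap = {naive_gap:.2f}")
```

In this toy setup the naive gap does not recover $\alpha$, because the environment influences both the content and the bias. That confounding is the reason the bias has to be separated from the content, for which the proxies supply the bias-free signal.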

Paper Structure (49 sections, 1 theorem, 19 equations, 6 figures, 9 tables)

Key Result

Proposition 1

For any $(e,z)$ in the support of $(E,Z)$,

Figures (6)

  • Figure 1: Causal structure for proxy-guided measurement calibration. (a) Full generative model: environment variables $E$ influence latent content factors $Z$ and latent reporting bias $A$. Proxy measurements $\{Y_1,\ldots,Y_m\}$ depend only on $Z$, while the observed outcome $Y_{\mathrm{obs}}$ depends on both $Z$ and $A$. (b) Error model highlighting the measurement bias mechanism: the true outcome $Y_{\mathrm{true}}$ is perturbed by an environment-dependent binary bias $A$ with additive magnitude $\alpha$.
  • Figure 2: County-level mean absolute CATE estimates $|\widehat{\tau}_i|$ for 2023 across four hazard types. White indicates counties with no event; lighter colors represent higher estimated reporting bias.
  • Figure 3: Mean absolute CATE values grouped by hazard type. Wildfire events exhibit the highest systematic reporting distortion, followed by tornadic and flooding events.
  • Figure 4: Scatter plot of the recovered content latents $\hat{Z}$ versus the true latents $Z$ for a representative fold, after aligning dimensions using the closest permutation that maximizes permuted $R^2$.
  • Figure 5: Sanity check for the data distribution. Histograms of all key variables ($E_1$, $Z$, $A$, $Y_{\text{true}}$, $Y_{\text{obs}}$, and the proxies) for a given configuration, used to confirm that the data follow the intended generative model.
  • ...and 1 more figure
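
The dimension alignment mentioned in the Figure 4 caption can be sketched as a search over column permutations. This is a hedged illustration, not the authors' evaluation code: it assumes the per-dimension score is the $R^2$ of a simple linear fit (the squared Pearson correlation), that the score is averaged over dimensions, and that the latent dimension is small enough for brute-force enumeration. The names `r_squared` and `best_permutation` and the toy data are hypothetical.

```python
import itertools
import random


def r_squared(x, y):
    """R^2 of a simple linear fit of y on x (squared Pearson correlation).

    Assumes neither column is constant.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


def best_permutation(z_true, z_hat):
    """Match recovered columns to true columns by maximizing mean R^2.

    z_true and z_hat are lists of columns. Returns (perm, score) where
    z_hat[perm[k]] is the recovered column matched to z_true[k].
    """
    d = len(z_true)
    scores = [[r_squared(z_hat[j], z_true[k]) for j in range(d)]
              for k in range(d)]
    best = max(itertools.permutations(range(d)),
               key=lambda p: sum(scores[k][p[k]] for k in range(d)))
    return best, sum(scores[k][best[k]] for k in range(d)) / d


# Toy check: recovered latents are shuffled, rescaled, noisy copies of the truth.
rng = random.Random(0)
z_true = [[rng.gauss(0.0, 1.0) for _ in range(500)] for _ in range(3)]
shuffle = (2, 0, 1)  # z_hat[j] is built from z_true[shuffle[j]]
z_hat = [[1.7 * v + rng.gauss(0.0, 0.1) for v in z_true[shuffle[j]]]
         for j in range(3)]

perm, score = best_permutation(z_true, z_hat)
print(perm, round(score, 3))
```

Enumerating permutations costs $d!$ evaluations, so for larger latent dimensions an assignment solver such as SciPy's `linear_sum_assignment` applied to the negated score matrix would be a more practical choice.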

Theorems & Definitions (2)

  • Proposition 1: Identification
  • Proof: Sketch