Online Data Collection for Efficient Semiparametric Inference

Shantanu Gupta, Zachary C. Lipton, David Childers

TL;DR

This work presents two online data collection policies, Explore-then-Commit and Explore-then-Greedy, that use the parameter estimates at a given time to optimally allocate the remaining budget in the future steps, and proves that both policies achieve zero regret relative to an oracle policy.

Abstract

While many works have studied statistical data fusion, they typically assume that the various datasets are given in advance. However, in practice, estimation requires difficult data collection decisions like determining the available data sources, their costs, and how many samples to collect from each source. Moreover, this process is often sequential because the data collected at a given time can improve collection decisions in the future. In our setup, given access to multiple data sources and budget constraints, the agent must sequentially decide which data source to query to efficiently estimate a target parameter. We formalize this task using Online Moment Selection, a semiparametric framework that applies to any parameter identified by a set of moment conditions. Interestingly, the optimal budget allocation depends on the (unknown) true parameters. We present two online data collection policies, Explore-then-Commit and Explore-then-Greedy, that use the parameter estimates at a given time to optimally allocate the remaining budget in the future steps. We prove that both policies achieve zero regret (assessed by asymptotic MSE) relative to an oracle policy. We empirically validate our methods on both synthetic and real-world causal effect estimation tasks, demonstrating that the online data collection policies outperform their fixed counterparts.
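
The explore-then-commit idea described above can be sketched on a toy problem. This is not the paper's Online Moment Selection setup; it is a minimal analogue under assumptions made here for illustration: two sources with unit cost, a target equal to the sum of the two source means, and a plug-in allocation rule. In this toy case the variance-minimizing share of the budget for source 1 is $\sigma_1 / (\sigma_1 + \sigma_2)$, which depends on unknown parameters, mirroring the abstract's observation. The names `oracle_fraction`, `explore_then_commit`, and `explore_frac` are hypothetical.

```python
import numpy as np


def oracle_fraction(sigma1, sigma2):
    """Share of the budget for source 1 that minimizes
    sigma1**2 / k + sigma2**2 / (1 - k) over k in (0, 1)."""
    total = sigma1 + sigma2
    if total == 0:
        return 0.5  # both sources noiseless: any split works
    return sigma1 / total


def explore_then_commit(sample1, sample2, budget, explore_frac=0.2):
    """Toy explore-then-commit (illustrative assumption, not the paper's algorithm).

    sample1, sample2: zero-argument callables returning one draw each.
    budget: total number of draws allowed (unit cost per draw assumed).
    Returns (estimate of mean1 + mean2, draws from source 1, draws from source 2).
    """
    if budget < 4:
        raise ValueError("budget must be at least 4")
    # Explore: split an initial share of the budget evenly across sources.
    n_explore = min(budget, max(4, int(explore_frac * budget)))
    n1 = n_explore // 2
    n2 = n_explore - n1
    x1 = [sample1() for _ in range(n1)]
    x2 = [sample2() for _ in range(n2)]

    # Plug in the estimated standard deviations for the unknown ones.
    k_hat = oracle_fraction(np.std(x1, ddof=1), np.std(x2, ddof=1))

    # Commit: pick the total count for source 1 closest to the estimated
    # optimum, clipped because already collected samples cannot be returned.
    target1 = int(round(k_hat * budget))
    target1 = min(max(target1, n1), budget - n2)
    extra1 = target1 - n1
    extra2 = budget - n_explore - extra1
    x1 += [sample1() for _ in range(extra1)]
    x2 += [sample2() for _ in range(extra2)]

    return float(np.mean(x1) + np.mean(x2)), len(x1), len(x2)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    est, c1, c2 = explore_then_commit(
        lambda: rng.normal(1.0, 3.0),  # noisier source
        lambda: rng.normal(2.0, 1.0),
        budget=2000,
    )
    print(est, c1, c2)
```

In this sketch the noisier source ends up with the larger share of the budget. A greedy variant would re-estimate the allocation repeatedly rather than committing once after exploration.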

Paper Structure

This paper contains 54 sections, 34 theorems, 140 equations, 7 figures, 1 algorithm.

Key Result

Proposition 1

Suppose that (i) the standard GMM assumption (assump:standard-gmm) holds; (ii) for all $i \in [M]$, $\psi^{(i)}$ satisfies the ULLN property (property:ulln); (iii) for all $(i, j) \in [M]^2$, the product $\psi^{(i)} \psi^{(j)}$ satisfies the same property. Then $\widehat{\beta}_T \overset{p}{\to} \beta^*$.

Figures (7)

  • Figure 1: Examples of causal models (with treatment $X$ and outcome $Y$) where the ATE can be identified by different data sources returning different subsets of variables.
  • Figure 2: Algorithms for OMS-ETC and OMS-ETG.
  • Figure 3: Results for the two-sample IV LATE estimation task (Example 2) for a nonlinear causal model where MLPs are used for nuisance estimation (error bars denote $95\%$ CIs). In this case, significant bias is incurred due to nuisance estimation even at large horizons. The online data collection policies outperform the fixed policy in terms of regret and coverage.
  • Figure 4: Results on two real-world causal effect estimation tasks (error bars denote $95\%$ CIs). In both cases, we observe that the online data collection policies outperform the fixed policy (as the budget increases).
  • Figure 5: The $\epsilon$-greedy data collection policy.
  • ...and 2 more figures

Theorems & Definitions (74)

  • Example 1: Two-sample IV
  • Example 2: Two-sample LATE
  • Example 3
  • Proposition 1: Consistency
  • Definition 1: Selection simplex
  • Proposition 2: Asymptotic normality
  • Proposition 3: Asymptotic inference
  • Remark 1
  • Definition 2: Asymptotic regret
  • Proposition 4
  • ...and 64 more