Efficient Statistics With Unknown Truncation, Polynomial Time Algorithms, Beyond Gaussians

Jane H. Lee, Anay Mehrotra, Manolis Zampetakis

TL;DR

This work provides an algorithm with polynomial sample and time complexity that works for a set of exponential families (that contains multivariate Gaussians) when the unknown survival set is a halfspace or an axis-aligned rectangle.

Abstract

We study the estimation of distributional parameters when samples are shown only if they fall in some unknown set $S \subseteq \mathbb{R}^d$. Kontonis, Tzamos, and Zampetakis (FOCS'19) gave a $d^{\mathrm{poly}(1/\varepsilon)}$ time algorithm for finding $\varepsilon$-accurate parameters for the special case of Gaussian distributions with diagonal covariance matrix. Recently, Diakonikolas, Kane, Pittas, and Zarifis (COLT'24) showed that this exponential dependence on $1/\varepsilon$ is necessary even when $S$ belongs to some well-behaved classes. These works leave the following open problems, which we address in this work: Can we estimate the parameters of any Gaussian or even extend beyond Gaussians? Can we design $\mathrm{poly}(d/\varepsilon)$ time algorithms when $S$ is a simple set such as a halfspace? We make progress on both of these questions by providing the following results: 1. Toward the first question, we give a $d^{\mathrm{poly}(\ell/\varepsilon)}$ time algorithm for any exponential family that satisfies some structural assumptions and any unknown set $S$ that is $\varepsilon$-approximable by degree-$\ell$ polynomials. This result has two important applications: 1a) the first algorithm for estimating arbitrary Gaussian distributions from samples truncated to an unknown $S$; and 1b) the first algorithm for linear regression with unknown truncation and Gaussian features. 2. To address the second question, we provide an algorithm with runtime $\mathrm{poly}(d/\varepsilon)$ that works for a set of exponential families (containing all Gaussians) when $S$ is a halfspace or an axis-aligned rectangle. Along the way, we develop tools that may be of independent interest, including a reduction from PAC learning with positive and unlabeled samples to PAC learning with positive and negative samples that is robust to certain covariate shifts.
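
To make the sampling model in the abstract concrete, the following is a minimal sketch (not taken from the paper) of what "samples are shown only if they fall in $S$" means when $S$ is a halfspace. The helper `sample_truncated`, the dimension `d`, and the halfspace parameters `w` and `b` are hypothetical choices made for illustration. The sketch only shows why naive estimation fails on truncated data; it does not implement any of the paper's algorithms.

```python
import numpy as np

def sample_truncated(mean, cov, in_survival_set, n, rng, batch=4096):
    """Rejection-sample n points from N(mean, cov) conditioned on the survival set."""
    kept, total = [], 0
    while total < n:
        x = rng.multivariate_normal(mean, cov, size=batch)
        x = x[in_survival_set(x)]          # only samples inside S are ever observed
        kept.append(x)
        total += len(x)
    return np.concatenate(kept)[:n]

rng = np.random.default_rng(0)
d = 3
true_mean = np.zeros(d)
true_cov = np.eye(d)

# Hypothetical halfspace S = {x : <w, x> >= b}; a learner would not know w or b.
w = np.array([1.0, 0.0, 0.0])
b = 0.5

def in_halfspace(x):
    return x @ w >= b

samples = sample_truncated(true_mean, true_cov, in_halfspace, 20_000, rng)

# Naive estimators that ignore truncation are biased along the direction w.
naive_mean = samples.mean(axis=0)
naive_var = samples.var(axis=0)
print("naive mean:", naive_mean)
print("naive per-coordinate variance:", naive_var)
```

In this toy setup the empirical mean along `w` is pushed well above the true value of 0 and the variance along `w` shrinks below 1, while the coordinates orthogonal to `w` are essentially unaffected. Correcting this bias without knowing $S$ is the problem the abstract describes.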

Paper Structure (79 sections, 72 theorems, 289 equations, 5 figures, 2 tables, 8 algorithms)


Key Result

Theorem 3.4

Suppose \ref{['asmp:1:sufficientMass', 'asmp:1:polynomialStatistics', 'infasmp:cov', 'infasmp:int', 'infasmp:start', 'infasmp:proj']} hold. Fix any $\varepsilon,\delta\in (0,1/2)$. Fix a set $S$ satisfying $\euscr{E}(S\triangle S^\star; \theta^\star)\leq \alpha \varepsilon$. There is an algorithm that, given membership access to

Figures (5)

  • Figure 1: Illustration that $\theta^\star$ can be far from $\theta_{\rm PMLE}$ even though the gradient at $\theta^\star$ is small (\ref{['prop:intro:cor10cor11']}). To ensure $\theta^\star$ is close to $\theta_{\rm PMLE}$, we carefully select the domain $\Omega$. The blue line is the objective of Perturbed MLE $\mathscr{L}_S(\cdot)$.
  • Figure 2: Illustration of the properties of $\theta_0$ (see \ref{['lem:findingTheta0']}): $\theta_0$ is at most $\mathrm{poly}(1/\alpha)$ far from $\theta^\star$ (\ref{['fig:thetaZeroProperties:parameterDistance']}), $\euscr{E}(\theta_0)$ can have a constant TV distance with $\euscr{E}(\theta^\star)$ (\ref{['fig:thetaZeroProperties:distribution']}), and, yet, for any set $T$ (e.g., $T=[-1,0]$), $\euscr{E}(T;\theta_0)$ is lower bounded by $\euscr{E}(T;\theta^\star)^{\mathrm{poly}({1/\alpha})}$ (\ref{['fig:thetaZeroProperties:mass']}).
  • Figure 3: Illustration of the marginal distributions constructed in the reduction in \ref{['thm:efficientLearning:denisReduction']} with $\euscr{Q}=\euscr{E}(\theta^\star)$, $P=S^\star=[0,1]$, $\euscr{P}=\euscr{E}(\theta^\star, S^\star)$, and $\euscr{U}=\euscr{E}(\theta_0)$. Here, $\euscr{E}(\cdot)$ is the family of normal distributions and $\theta_0$ is the parameter promised in \ref{['lem:findingTheta0']}. Note that $\left(\euscr{D}_{\rho,\euscr{P},\euscr{U}}\right)$ places mass outside of $S^\star$ unlike $\euscr{E}(\theta^\star, S^\star)$.
  • Figure 4: An illustration of the $\chi^2$-Bridge promised to exist in \ref{['infasmp:bridge']} (or its formal version \ref{['asmp:bridge']}). This bridge always exists for Gaussians and product exponential distributions after the pre-processing described in \ref{['sec:preprocess']}. In this example, the two distributions $\euscr{E}(\theta_1)$ and $\euscr{E}(\theta_2)$ are far from each other in $\chi^2$-divergence ($\chi^2\left(\euscr{E}(\theta_1)\|\euscr{E}(\theta_2)\right), \chi^2\left(\euscr{E}(\theta_2)\|\euscr{E}(\theta_1)\right)\geq 10^{86}$) and the bridge distribution $\euscr{E}(\theta)$ (denoted by the black line) is close to both distributions in $\chi^2$-divergence ($\chi^2\left(\euscr{E}(\theta_1)\|\euscr{E}(\theta)\right), \chi^2\left(\euscr{E}(\theta_2)\|\euscr{E}(\theta)\right) \leq 10^2$).
  • Figure 5: A construction appearing in the proof of \ref{['lem:halfspaceLearner:symDiff']}. This is used to bound $\left\lVert v \right\rVert_2$, where $v$ is the intersection of $S$ and $S^\star_{t}$ in ${\rm span}(w,w^\star)$, in terms of $\left\lVert x_i-x_j \right\rVert_2$, $\left\lVert x_i \right\rVert_2$, and $\left\lVert x_j \right\rVert_2$.
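
The $\chi^2$-bridge phenomenon in the Figure 4 caption can be checked numerically for one-dimensional Gaussians. This is a minimal sketch under stated assumptions: it uses the standard closed form for the $\chi^2$-divergence between two univariate normals, and the means, variances, and the wide-Gaussian choice of bridge are illustrative values rather than the parameters behind Figure 4 or the paper's construction. The function name `log_chi2_plus_one` is hypothetical.

```python
import math

def log_chi2_plus_one(mu_p, var_p, mu_q, var_q):
    """Return log(1 + chi^2(P || Q)) for P = N(mu_p, var_p) and Q = N(mu_q, var_q).

    Closed form: 1 + chi^2 = var_q / (sqrt(var_p) * sqrt(2*var_q - var_p))
                             * exp((mu_p - mu_q)^2 / (2*var_q - var_p)).
    The divergence is infinite when 2*var_q <= var_p.
    """
    gap = 2.0 * var_q - var_p
    if gap <= 0:
        return math.inf
    return (math.log(var_q) - 0.5 * math.log(var_p) - 0.5 * math.log(gap)
            + (mu_p - mu_q) ** 2 / gap)

# Two unit-variance Gaussians with far-apart means (illustrative values).
mu1, mu2, var = -10.0, 10.0, 1.0
# Assumed bridge: a Gaussian centered between them with a much larger variance.
mu_bridge, var_bridge = 0.0, 100.0

direct = log_chi2_plus_one(mu1, var, mu2, var)
to_bridge_1 = log_chi2_plus_one(mu1, var, mu_bridge, var_bridge)
to_bridge_2 = log_chi2_plus_one(mu2, var, mu_bridge, var_bridge)

print("log(1 + chi^2) between the two distributions:", direct)
print("log(1 + chi^2) from each distribution to the bridge:", to_bridge_1, to_bridge_2)
```

Working in log space avoids overflow when the direct divergence is astronomically large. With these values the direct divergence grows like $e^{(\mu_1-\mu_2)^2}$, while the divergence from either distribution to the wide bridge stays a small constant.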

Theorems & Definitions (121)

  • Definition 1: Canonical Exponential Family
  • Definition 2: Polynomial Approximability
  • Remark 3.1
  • Remark 3.2: Boosting Success Probability
  • Remark 3.3: Satisfying \ref{['infasmp:2']}
  • Definition 3: Perturbed MLE
  • Theorem 3.4: Learning $\theta^\star$ Given $S\approx S^\star$; see \ref{['thm:module:reduction']}
  • Lemma 3.5: Norm of Gradient at $\theta^\star$; see \ref{['prop:intro:cor10cor11']}
  • Lemma 3.6: Ensuring Strong Convexity; see \ref{['lem:findingTheta0Formal', 'lem:strongConvexityOnOmega']}
  • Lemma 3.7: See \ref{['thm:module:unlabeledSamples:measureGuarantees']}
  • ...and 111 more