Winners with Confidence: Discrete Argmin Inference with an Application to Model Selection

Tianyu Zhang; Hao Lee; Jing Lei

TL;DR

The paper develops a test statistic for discrete argmin inference that is asymptotically normal even in high-dimensional settings and when the population mean vector has many ties, by integrating concepts and tools from cross-validation and differential privacy.

Abstract

We study the problem of finding the index of the minimum value of a vector from noisy observations. This problem is relevant in population/policy comparison, discrete maximum likelihood, and model selection. We develop an asymptotically normal test statistic, even in high-dimensional settings and with potentially many ties in the population mean vector, by integrating concepts and tools from cross-validation and differential privacy. The key technical ingredient is a central limit theorem for globally dependent data. We also propose practical ways to select the tuning parameter that adapts to the signal landscape. Numerical experiments and data examples demonstrate the ability of the proposed method to achieve a favorable bias-variance trade-off in practical scenarios.

Paper Structure (53 sections, 29 theorems, 337 equations, 14 figures, 1 algorithm)

Introduction
Related work.
Notation
Methods
Reduction to a Selective Mean-testing Problem
Initial Fix: Removing Dependence by Cross-validation
Final Fix: Cross-validated Exponential Mechanism
Asymptotic Normality and Coverage
Variance Estimation
Key Ingredients to Theorem \ref{th: random center}
Bias Analysis and Power Guarantees
Bias Analysis
Power Guarantees
Data-driven Selection of the Weighting Parameter
Iterative Data-driven Selection
...and 38 more sections

Key Result

Theorem 3.1

Let $X_i \in \mathbb{R}^p, i \in [n]$, be IID samples with uniformly bounded entries: $\sup_{s\in [p]} \left | X_{i,s} \right | \leq M$ almost surely for a constant $M$. The dimension $p$ can depend on $n$ so long as the assumptions below are satisfied. We further assume Define the centered version of $T_r$: where $\sigma_r^2 = \operatorname{Var}\left[X_{1, r}-Q_{1, r} \right]$, and Then for an
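The theorem refers to $T_r$ and $Q_{1,r}$, whose definitions are not reproduced in this listing. As a rough illustration only, the following Python sketch shows one way a leave-one-out, exponentially weighted statistic of this kind could be computed. Several details are assumptions rather than the authors' stated method: the weights are taken proportional to $\exp(-\lambda \hat{\mu}_s^{(-i)})$ over $s \neq r$, the variance is estimated by the plain sample variance of $X_{i,r} - Q_{i,r}$, the rejection rule is one-sided, and the function names and the fixed value of `lam` are hypothetical.

```python
import numpy as np
from statistics import NormalDist


def loo_argmin_statistic(X, r, lam):
    """Normalized leave-one-out statistic for index r (illustrative sketch).

    Assumed form: T_r = sqrt(n) * mean(D) / sd(D), with D_i = X[i, r] - Q[i, r]
    and Q[i, r] an exponentially weighted average of the other coordinates of
    X[i], the weights being computed from the sample means that exclude row i.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    # Leave-one-out column means: row i holds the means computed without X[i].
    loo_means = (X.sum(axis=0, keepdims=True) - X) / (n - 1)
    others = np.delete(np.arange(p), r)
    # Softmin weights over competitors s != r (assumed weighting scheme).
    scores = -lam * loo_means[:, others]
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=1, keepdims=True)
    Q = (w * X[:, others]).sum(axis=1)
    D = X[:, r] - Q
    return np.sqrt(n) * D.mean() / D.std(ddof=1)


def argmin_confidence_set(X, lam, alpha=0.05):
    """Keep every index r whose statistic does not exceed the normal quantile."""
    z = NormalDist().inv_cdf(1 - alpha)
    p = np.asarray(X).shape[1]
    return [r for r in range(p) if loo_argmin_statistic(X, r, lam) <= z]


# Toy data with bounded entries: coordinates 0 and 1 tie for the minimum mean.
rng = np.random.default_rng(0)
mu = np.array([0.0, 0.0, 1.0])
X = mu + rng.uniform(-1.0, 1.0, size=(500, 3))
lam = 5.0  # hypothetical fixed value; the paper selects it from the data
cs = argmin_confidence_set(X, lam)
print(cs)
```

In this toy setting the clearly sub-optimal coordinate 2 is excluded, while the two tied coordinates are each retained with high probability. The choice of `lam` governs how strongly the weights concentrate on the smallest leave-one-out means.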

Figures (14)

Figure 1: Sample splitting and exponential weighting are both crucial for normality. Smoothed histograms of the normalized $T_r$ in Algorithm \ref{algorithm: exp weighting} and its related variants. We take $r = 1$. weighted+split is the normalized $T_r$ presented in Algorithm \ref{algorithm: exp weighting}; split is described in Section \ref{section: simple split}, and weighted is the non-split version of weighted+split, discussed in Remark \ref{remark: softmin nosplit}. The solid black line is the density curve of the standard normal. A LOO ($V = n$) sample-splitting scheme is employed in split and weighted+split.
Figure 2: The second-order stability term vanishes at the rate predicted by our theoretical analysis. The violin plots illustrate the distribution of $\log_{10} \left(\nabla_l \nabla_j K_i\right)^2$ stratified by sample size. The points are the estimated $\log_{10}\mathbb{E}[\left(\nabla_l \nabla_j K_i\right)^2]$ over $10^3$ simulation repeats.
Figure 3: Method comparison, "increasing" (top) and "3-tier" (bottom) landscapes. Comparison between the proposed LOO method and three other methods. Each cell in the heatmaps corresponds to a different simulation setting. The x-axis varies the dependence strength $\varrho$, and the y-axis varies the signal strength $f$. The color (and shape size) in each cell illustrates the difference in the average number of false negatives between the proposed LOO method and one competing method from the literature. A more negative value indicates a greater advantage of the proposed method over its competitor in rejecting sub-optimal dimensions.
Figure 4: Sensitivity analysis of the weighting parameter $\lambda$ in terms of average coverage $\overline{\nu}$ and average power $\overline{\kappa}$. Here $q$ is the distortion exponent in $\lambda = 2^q \hat{\lambda}$. Notably, $\hat{\lambda}$ may take different values across the settings. For each configuration $(\mu_b, f, \varrho, \lambda)$, we perform $100$ simulation repetitions with a sample size of $1000$. The curves corresponding to the setting $(\text{mean factor}, \text{dependence strength}) = (0, \varrho)$, for $\varrho \in \{0, 0.4, 0.8\}$, are omitted from the top-right plot as they coincide with the flat mean cases illustrated in the top-left plot.
Figure 5: Average exclusion percentage, LASSO model selection. The numerical experiments are conducted over two different test sample sizes, $n = 40$ (left) and $n = 160$ (right). The gray dotted curve represents the true population risks of the $\beta_r$'s, with the risk values shown on the right y-axis. The proposed LOO method is compared with Bonferroni correction (BC) and the rank inference approach of Mogstad et al. (2024) (RI). Each solid curve documents the proportion of times each of the $100$ models, each corresponding to an $\eta_r$ parameter, is excluded from the confidence sets. The exclusion percentage is calculated over $10^3$ repeats.
...and 9 more figures

Theorems & Definitions (71)

Remark 2.1
Remark 2.2
Theorem 3.1
Remark 3.2
Remark 3.3
Remark 3.4
Corollary 3.5
Remark 3.6
Theorem 3.7
Definition 3.8
...and 61 more
