Optimal estimation of the null distribution in large-scale inference

Subhodh Kotekal; Chao Gao

TL;DR

This work studies estimation of the null distribution parameters $(\theta,\sigma^2)$ in a Gaussian two-groups model with at most $k$ nonnull effects, covering the regime where $k$ is a substantial fraction of $n$. Using estimators based on the empirical characteristic function, which exploit the Gaussian structure of the data, the authors derive sharp minimax rates for both location and scale estimation. Consistent estimation of $\theta$ is possible if and only if $n-2k=\omega(\sqrt{n})$, while $\sigma^2$ can be consistently estimated for all $k< n/2$. They provide matching upper and lower bounds, achieve rates faster than those in Huber's contamination model in several regimes, and develop adaptive (Lepski-type) procedures that attain minimax optimality without knowledge of the sparsity level $k$, with extensions to non-Gaussian noise and total-variation distances. The results characterize when robust null-distribution estimation is possible in large-scale inference and yield computationally practical methods for empirical null estimation in dense settings, with direct implications for genome-wide studies and other high-throughput applications where a large proportion of signals may be present.
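The consistency condition for $\theta$ is written here in terms of $n-2k$, whereas the abstract states it in terms of $\frac{n}{2}-k$. The two forms are equivalent because the quantities differ only by a factor of 2, as this one-line check using only the symbols already in the text shows:

```latex
n - 2k = 2\left(\frac{n}{2} - k\right)
\quad\Longrightarrow\quad
\Bigl( n - 2k = \omega(\sqrt{n}) \iff \frac{n}{2} - k = \omega(\sqrt{n}) \Bigr).
```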

Abstract

The advent of large-scale inference has spurred reexamination of conventional statistical thinking. In a Gaussian model for $n$ many $z$-scores with at most $k < \frac{n}{2}$ nonnulls, Efron suggests estimating the location and scale parameters of the null distribution. Placing no assumptions on the nonnull effects, the statistical task can be viewed as a robust estimation problem. However, the best known robust estimators fail to be consistent in the regime $k \asymp n$, which is especially relevant in large-scale inference. The failure of estimators which are minimax rate-optimal with respect to other formulations of robustness (e.g. Huber's contamination model) might suggest the impossibility of consistent estimation in this regime and, consequently, a major weakness of Efron's suggestion. A sound evaluation of Efron's model thus requires a complete understanding of consistency. We sharply characterize the regime of $k$ for which consistent estimation is possible and further establish the minimax estimation rates. It is shown that consistent estimation of the location parameter is possible if and only if $\frac{n}{2} - k = \omega(\sqrt{n})$, and consistent estimation of the scale parameter is possible in the entire regime $k < \frac{n}{2}$. Faster rates than those in Huber's contamination model are achievable by exploiting the Gaussian character of the data. The minimax upper bound is obtained by considering estimators based on the empirical characteristic function. The minimax lower bound involves constructing two marginal distributions whose characteristic functions match on a wide interval containing zero. The construction notably differs from those in the literature by sharply capturing a scaling of $n-2k$ in the minimax estimation rate of the location.
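To make the setting concrete, the following is a minimal simulation sketch, not the authors' estimator. It assumes one particular reading of the model in which $n-k$ scores are drawn from $N(\theta,\sigma^2)$ and the $k$ nonnulls are all shifted by a common amount. The helper names `simulate_z_scores` and `empirical_cf` and the value `shift=10` are hypothetical choices made for illustration. The sketch shows how the sample median, a standard robust estimator, stays biased when $k$ is proportional to $n$, and computes the empirical characteristic function that the abstract names as the basis of the upper bound.

```python
from statistics import NormalDist

import numpy as np


def simulate_z_scores(n, k, theta=0.0, sigma=1.0, shift=10.0, seed=0):
    """Draw n z-scores: n - k nulls from N(theta, sigma^2) and k nonnulls.

    All nonnull means are set to theta + shift. This is an arbitrary
    illustrative choice; the model places no assumptions on them.
    """
    rng = np.random.default_rng(seed)
    means = np.full(n, theta, dtype=float)
    means[:k] += shift  # hypothetical one-sided nonnull effects
    return means + sigma * rng.standard_normal(n)


def empirical_cf(x, t):
    """Empirical characteristic function: average of exp(i * t * x_j)."""
    t = np.atleast_1d(np.asarray(t, dtype=float))
    return np.exp(1j * np.outer(t, x)).mean(axis=1)


n, k = 20000, 6000  # k / n = 0.3, so k is proportional to n
theta, sigma = 0.0, 1.0
x = simulate_z_scores(n, k, theta=theta, sigma=sigma)

# With k far-right nonnulls, the sample median sits near the
# 1 / (2 * (1 - k / n)) quantile of the null, not at theta.
median_bias = np.median(x) - theta
predicted_bias = sigma * NormalDist().inv_cdf(1.0 / (2.0 * (1.0 - k / n)))
print("median bias:", median_bias, "predicted:", predicted_bias)

# For a pure-null sample, |ECF(t)| should be close to exp(-sigma^2 t^2 / 2).
x_null = simulate_z_scores(n, 0, theta=theta, sigma=sigma, seed=1)
print("|ECF(1)| on nulls:", abs(empirical_cf(x_null, 1.0)[0]))
```

In this configuration the median bias does not shrink as $n$ grows with $k/n$ held fixed, which illustrates the inconsistency of a standard robust estimator in the regime $k \asymp n$. The Gaussian decay of the characteristic function's modulus is the kind of structure that a Fourier-based estimator can exploit.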

Paper Structure (30 sections, 36 theorems, 212 equations)

Introduction
Robust statistics
Large-scale inference
Main contribution
Notation
A Fourier-based estimator
Methodology: unknown variance
A pilot variance estimator
A rate-optimal variance estimator
A variance-adaptive location estimator
Lower bounds
Location estimation
Variance estimation
Discussion
Inconsistent, yet rate-optimal estimation for $n-2k \le \sqrt{n}$
...and 15 more sections

Key Result

Lemma 2.1

If $k < \frac{n}{2}$, then

Theorems & Definitions (60)

Remark 1: The effect of the data's Gaussian character
Remark 2: Comparison to a one-sided version
Lemma 2.1
proof
Theorem 2.2
Remark 3
Remark 4: Computation
Proposition 3.1: kotekal_sparsity_2023
Theorem 3.2
Proposition 3.3
...and 50 more
