Optimal Inference After Model Selection
William Fithian, Dennis Sun, Jonathan Taylor
TL;DR
This work formalizes inference after adaptive model selection by conditioning on the selection event and controlling the selective type I error, which restores valid long-run frequency properties among the hypotheses that are actually selected. It connects selective inference to Lehmann–Scheffé optimality theory in exponential families and develops powerful selective tests and confidence intervals, including new selective z- and t-tests for linear regression and data-carving strategies that outperform data splitting. It provides computational tools based on Monte Carlo sampling and extends the framework to non-Gaussian settings (clinical trials, Poisson scans, GLMs), with simulations that illustrate the tradeoff between selection and inference. The discussion clarifies conceptual issues around randomness and interpretation, and highlights how conditioning-based inference scales to discipline-wide multiple-inference tasks.
Abstract
To perform inference after model selection, we propose controlling the selective type I error; i.e., the error rate of a test given that it was performed. By doing so, we recover long-run frequency properties among selected hypotheses analogous to those that apply in the classical (non-adaptive) context. Our proposal is closely related to data splitting and has a similar intuitive justification, but is more powerful. Exploiting the classical theory of Lehmann and Scheffé (1955), we derive most powerful unbiased selective tests and confidence intervals for inference in exponential family models after arbitrary selection procedures. For linear regression, we derive new selective z-tests that generalize recent proposals for inference after model selection and improve on their power, and new selective t-tests that do not require knowledge of the error variance.
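As an illustration of what controlling the selective type I error means, the following minimal sketch works through a one-dimensional toy case. It is an assumption made for illustration, not the paper's general procedure: a single observation Y ~ N(mu, 1) is tested against H0: mu = 0 only when Y exceeds a selection threshold of 1. The function names, the threshold, and the simulation size are hypothetical choices. The sketch compares the usual one-sided 5% cutoff with a cutoff calibrated to the null distribution of Y conditional on selection.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # null distribution of Y when mu = 0


def selective_cutoff(threshold: float, alpha: float = 0.05) -> float:
    """Return t such that P(Y > t | Y > threshold) = alpha under Y ~ N(0, 1)."""
    tail = 1.0 - STD_NORMAL.cdf(threshold)  # P(Y > threshold) under H0
    return STD_NORMAL.inv_cdf(1.0 - alpha * tail)


def conditional_rejection_rate(cutoff: float, threshold: float,
                               n_draws: int = 200_000, seed: int = 0) -> float:
    """Monte Carlo estimate of P(Y > cutoff | Y > threshold) when mu = 0."""
    rng = random.Random(seed)
    selected = 0
    rejected = 0
    for _ in range(n_draws):
        y = rng.gauss(0.0, 1.0)
        if y > threshold:          # the test is performed only after selection
            selected += 1
            if y > cutoff:
                rejected += 1
    return rejected / selected


threshold = 1.0                                   # hypothetical selection rule: Y > 1
naive_cutoff = STD_NORMAL.inv_cdf(0.95)           # ignores selection
adjusted_cutoff = selective_cutoff(threshold)     # conditions on selection

naive_rate = conditional_rejection_rate(naive_cutoff, threshold)
adjusted_rate = conditional_rejection_rate(adjusted_cutoff, threshold)

print(f"naive cutoff     {naive_cutoff:.3f}: error rate given selection {naive_rate:.3f}")
print(f"selective cutoff {adjusted_cutoff:.3f}: error rate given selection {adjusted_rate:.3f}")
```

Under these assumptions, the naive cutoff has a conditional error rate of 0.05 / P(Y > 1), or about 0.32, among the tests that are actually performed, while the conditionally calibrated cutoff (about 2.41) holds that rate at the nominal 0.05.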
