Understanding Best Subset Selection: A Tale of Two C(omplex)ities
Saptarshi Roy, Ambuj Tewari, Ziwei Zhu
TL;DR
This work analyzes best subset selection (BSS) in high-dimensional sparse regression through an identifiability margin, $\tau_*(s)$, and two geometry-driven complexity measures that capture the spaces of residualized signals and spurious projections. The key result is a sharp sufficient condition: if $\tau_*(s)$, scaled by the noise level, dominates the larger of the two complexities (up to log factors), BSS recovers the true active set $\mathcal{S}$ with high probability. A complementary necessary condition lower-bounds the margin required for consistency in terms of the same complexities. The framework clarifies how correlation structure shapes model discrimination and explains why some correlated designs can be more favorable to BSS than orthogonal designs. The authors also extend the analysis to GLMs, offering a principled way to assess model selection under broader link functions, and provide simulations illustrating the theory. Overall, the paper shows that the geometric complexities of residualized signals and spurious projections govern the margin conditions for exact model recovery in BSS, which informs design considerations and future method development.
Abstract
We consider the problem of best subset selection (BSS) under the high-dimensional sparse linear regression model. Recently, Guo et al. (2020) showed that the model selection performance of BSS depends on a certain identifiability margin, a measure that captures the model discriminative power of BSS under a general correlation structure and that is robust to the design dependence, unlike its computational surrogates such as LASSO, SCAD, MCP, etc. Expanding on this, we further broaden the theoretical understanding of best subset selection in this paper and show that two complexities also play fundamental roles in characterizing the margin condition for model consistency of BSS: the complexity of the residualized signals (the portion of the signals orthogonal to the true active features) and the complexity of the spurious projections (the projection operators associated with the irrelevant features). In particular, we establish both necessary and sufficient margin conditions depending only on the identifiability margin and the two complexity measures. We also partially extend our sufficiency result to the case of high-dimensional sparse generalized linear models (GLMs).
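
For concreteness, BSS at sparsity level $s$ selects the size-$s$ subset of features whose least-squares fit has the smallest residual sum of squares. The following is a minimal brute-force sketch in Python, not the authors' code: the function name `best_subset`, the simulated design, and the signal and noise levels are illustrative assumptions, and the exhaustive search is only practical for small numbers of features.

```python
from itertools import combinations

import numpy as np


def best_subset(X, y, s):
    """Return the size-s column subset with the smallest residual sum of squares."""
    best_rss, best_set = np.inf, None
    for subset in combinations(range(X.shape[1]), s):
        Xs = X[:, list(subset)]
        # Ordinary least-squares fit restricted to the candidate columns.
        coef, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        rss = float(np.sum((y - Xs @ coef) ** 2))
        if rss < best_rss:
            best_rss, best_set = rss, subset
    return best_set


# Hypothetical toy data: 8 features, of which columns 1 and 4 are active.
rng = np.random.default_rng(0)
n, p, s = 100, 8, 2
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[[1, 4]] = 3.0
y = X @ beta + 0.5 * rng.standard_normal(n)

selected = best_subset(X, y, s)
print(selected)
```

In this toy setting the signal is strong relative to the noise, so the search should return the true active columns. The margin conditions studied in the paper concern when such exact recovery holds under general correlation among the features.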
