A Statistical View of Column Subset Selection

Anav Sood, Trevor Hastie

TL;DR

This work unifies Column Subset Selection (CSS) and Principal Variables by showing that the two are exactly equivalent when viewed through the covariance matrix, and by embedding both in a semi-parametric generative model (PCSS) in which the CSS solution is a maximum likelihood estimate. It establishes conditions under which the CSS estimate is consistent in high dimensions, and develops scalable algorithms that perform CSS using only a covariance matrix or other summary statistics, including when data are missing or censored. The authors also introduce a subset-size selection procedure rooted in a likelihood-ratio framework, demonstrate the approach on real data (e.g., BlackRock diversification, ozone detection, a Big Five survey), and provide a Python package, pycss, for practitioners. Overall, the paper delivers a theoretically grounded, computationally efficient framework for interpretable dimensionality reduction via variable selection, with broad applicability in high-dimensional settings.

Abstract

We consider the problem of selecting a small subset of representative variables from a large dataset. In the computer science literature, this dimensionality reduction problem is typically formalized as Column Subset Selection (CSS). Meanwhile, the typical statistical formalization is to find an information-maximizing set of Principal Variables. This paper shows that these two approaches are equivalent, and moreover, both can be viewed as maximum likelihood estimation within a certain semi-parametric model. Within this model, we establish suitable conditions under which the CSS estimate is consistent in high dimensions, specifically in the proportional asymptotic regime where the number of variables over the sample size converges to a constant. Using these connections, we show how to efficiently (1) perform CSS using only summary statistics from the original dataset; (2) perform CSS in the presence of missing and/or censored data; and (3) select the subset size for CSS in a hypothesis testing framework.

Paper Structure (77 sections, 23 theorems, 186 equations, 8 figures, 4 tables, 2 algorithms)

Key Result

Proposition 2.1

Consider a matrix $\boldsymbol{X} \in \mathbb{R}^{n \times p}$ and define $\hat{\boldsymbol{\Sigma}} = \boldsymbol{X}^\top\boldsymbol{X}/n$. For any size-$k$ subset $S \subseteq [p]$, the CSS objective (eq:css) with $\boldsymbol{X}$ and the Principal Variables objective (eq:pv) with $\hat{\boldsymbol{\Sigma}}$ ...
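
The relationship in Proposition 2.1 can be checked numerically. The sketch below is illustrative only: the formulas referred to as eq:css and eq:pv are not reproduced on this page, so it assumes the common formulations, namely the squared Frobenius error from projecting $\boldsymbol{X}$ onto the columns in $S$ for CSS, and the trace of the residual covariance of the unselected variables given those in $S$ for Principal Variables. The function names `css_objective` and `pv_objective` are hypothetical and are not taken from pycss.

```python
import numpy as np

def css_objective(X, S):
    """Assumed CSS objective: squared Frobenius error after regressing X on X[:, S]."""
    XS = X[:, S]
    # Least-squares coefficients for projecting every column of X onto span(X_S).
    coef, *_ = np.linalg.lstsq(XS, X, rcond=None)
    return np.linalg.norm(X - XS @ coef, "fro") ** 2

def pv_objective(Sigma, S):
    """Assumed Principal Variables objective: trace of the residual covariance given S."""
    p = Sigma.shape[0]
    rest = np.setdiff1d(np.arange(p), S)
    # Schur complement of the S block: covariance of the rest after regressing on S.
    resid = Sigma[np.ix_(rest, rest)] - Sigma[np.ix_(rest, S)] @ np.linalg.solve(
        Sigma[np.ix_(S, S)], Sigma[np.ix_(S, rest)]
    )
    return np.trace(resid)

# Synthetic data purely for illustration.
rng = np.random.default_rng(0)
n, p = 200, 6
X = rng.standard_normal((n, p)) @ rng.standard_normal((p, p))
Sigma_hat = X.T @ X / n  # same definition as in the proposition

S = [0, 3]
css_val = css_objective(X, S)
pv_val = pv_objective(Sigma_hat, S)
```

Under these assumed definitions the two values differ only by the factor $n$, so any subset that minimizes one also minimizes the other. This is why the covariance matrix alone is enough to rank subsets.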

Figures (8)

  • Figure 1: For the BlackRock example discussed in \ref{sec:blackrock}, the average $R^2$ from the regression of each variable on the selected subset (resp. principal components) versus increasing subset size (resp. number of principal components).
  • Figure 2: For different amounts of missingness $q$ and subset sizes $k$, the selected subset's CSS objective value on the fully observed Ozone Level Detection data over the hundred trials described in \ref{sec:missing_real}. The red line marks the CSS objective value attained by the subset that was selected using the fully observed data.
  • Figure 3: For one thousand trials, the size of the selected subset $\hat{S}$ (left), the size of the intersection of $\hat{S}$ and the population subset $S$ (middle), and the population sum of squared canonical correlations between the variables in $\hat{S}$ and $S$ (right) for the different simulation settings described in \ref{sec:model_selection_sim}. Color indicates the average population $R^2$ from the regression of the variables not in $S$ on those in $S$.
  • Figure 4: For the five personality sub-surveys, we give Cronbach's coefficient $\alpha$ with 95% bootstrap CIs ($B=1000$ bootstrap samples) for our survey and the Norwegian Engvik, Brazilian Gouveia, and Serbian Tucakovic 20-question surveys. The x-axis gives our number of selected questions per sub-survey. All other surveys chose four questions per sub-survey.
  • Figure 5: Results for the case of Gaussian unique factors. For a description of the plots, see the caption of \ref{fig:model_selec_sim}.
  • ...and 3 more figures

Theorems & Definitions (27)

  • Proposition 2.1: CSS and Principal Variables are equivalent
  • Proposition 2.2: The CSS estimand
  • Theorem 1: CSS solution is an MLE
  • Theorem 2: High-dimensional consistency of CSS
  • Theorem 3: The subset factor model compromise
  • Lemma 4.1: Efficient residual covariance update
  • Theorem 4: Error control
  • Lemma B.1
  • Lemma B.2
  • Lemma B.3
  • ...and 17 more
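
Lemma 4.1 in the list above is titled "Efficient residual covariance update", but its statement is not reproduced on this page. As a rough illustration of what such an update can look like, the sketch below uses the standard rank-one Schur-complement identity to condition a residual covariance matrix on one newly selected variable, inside a simple greedy forward search. Both the update formula and the greedy strategy are assumptions made for illustration; the source does not say which search the paper's two algorithms use, and `greedy_css_from_cov` is a hypothetical name, not part of pycss.

```python
import numpy as np

def greedy_css_from_cov(Sigma, k, tol=1e-12):
    """Greedily select up to k variables using only a covariance matrix (illustrative)."""
    p = Sigma.shape[0]
    R = np.array(Sigma, dtype=float)  # residual covariance given the selected set
    available = np.ones(p, dtype=bool)
    selected = []
    for _ in range(k):
        diag = np.diag(R)
        usable = available & (diag > tol)  # skip variables that are already explained
        if not usable.any():
            break
        # Adding variable j reduces tr(R) by ||R[:, j]||^2 / R[j, j].
        gain = np.full(p, -np.inf)
        gain[usable] = (R[:, usable] ** 2).sum(axis=0) / diag[usable]
        j = int(np.argmax(gain))
        # Rank-one update: residual covariance after also conditioning on variable j.
        R = R - np.outer(R[:, j], R[j, :]) / R[j, j]
        selected.append(j)
        available[j] = False
    return selected, R

# Synthetic covariance purely for illustration.
rng = np.random.default_rng(1)
p, k = 8, 3
A = rng.standard_normal((50, p))
Sigma = A.T @ A / 50
selected, R = greedy_css_from_cov(Sigma, k)
```

Each step costs $O(p^2)$ and never touches the raw data, which is the kind of saving that makes covariance-only CSS practical. A greedy search of this form is not guaranteed to find the best size-$k$ subset.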