On the Limitation of Kernel Dependence Maximization for Feature Selection
Keli Liu, Feng Ruan
TL;DR
This work shows that maximizing the Hilbert-Schmidt Independence Criterion (HSIC) over feature subsets can fail to recover all variables needed to explain the response, even at the population level. Using a population distribution $\mathbb{P}_\Delta$ with two binary features, the authors demonstrate that HSIC can favor a subset $S$ that omits a functionally important feature, so that $\mathbb{E}[Y|X] \neq \mathbb{E}[Y|X_S]$ and $\mathcal{L}(Y|X) \neq \mathcal{L}(Y|X_S)$; that is, the selected features reproduce neither the conditional mean nor the conditional distribution of $Y$ given all features. They treat both a discrete and a continuous-weight formulation of the problem, prove the counterexamples for $p=2$, and extend them to general $p \ge 2$ via a padding argument. The results highlight a fundamental tradeoff: the convenience and scalability of dependence-maximization methods such as HSIC come at the cost of potentially missing variables essential for full explainability, whereas conditional-dependence minimization approaches guarantee full explainability under nonparametric assumptions. The implications extend to related kernel-based methods and to MMD-based feature selection for binary labels, and they motivate caution and further methodological development.
Abstract
A simple and intuitive method for feature selection consists of choosing the feature subset that maximizes a nonparametric measure of dependence between the response and the features. A popular proposal from the literature uses the Hilbert-Schmidt Independence Criterion (HSIC) as the nonparametric dependence measure. The rationale behind this approach to feature selection is that important features will exhibit a high dependence with the response and their inclusion in the set of selected features will increase the HSIC. Through counterexamples, we demonstrate that this rationale is flawed and that feature selection via HSIC maximization can miss critical features.
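To make the selection rule described above concrete, the following is a minimal sketch of HSIC-based subset selection on a finite sample. It is an illustration under stated assumptions, not the authors' implementation: the Gaussian kernel, the fixed bandwidth, the biased (V-statistic) estimator $\frac{1}{n^2}\operatorname{tr}(KHLH)$, and the exhaustive search over subsets are all choices made here, and the function names `gaussian_gram`, `empirical_hsic`, and `hsic_best_subset` are hypothetical. The toy data at the end only exercises the procedure; it is not the counterexample distribution $\mathbb{P}_\Delta$.

```python
from itertools import combinations

import numpy as np


def gaussian_gram(Z, bandwidth=1.0):
    """Gram matrix of the Gaussian kernel exp(-||z - z'||^2 / (2 * bandwidth^2))."""
    Z = np.asarray(Z, dtype=float)
    if Z.ndim == 1:
        Z = Z[:, None]  # treat a 1-D array as n samples of one variable
    sq_norms = np.sum(Z**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * Z @ Z.T
    np.maximum(sq_dists, 0.0, out=sq_dists)  # clip tiny negatives from round-off
    return np.exp(-sq_dists / (2.0 * bandwidth**2))


def empirical_hsic(X, y, bandwidth=1.0):
    """Biased empirical HSIC: trace(K H L H) / n^2, with H the centering matrix."""
    K = gaussian_gram(X, bandwidth)
    L = gaussian_gram(y, bandwidth)
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return float(np.trace(K @ H @ L @ H)) / n**2


def hsic_best_subset(X, y, bandwidth=1.0):
    """Exhaustively search non-empty column subsets for the largest empirical HSIC.

    Assumption: brute force over all 2^p - 1 subsets, so only suitable for small p.
    """
    X = np.asarray(X, dtype=float)
    p = X.shape[1]
    best_subset, best_value = None, -np.inf
    for size in range(1, p + 1):
        for subset in combinations(range(p), size):
            value = empirical_hsic(X[:, list(subset)], y, bandwidth)
            if value > best_value:
                best_subset, best_value = subset, value
    return best_subset, best_value


# Toy usage (assumed data): the response depends on both features.
demo_rng = np.random.default_rng(0)
X_demo = demo_rng.normal(size=(100, 2))
y_demo = X_demo[:, 0] * X_demo[:, 1] + 0.1 * demo_rng.normal(size=100)
print(hsic_best_subset(X_demo, y_demo))
```

The search returns whichever subset has the largest estimated HSIC, which is exactly the quantity the counterexamples target: a subset can attain the maximum while omitting a feature the response depends on.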
