On the Limitation of Kernel Dependence Maximization for Feature Selection

Keli Liu; Feng Ruan

TL;DR

This work shows that maximizing the Hilbert-Schmidt Independence Criterion (HSIC) over feature subsets can fail to recover all variables needed to explain the response, even at the population level. By constructing a population distribution $\mathbb{P}_\Delta$ with two binary features, the authors demonstrate that HSIC can favor a subset $S$ that omits a functionally important feature, so that $\mathbb{E}[Y|X] \neq \mathbb{E}[Y|X_S]$ and $\mathcal{L}(Y|X) \neq \mathcal{L}(Y|X_S)$. They provide both a discrete and a continuous-weight formulation of the problem, prove the counterexamples for $p=2$, and extend them to general $p\ge 2$ via a padding argument. The results highlight a fundamental tradeoff: the convenience and scalability of dependence-maximization methods such as HSIC come at the cost of potentially missing variables essential for full explainability, whereas conditional-dependence minimization approaches guarantee full explainability under nonparametric assumptions. The implications extend to related kernel-based methods and to MMD-based feature selection for binary labels, motivating caution and further methodological development.

Abstract

A simple and intuitive method for feature selection consists of choosing the feature subset that maximizes a nonparametric measure of dependence between the response and the features. A popular proposal from the literature uses the Hilbert-Schmidt Independence Criterion (HSIC) as the nonparametric dependence measure. The rationale behind this approach to feature selection is that important features will exhibit a high dependence with the response and their inclusion in the set of selected features will increase the HSIC. Through counterexamples, we demonstrate that this rationale is flawed and that feature selection via HSIC maximization can miss critical features.
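The selection rule described in the abstract can be sketched in a few lines. The following is a minimal illustration, not the authors' code: it assumes Gaussian kernels with a fixed bandwidth, the standard biased empirical HSIC estimate $\mathrm{trace}(KHLH)/n^2$, and an exhaustive search over subsets. The function names (`gaussian_gram`, `empirical_hsic`, `hsic_best_subset`) and the toy data are hypothetical, and the toy data is not the paper's distribution $\mathbb{P}_\Delta$.

```python
import itertools
import numpy as np

def gaussian_gram(Z, bandwidth=1.0):
    """Gram matrix of the Gaussian kernel exp(-||z - z'||^2 / (2 * bandwidth^2))."""
    Z = np.asarray(Z, dtype=float).reshape(len(Z), -1)  # treat 1-D input as one column
    sq_norms = np.sum(Z ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * Z @ Z.T
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth ** 2))

def empirical_hsic(X, Y, bandwidth=1.0):
    """Biased (V-statistic) HSIC estimate: trace(K H L H) / n^2."""
    n = len(X)
    K = gaussian_gram(X, bandwidth)
    L = gaussian_gram(Y, bandwidth)
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return float(np.trace(K @ H @ L @ H)) / n ** 2

def hsic_best_subset(X, Y, bandwidth=1.0):
    """Score every non-empty feature subset and return the highest-scoring one."""
    p = X.shape[1]
    scores = {}
    for size in range(1, p + 1):
        for subset in itertools.combinations(range(p), size):
            scores[subset] = empirical_hsic(X[:, list(subset)], Y, bandwidth)
    best = max(scores, key=scores.get)
    return best, scores

# Toy data (an assumption for illustration only): two binary features,
# with the response depending strongly on the first and weakly on the second.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 2)).astype(float)
Y = X[:, 0] + 0.1 * X[:, 1]
best, scores = hsic_best_subset(X, Y)
```

Comparing the entries of `scores` shows which subsets the criterion prefers on a given sample. The paper's result concerns the population-level maximizer, so a finite-sample run like this one can only illustrate the procedure, not confirm or refute the theorem.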

Paper Structure (16 sections, 10 theorems, 51 equations)


Introduction
Problem Setup: Definition, Notation and Assumption
Main Results
Connections to Prior Literature
Proofs
The case $p = 2$
Definition of $\mathbb{P}_\Delta$
Evaluation of HSIC
Properties of $L_\Delta$
Finalizing Arguments
Proof of Theorem \ref{theorem:inconsistency-HSIC}
Proof of Theorem \ref{theorem:inconsistency-HSIC-b}
General dimension $p \ge 2$
Proofs of Technical Lemma
Proof of Lemma \ref{lemma:both-X_1-X_2-useful}
...and 1 more section

Key Result

Theorem 1.1

Assume the dimension $p \ge 2$. Given any kernel $k_X$ on $\mathbb{R}^p$ obeying Assumption \ref{assumption:kernel-representation} and any kernel $k_Y$ on $\mathbb{R}$ obeying Assumption \ref{assumption:kernel-representation-Y}, there exists a probability distribution $\mathbb{P}$ of $(X, Y)$ supported on [...] In the above, the first inequality holds in the $\mathcal{L}_2(\mathbb{P})$ sense.

Theorems & Definitions (11)

Definition 1.1: HSIC
Theorem 1.1
Theorem 1.2
Proposition 1
Lemma 2.1
Lemma 2.2
Lemma 2.3: Symmetry
Lemma 2.4: Monotonicity in Dominant Feature
Lemma 2.5
Lemma 2.6: Removing Weaker Signal Increases $L_\Delta$
...and 1 more
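
For orientation, the HSIC named in Definition 1.1 is commonly written in the following population form. This is the standard textbook expression, stated here as an assumption about what the definition contains; the paper's own statement, including any normalization and how $k_X$ is applied to a sub-vector $X_S$, may differ. Here $(X', Y')$ denotes an independent copy of $(X, Y)$.

$$
\mathrm{HSIC}(X; Y) = \mathbb{E}\big[k_X(X, X')\,k_Y(Y, Y')\big] + \mathbb{E}\big[k_X(X, X')\big]\,\mathbb{E}\big[k_Y(Y, Y')\big] - 2\,\mathbb{E}\Big[\mathbb{E}\big[k_X(X, X') \mid X\big]\,\mathbb{E}\big[k_Y(Y, Y') \mid Y\big]\Big]
$$

Under this form, feature selection by dependence maximization picks a subset

$$
S_* \in \operatorname*{arg\,max}_{S \subseteq \{1, \dots, p\}} \mathrm{HSIC}(X_S; Y),
$$

and the counterexamples concern cases where such a maximizer $S_*$ satisfies $\mathbb{E}[Y|X] \neq \mathbb{E}[Y|X_{S_*}]$.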
