Feature Selection via Maximizing Distances between Class Conditional Distributions

Chunxu Cao, Qiang Zhang

TL;DR

The paper tackles high-dimensional feature selection by directly maximizing the distance between class-conditional distributions using integral probability metrics (IPMs). It introduces a distributional-utility framework where feature subsets are scored by the Frobenius norm of a pairwise IPM distance matrix across classes, enabling model-free discrimination. A concrete instantiation based on the 1-Wasserstein distance is developed, including exact and approximate estimators, convergence properties, and three practical feature-selection algorithms (Top-$m$, Forward Add-in, and Backward Elimination). Empirical results on diverse datasets show competitive or superior classification accuracy and robustness to perturbations, validating the approach's expressiveness and applicability, with code to be released. The method offers a principled, geometry-aware alternative to traditional filter/wrapper approaches and highlights the value of distributional distances in supervised feature selection.
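The scoring idea summarized above can be illustrated with a minimal sketch. It assumes that Top-$m$ ranks features one at a time, and it uses SciPy's one-dimensional `wasserstein_distance` as the IPM; the function names `feature_utility` and `top_m_features` and the synthetic data are hypothetical and are not taken from the paper.

```python
import numpy as np
from itertools import combinations
from scipy.stats import wasserstein_distance


def feature_utility(x, y):
    """Frobenius norm of the pairwise class-distance matrix for one feature.

    x: 1-D array of feature values, y: 1-D array of class labels.
    Assumption: the distance is the empirical 1-D 1-Wasserstein distance.
    """
    classes = np.unique(y)
    k = len(classes)
    dist = np.zeros((k, k))
    for a, b in combinations(range(k), 2):
        d = wasserstein_distance(x[y == classes[a]], x[y == classes[b]])
        dist[a, b] = dist[b, a] = d  # the matrix is symmetric with zero diagonal
    return np.linalg.norm(dist, ord="fro")


def top_m_features(X, y, m):
    """Score every column of X independently and keep the m highest scores."""
    scores = np.array([feature_utility(X[:, j], y) for j in range(X.shape[1])])
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    return order[:m], scores


# Synthetic example (hypothetical data): only feature 2 depends on the class.
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 100)
X = rng.normal(size=(300, 5))
X[:, 2] += 3.0 * y
selected, scores = top_m_features(X, y, m=1)
```

Scoring features independently keeps the cost linear in the number of features, but it cannot account for redundancy between features; the Forward Add-in and Backward Elimination variants mentioned above would instead score subsets.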

Abstract

For many data-intensive tasks, feature selection is an important preprocessing step. However, most existing methods do not directly and intuitively explore the intrinsic discriminative information of features. We propose a novel feature selection framework based on the distance between class-conditional distributions, measured by integral probability metrics (IPMs). Our framework directly explores the discriminative information of features in the sense of distributions for supervised classification. We analyze the theoretical and practical aspects of IPMs for feature selection and construct selection criteria based on them. We propose several variants of our framework based on the 1-Wasserstein distance and evaluate them on real datasets from different domains. Experimental results show that our framework can outperform state-of-the-art methods in terms of classification accuracy and robustness to perturbations.

Paper Structure

This paper contains 30 sections, 3 theorems, 40 equations, 3 figures, and 1 table.

Key Result

Theorem 2.1

For the 1-Lipschitz function class $\mathcal{F}_{W} := \left\{ f : \|f\|_{L} \leq 1 \right\}$, the empirical IPM estimation problem has an explicit solution. The resulting estimate of the 1-Wasserstein distance is expressed in terms of values $a_{i}^{*}$ that solve a linear program.
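A minimal sketch of how such a linear-program estimate can be computed, assuming the form given in Sriperumbudur et al. (2012): maximize $\sum_{i} \tilde{Y}_{i} a_{i}$ subject to $|a_{i} - a_{j}| \leq \rho(X_{i}, X_{j})$ for all pairs, where $\tilde{Y}_{i} = 1/m$ for the $m$ samples from the first distribution and $\tilde{Y}_{i} = -1/n$ for the $n$ samples from the second. These symbols, the function name `wasserstein1_lp`, the Euclidean choice of $\rho$, and the use of SciPy's `linprog` are assumptions for illustration and are not taken from the text above.

```python
import numpy as np
from scipy.optimize import linprog
from scipy.spatial.distance import cdist
from scipy.stats import wasserstein_distance


def wasserstein1_lp(xp, xq):
    """Empirical 1-Wasserstein distance between two samples via a linear program."""
    xp = np.asarray(xp, dtype=float).reshape(len(xp), -1)
    xq = np.asarray(xq, dtype=float).reshape(len(xq), -1)
    m, n = len(xp), len(xq)
    pts = np.vstack([xp, xq])
    total = m + n
    # Signed sample weights: +1/m for the first sample, -1/n for the second.
    weights = np.concatenate([np.full(m, 1.0 / m), np.full(n, -1.0 / n)])
    rho = cdist(pts, pts)  # assumption: Euclidean ground metric
    # One row per ordered pair (i, j): a_i - a_j <= rho(X_i, X_j).
    rows, rhs = [], []
    for i in range(total):
        for j in range(total):
            if i != j:
                row = np.zeros(total)
                row[i], row[j] = 1.0, -1.0
                rows.append(row)
                rhs.append(rho[i, j])
    # The objective is unchanged by adding a constant to every a_i,
    # so a_0 is pinned to zero; the other variables are free.
    bounds = [(0.0, 0.0)] + [(None, None)] * (total - 1)
    res = linprog(-weights, A_ub=np.array(rows), b_ub=np.array(rhs),
                  bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return -res.fun  # linprog minimizes, so negate to recover the maximum


# Cross-check against SciPy's closed-form 1-D computation on small samples.
rng = np.random.default_rng(1)
p_sample = rng.normal(size=6)
q_sample = rng.normal(loc=1.0, size=8)
lp_value = wasserstein1_lp(p_sample, q_sample)
ref_value = wasserstein_distance(p_sample, q_sample)
```

The constraint matrix has on the order of $(m+n)^2$ rows, so this direct formulation is practical only for small samples; it is meant to show the structure of the estimator rather than an efficient implementation.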

Figures (3)

  • Figure 1: Classification Accuracy.
  • Figure 2: Relative standard deviation.
  • Figure 3: Experiments on nMNIST-AWGN. Figure (a) shows some randomly selected images from different classes of the nMNIST-AWGN dataset. Figure (b) shows the average accuracy achieved by the feature subsets obtained from feature selection on the nMNIST-AWGN dataset. Figure (c) shows the RSD values for different feature selection methods.

Theorems & Definitions (3)

  • Theorem 2.1: Estimator of the Wasserstein distance (Sriperumbudur et al., 2012)
  • Theorem 2.2: Estimator of MMD (Sriperumbudur et al., 2012)
  • Theorem 2.3