Study Features via Exploring Distribution Structure

Chunxu Cao, Qiang Zhang

TL;DR

The paper addresses redundancy in high-dimensional features by proposing a probabilistic framework that measures redundancy through distances between class-conditional distributions, notably using the $1$-Wasserstein distance. It introduces a distance-matrix utility $U(\mathcal{T})=\|\mathbf{D}^{\mathcal{T}}\|_F^2$ and extends to unsupervised feature selection via reconstruction objectives and Gromov-Wasserstein dissimilarities to compare metric-measure spaces. Empirically, the method demonstrates strong performance against standard baselines on diverse datasets, showing robustness to noise and clear advantages in identifying non-redundant, discriminative features for both supervised and unsupervised tasks. The approach offers a flexible, interpretable framework for redundancy detection and reduction with practical impact for scalable feature selection.
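As a rough illustration of the distance-matrix utility mentioned above, the following sketch scores a single feature by the squared Frobenius norm of its pairwise class-distance matrix. It rests on assumptions not confirmed by this summary: that each entry of $\mathbf{D}^{\mathcal{T}}$ is the $1$-Wasserstein distance between two class-conditional empirical distributions, and that $\mathcal{T}$ contains just one feature so the one-dimensional formula applies. The function names `wasserstein_1d`, `distance_matrix`, and `utility` are hypothetical, and the toy data are made up for the example.

```python
from bisect import bisect_right
from itertools import combinations


def wasserstein_1d(xs, ys):
    """1-Wasserstein distance between two 1-D empirical samples,
    computed as the integral of |F_x(t) - F_y(t)| over t."""
    xs, ys = sorted(xs), sorted(ys)
    grid = sorted(set(xs) | set(ys))
    total = 0.0
    for left, right in zip(grid, grid[1:]):
        # Empirical CDFs are constant on [left, right).
        cdf_x = bisect_right(xs, left) / len(xs)
        cdf_y = bisect_right(ys, left) / len(ys)
        total += abs(cdf_x - cdf_y) * (right - left)
    return total


def distance_matrix(values, labels):
    """Pairwise distances between class-conditional samples of one feature
    (assumed form of the matrix D for a single-feature subset)."""
    classes = sorted(set(labels))
    groups = {c: [v for v, l in zip(values, labels) if l == c] for c in classes}
    k = len(classes)
    D = [[0.0] * k for _ in range(k)]
    for i, j in combinations(range(k), 2):
        d = wasserstein_1d(groups[classes[i]], groups[classes[j]])
        D[i][j] = D[j][i] = d
    return D


def utility(D):
    """Squared Frobenius norm of the distance matrix."""
    return sum(d * d for row in D for d in row)


# Toy data (illustrative only): two classes, two samples each.
labels = [0, 0, 1, 1]
relevant = [0.0, 1.0, 10.0, 11.0]  # classes are well separated
noise = [0.0, 1.0, 0.0, 1.0]       # identical class-conditional samples

u_relevant = utility(distance_matrix(relevant, labels))
u_noise = utility(distance_matrix(noise, labels))
```

On this toy data the well-separated feature receives a much larger utility than the feature whose class-conditional samples coincide, which matches the intuition that larger distances between class-conditional distributions indicate a more discriminative feature. Handling a subset of several features would require a multivariate optimal-transport solver, which this sketch does not attempt.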

Abstract

In this paper, we present a novel framework for data redundancy measurement based on probabilistic modeling of datasets, and a new criterion for redundancy detection that is resilient to noise. We also develop new methods for data redundancy reduction using both deterministic and stochastic optimization techniques. Our framework is flexible and can handle different types of features, and our experiments on benchmark datasets demonstrate the effectiveness of our methods. We provide a new perspective on feature selection, and propose effective and robust approaches for both supervised and unsupervised learning problems.

Paper Structure (18 sections, 13 equations, 9 figures, 1 table)

Figures (9)

  • Figure 1: Distance matrices of relevant and irrelevant features. Relevant features show much greater disparity than random features.
  • Figure 2: Insights provided by the distance matrices.
  • Figure 3: Distance matrices of the relative change contributed by a single feature. The figure shows the relative change caused by adding the feature with the largest disparity (top) and the feature with the smallest disparity, respectively.
  • Figure 4: Distance matrices for redundant features. We first select the top 5 relevant features, then sequentially duplicate some of the selected features. (a) shows the distance matrices as the number of redundant features increases; (b) shows the same matrices after dividing each matrix by its mean.
  • Figure 5: Demo of exploring relevance and redundancy by distance matrices.
  • ...and 4 more figures