Divergence Maximizing Linear Projection for Supervised Dimension Reduction

Biao Chen, Joshua Kortje

TL;DR

This work tackles supervised dimension reduction (SDR) for binary classification under a general Gaussian model by proposing two linear projection methods that maximize the KL divergence between the projected class-conditional distributions. The two methods target the large-μ and small-μ regimes, respectively, and a regime-based rule guides the choice between them. The analysis extends to the multiclass case under a common covariance matrix, where multiclass LDA is shown to be optimal for maximizing pairwise KL divergence. Key contributions include unifying prior SDR approaches, establishing the KL-optimality of LDA under equal covariance, and providing practical procedures for regime selection and projection computation, validated by numerical experiments. The methods rely only on first- and second-order statistics, are easy to implement, and yield interpretable subspaces that may apply to real-world Gaussian or near-Gaussian data; future work targets non-Gaussian settings and mixtures.

Abstract

This paper proposes two linear projection methods for supervised dimension reduction that use only first- and second-order statistics. The methods, each catering to a different parameter regime, are derived under the general Gaussian model by maximizing the Kullback-Leibler divergence between the two classes in the projected sample for a binary classification problem. They subsume existing linear projection approaches developed under simplifying assumptions on the Gaussian distributions, such as a shared mean or a shared covariance matrix. As a by-product, we establish that multi-class linear discriminant analysis, a celebrated method for classification and supervised dimension reduction, is provably optimal for maximizing pairwise Kullback-Leibler divergence when the Gaussian populations share an identical covariance matrix. For the case when the Gaussian distributions share an equal mean, we establish conditions under which the optimal subspace remains invariant regardless of which of the two directions of the Kullback-Leibler divergence is used, despite the asymmetry of the divergence measure itself. Such conditions encompass the classical case of signal plus noise, where both the signal and noise have zero mean and arbitrary covariance matrices. Experiments are conducted to validate the proposed solutions, demonstrate their superior performance over existing alternatives, and illustrate the procedure for selecting the appropriate linear projection solution.
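
For reference, the objective described in the abstract has a closed form under the Gaussian model. The following is a sketch in assumed notation (the symbols $\mu_i$, $\Sigma_i$, $A$, and $D(A)$ are illustrative and may differ from the paper's own): if class $i\in\{0,1\}$ is distributed as $\mathcal{N}(\mu_i,\Sigma_i)$ in $\mathbb{R}^d$ and $A$ is an $r\times d$ projection matrix of rank $r$, then the projected classes are $\mathcal{N}(A\mu_i, A\Sigma_i A^T)$, and the standard Gaussian KL formula gives

$$
D(A) = \frac{1}{2}\left[\operatorname{tr}\!\left((A\Sigma_1 A^T)^{-1} A\Sigma_0 A^T\right) + (\mu_1-\mu_0)^T A^T (A\Sigma_1 A^T)^{-1} A(\mu_1-\mu_0) - r + \ln\frac{\det(A\Sigma_1 A^T)}{\det(A\Sigma_0 A^T)}\right].
$$

Swapping the roles of the two classes gives the other direction of the divergence, which in general takes a different value.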

Paper Structure

This paper contains 17 sections, 7 theorems, 45 equations, 5 figures, 3 tables, and 2 algorithms.

Key Result

Lemma 1

For any $r\times d$ matrix $A$ of rank $r$, there exists an $r\times d$ orthonormal matrix $B$, i.e., the rows of $B$ are orthonormal vectors, such that the resulting KLD is identical.
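
Lemma 1 can be checked numerically. The sketch below is not the paper's procedure: it assumes two Gaussian classes with randomly generated parameters, builds $B$ from a QR factorization of $A^T$ (one possible construction, chosen here for illustration), and compares the KL divergence of the projected classes under $A$ and under $B$. All function and variable names are hypothetical.

```python
import numpy as np

def gaussian_kl(mu0, cov0, mu1, cov1):
    """KL( N(mu0, cov0) || N(mu1, cov1) ) in nats (standard closed form)."""
    r = mu0.shape[0]
    diff = mu1 - mu0
    trace_term = np.trace(np.linalg.solve(cov1, cov0))
    maha_term = diff @ np.linalg.solve(cov1, diff)
    _, logdet0 = np.linalg.slogdet(cov0)
    _, logdet1 = np.linalg.slogdet(cov1)
    return 0.5 * (trace_term + maha_term - r + logdet1 - logdet0)

def projected_kl(P, mu0, cov0, mu1, cov1):
    """KL divergence between the two classes after the linear projection x -> P x."""
    return gaussian_kl(P @ mu0, P @ cov0 @ P.T, P @ mu1, P @ cov1 @ P.T)

def orthonormalize_rows(A):
    """Return B with orthonormal rows spanning the same row space as A.

    Reduced QR gives A^T = Q R with Q of shape (d, r); B = Q^T, so A = R^T B
    with R^T invertible whenever A has rank r.
    """
    Q, _ = np.linalg.qr(A.T)
    return Q.T

# Illustrative (assumed) problem: d = 6 dimensions projected down to r = 2.
rng = np.random.default_rng(0)
d, r = 6, 2
mu0, mu1 = rng.normal(size=d), rng.normal(size=d)
M0, M1 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
cov0 = M0 @ M0.T + d * np.eye(d)  # symmetric positive definite
cov1 = M1 @ M1.T + d * np.eye(d)

A = rng.normal(size=(r, d))       # generic rank-r projection
B = orthonormalize_rows(A)        # orthonormal-row counterpart

kl_A = projected_kl(A, mu0, cov0, mu1, cov1)
kl_B = projected_kl(B, mu0, cov0, mu1, cov1)
print(f"KL under A: {kl_A:.6f}, KL under B: {kl_B:.6f}")
```

The two values agree up to floating-point error because $A = R^T B$ with $R^T$ invertible, and the KL divergence is unchanged by an invertible linear map applied to both distributions. This is why a search over projections can be restricted to matrices with orthonormal rows without loss.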

Figures (5)

  • Figure 1: The $g(\cdot)$ function in Eq. (\ref{eq:g}).
  • Figure 2: Density plots of the projected samples using the two proposed algorithms for the large-$\mu$ case: (a) Algorithm 1; (b) Gradient descent using Algorithm 1 as initialization; (c) Algorithm 2; and (d) Gradient descent using Algorithm 2 as initialization. The contour location is chosen at $1/1000$ of the peak density value.
  • Figure 3: Density plots of the projected samples using the two proposed algorithms for the small-$\mu$ case: (a) Algorithm 1; (b) Gradient descent using Algorithm 1 as initialization; (c) Algorithm 2; and (d) Gradient descent using Algorithm 2 as initialization. The contour location is chosen at $1/1000$ of the peak density value.
  • Figure 4: Scatter plots of the projected samples using the three SDR approaches for large-$\mu$: (a) Algorithm 1, (b) Algorithm 2, and (c) LoL.
  • Figure 5: Scatter plots of the projected samples using the three SDR approaches for small-$\mu$: (a) Algorithm 1, (b) Algorithm 2, and (c) LoL.

Theorems & Definitions (13)

  • Lemma 1 (Theorem 1 in [Dwivedi:22])
  • Lemma 2
  • proof
  • Lemma 3
  • proof
  • Theorem 1
  • proof
  • Corollary 1
  • proof
  • Theorem 2
  • ...and 3 more