Synergistic eigenanalysis of covariance and Hessian matrices for enhanced binary classification

Agus Hartoyo, Jan Argasiński, Aleksandra Trenk, Kinga Przybylska, Anna Błasiak, Alessandro Crimi

TL;DR

Binary classification benefits from combining data variability with the curvature of the model's loss. The authors propose a 2D projection formed by the leading eigenvectors of the covariance $S(\boldsymbol{\theta})$ and the Hessian $H_{\boldsymbol{\theta}}$, denoted $\mathbf{v}_1$ and $\mathbf{v}_1'$ respectively and combined as $\mathbf{U}=[\mathbf{v}_1,\mathbf{v}_1']$ with $\mathbf{X}_{\text{proj}}=\mathbf{X}\mathbf{U}$, to address the two LDA criteria: maximizing the squared between-class mean distance and minimizing the within-class variance. They prove two theorems (with sketches) under ideal isotropy conditions, showing that maximizing the variance along the covariance eigendirection increases the squared between-class mean distance $d^2$, while maximizing the Hessian decreases the within-class variance; together these improve class separability. Empirically, across the Wisconsin Breast Cancer, Heart Disease, Pima diabetes, and neural spike-train datasets, the method outperforms PCA, Hessian-based projections, LDA, KDA, LOL, UMAP, and LLE, while also offering interpretable 2D visualizations of DNN decision boundaries and feature contributions. The work demonstrates that a simple, provably grounded 2D projection can enhance interpretability and classification performance with a computational footprint similar to that of traditional dimensionality-reduction approaches.
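
The projection $\mathbf{X}_{\text{proj}}=\mathbf{X}\mathbf{U}$ described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `leading_eigenvector` and `project_2d` are hypothetical, and the Hessian `H` is assumed to be a symmetric matrix of the same dimension as the feature space, supplied by the caller. The demo uses a random positive semi-definite placeholder for `H`, not a Hessian obtained from a trained network.

```python
import numpy as np

def leading_eigenvector(M):
    """Unit eigenvector of the symmetric matrix M with the largest eigenvalue."""
    eigvals, eigvecs = np.linalg.eigh(M)  # eigh returns eigenvalues in ascending order
    return eigvecs[:, -1]

def project_2d(X, H):
    """Project X (n_samples, d) onto U = [v1, v1'].

    v1  : leading eigenvector of the sample covariance of X
    v1' : leading eigenvector of H, a symmetric (d, d) matrix (assumed given)
    """
    S = np.cov(X, rowvar=False)           # (d, d) covariance, columns are features
    U = np.column_stack([leading_eigenvector(S), leading_eigenvector(H)])
    return X @ U, U

# Demo on synthetic data (assumption: H is only a stand-in for a model Hessian).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
A = rng.normal(size=(5, 5))
H = A @ A.T                               # placeholder symmetric PSD matrix
X_proj, U = project_2d(X, H)
```

Because the two columns of `U` come from different matrices, they are unit-length but not necessarily orthogonal to each other.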

Abstract

Covariance and Hessian matrices have been analyzed separately in the literature for classification problems. However, integrating these matrices has the potential to enhance their combined power in improving classification performance. We present a novel approach that combines the eigenanalysis of a covariance matrix evaluated on a training set with a Hessian matrix evaluated on a deep learning model to achieve optimal class separability in binary classification tasks. Our approach is substantiated by formal proofs that establish its capability to maximize between-class mean distance (the concept of \textit{separation}) and minimize within-class variances (the concept of \textit{compactness}), which together define the two linear discriminant analysis (LDA) criteria, particularly under ideal data conditions such as isotropy around class means and dominant leading eigenvalues. By projecting data into the combined space of the most relevant eigendirections from both matrices, we achieve optimal class separability as per these LDA criteria. Empirical validation across neural and health datasets consistently supports our theoretical framework and demonstrates that our method outperforms established methods. Our method stands out by addressing both separation and compactness criteria, unlike PCA and the Hessian method, which predominantly emphasize one criterion each. This comprehensive approach captures intricate patterns and relationships, enhancing classification performance. Furthermore, by using both LDA criteria, our method outperforms LDA itself by leveraging higher-dimensional feature spaces, in accordance with Cover's theorem, which favors linear separability in higher dimensions. Additionally, our approach sheds light on complex DNN decision-making, rendering it comprehensible within a 2D space.
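
The two LDA criteria named in the abstract can be made concrete for a 1D projection. The sketch below is illustrative only: the helper name `lda_criteria` is hypothetical, labels are assumed to be coded as 0 and 1, and within-class variances are computed as population variances (NumPy's default), which may differ from the convention used in the paper.

```python
import numpy as np

def lda_criteria(z, y):
    """Separation, compactness, and their ratio for 1D projected values z.

    z : 1D array of projected samples
    y : binary labels (assumed coded as 0 and 1)
    """
    z0, z1 = z[y == 0], z[y == 1]
    separation = (z0.mean() - z1.mean()) ** 2   # squared between-class mean distance
    compactness = z0.var() + z1.var()           # sum of within-class variances
    return separation, compactness, separation / compactness

# Tiny worked example: class means are 1 and 5, each class has variance 2/3.
z = np.array([0.0, 1.0, 2.0, 4.0, 5.0, 6.0])
y = np.array([0, 0, 0, 1, 1, 1])
sep, comp, ratio = lda_criteria(z, y)
```

A larger ratio means the classes are farther apart relative to their spread, which is the sense of "class separability" used in the abstract.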

Paper Structure (27 sections, 4 theorems, 41 equations, 14 figures)

Key Result

Theorem 1

Consider two sets of 1D data points representing two classes, denoted as $C_1$ and $C_2$, each consisting of $n$ samples. The data in $C_1$ and $C_2$ are centered around their respective means, $\mu_1$ and $\mu_2$. Here, $\mu$ denotes the overall mean of the combined data from $C_1$ and $C_2$. Furth where $\lambda = \frac{1}{2} \left( \frac{\sigma_1^2}{\sigma^2} + \frac{\sigma_2^2}{\sigma^2} \righ

Figures (14)

  • Figure 1: Projection of the Wisconsin breast cancer data into different combined spaces of the covariance and Hessian eigenvectors. (a) Nine selected projection plots, each representing data projected onto a distinct space created by combining the first three covariance and first three Hessian eigenvectors. (b) Heatmap showing the squared between-class mean distance for projections onto varying combinations of covariance and Hessian eigenvectors. The heatmap demonstrates that the values remain constant vertically across different Hessian eigenvectors, while exhibiting a noticeable descending order horizontally, aligning with the descending order of the variance. These results essentially concretize our formal premise, empirically validating the linear relationship described in Eq. (\ref{eq:1}) between the variance and the squared between-class mean distance. (c) Heatmap showing the sum of within-class variances for projections onto different combinations of covariance and Hessian eigenvectors. The values remain constant horizontally across different covariance eigenvectors but exhibit a clear ascending order vertically, aligning with the descending order of the Hessian. These empirical results validate the negative correlation between the Hessian and the within-class variance described in Eq. (\ref{eq:2}) within the framework of our theoretical foundation. (d) Heatmap displaying the LDA ratio, representing the ratio between the squared between-class mean distances presented in (b) and the corresponding within-class variances shown in (c). The highest LDA ratio is observed for the combination of the first Hessian eigenvector with the first covariance eigenvector. Notably, a general descending pattern is observed both horizontally and vertically across different combinations, indicating that both covariance eigenanalysis (represented along the horizontal direction) and Hessian eigenanalysis (represented along the vertical direction) equally contribute to the class separability (represented by the LDA ratio).
  • Figure 2: Performance comparison of data projection methods using cross-validation on four distinct datasets: (a) WBCD, (b) heart disease, (c) neural spike train, and (d) Pima Indians diabetes datasets. This figure presents the average F1 score, ROC AUC, and Cohen's Kappa values obtained through 5- or 10-fold cross-validation for nine data projection techniques: PCA, KPCA (cosine similarity for WBCD, polynomial kernel with degree=3 for Heart, linear kernel for Neural, cosine similarity for Pima), Hessian, UMAP, LLE, LOL, LDA, KDA (cosine similarity for WBCD, RBF kernel with coefficient=0.01 for Heart, sigmoid kernel with coefficient=2 for Neural, cosine similarity for Pima), and the proposed method. Notably, the proposed method consistently outperforms all other techniques, achieving the highest scores across all evaluation metrics.
  • Figure 3: Pairplot of the first four predictor attributes in the WBCD dataset, colored by class labels. This pairplot presents only four of the 30 predictor attributes, projecting the 30-dimensional data into 1D and 2D spaces between individual attributes and attribute pairs. The diagonal plots display the distribution of each attribute individually using kernel density estimation (KDE) curves, where separation is assessed based on the overlap of the KDE curves for each class. The non-diagonal plots show scatter plots of pairs of attributes, where separation between classes is evaluated based on the extent of overlap or spread of points corresponding to different class labels. The visualizations show that these projections do not provide good separability between the two classes.
  • Figure 4: Training set and test set projections on the WBCD dataset using the proposed method. The proposed method projects the high-dimensional data into the leading eigendirections of the covariance and Hessian matrices. The training set projection illustrates how the training data is separated using the proposed method, while the test set projection shows the separation of new unseen data. Visually, the projections indicate good separability between the two classes, demonstrating the effectiveness of the proposed method.
  • Figure 5: Parameter contributions to the leading eigenvectors of the covariance and Hessian matrices for the WBCD dataset. The contributions are calculated as the absolute values of the elements in the first eigenvector of the covariance and Hessian matrices, respectively. Parameters are sorted by their absolute contributions, and the horizontal bar plots display these sorted contributions for each parameter. The first plot indicates which attributes contribute the most to maximizing the between-class mean distance, while the second plot indicates which attributes contribute the most to minimizing the within-class variances.
  • ...and 9 more figures
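
The contribution ranking described in the Figure 5 caption (absolute values of the elements of a leading eigenvector, sorted) can be sketched as follows. The function name `eigvec_contributions` and the feature names are hypothetical, and the matrix used in the demo is a toy diagonal matrix rather than a covariance or Hessian from the WBCD experiments.

```python
import numpy as np

def eigvec_contributions(M, feature_names):
    """Rank features by |element| of the leading eigenvector of symmetric M."""
    eigvals, eigvecs = np.linalg.eigh(M)   # ascending eigenvalues
    v1 = eigvecs[:, -1]                    # leading eigenvector
    contrib = np.abs(v1)                   # sign of an eigenvector is arbitrary
    order = np.argsort(contrib)[::-1]      # largest contribution first
    return [(feature_names[i], float(contrib[i])) for i in order]

# Toy example: the largest eigenvalue (9) belongs to the second axis,
# so the hypothetical feature "f1" carries the whole leading eigenvector.
S = np.diag([1.0, 9.0, 4.0])
ranked = eigvec_contributions(S, ["f0", "f1", "f2"])
```

Taking absolute values matters because an eigenvector and its negation are equally valid, so only the magnitudes of its elements are comparable across runs.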

Theorems & Definitions (6)

  • Theorem 1: Maximizing covariance for maximizing squared between-class mean distance
  • Theorem 2: Maximizing Hessian for minimizing within-class variance
  • Theorem 3: Within-class variance scaling through z-score normalization
  • proof
  • Theorem 4: Variance ratio preservation upon projection onto a vector
  • proof
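
The statement of Theorem 3 is not reproduced here, so the following is only a numerical illustration of the standard fact its title points to: dividing 1D data by the overall standard deviation $\sigma$ scales each within-class variance $\sigma_i^2$ to $\sigma_i^2/\sigma^2$, the same terms that appear in the expression for $\lambda$ in Theorem 1. The synthetic classes, their parameters, and the reading of $\sigma$ as the standard deviation of the combined data are assumptions made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
c1 = rng.normal(loc=-2.0, scale=1.0, size=500)   # synthetic class C1
c2 = rng.normal(loc=3.0, scale=2.0, size=500)    # synthetic class C2

# z-score normalization using the mean and std of the combined data
x = np.concatenate([c1, c2])
mu, sigma = x.mean(), x.std()
z1 = (c1 - mu) / sigma
z2 = (c2 - mu) / sigma

# Average within-class variance after normalization ...
lam = 0.5 * (z1.var() + z2.var())
# ... equals the average of sigma_i^2 / sigma^2 computed on the raw data.
expected = 0.5 * (c1.var() / sigma**2 + c2.var() / sigma**2)
```

Subtracting the overall mean does not affect the variances; only the division by $\sigma$ rescales them.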