Gradient Boosting Mapping for Dimensionality Reduction and Feature Extraction

Anri Patron; Ayush Prasad; Hoang Phuc Hau Luu; Kai Puolamäki

TL;DR

GBMAP introduces a supervised dimensionality reduction method that builds an embedding from sequential one-layer perceptron weak learners to capture directions relevant to the target task. It defines a path distance and an embedding distance in terms of the learned weak learner outputs, and shows that the embedding yields features for simple linear models that are competitive with state-of-the-art regressors/classifiers, while also enabling out-of-distribution detection and drift monitoring. The method scales to large datasets (seconds for million-point datasets) and provides interpretable features via local linear explanations. The paper demonstrates that GBMAP's embedding-based distance improves distance-based learning and can be used for supervision via differences between models.

Abstract

A fundamental problem in supervised learning is to find a good set of features or distance measures. If the new set of features is of lower dimensionality and can be obtained by a simple transformation of the original data, they can make the model understandable, reduce overfitting, and even help to detect distribution drift. We propose a supervised dimensionality reduction method, Gradient Boosting Mapping (GBMAP), where the outputs of weak learners -- defined as one-layer perceptrons -- define the embedding. We show that the embedding coordinates provide better features for the supervised learning task, making simple linear models competitive with state-of-the-art regressors and classifiers. We also use the embedding to find a principled distance measure between points. The features and distance measures automatically ignore directions irrelevant to the supervised learning task. We also show that we can reliably detect out-of-distribution data points with potentially large regression or classification errors. GBMAP is fast and works in seconds for datasets of a million data points or hundreds of features. As a bonus, GBMAP provides regression and classification performance comparable to state-of-the-art supervised learning methods.
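To make the idea of an embedding built from boosted one-layer perceptrons more concrete, the following is a minimal sketch for the regression case, not the authors' implementation. It assumes weak learners of the form $f_j(x) = a_j + b_j\, g(w_j^\top x)$ with a softplus nonlinearity $g$, a squared loss with a small Ridge penalty on $w_j$, and plain gradient steps on $w_j$ with $(a_j, b_j)$ re-solved by least squares. All function names (`fit_weak_learner`, `fit_boosted_embedding`, `embed`, `predict`) and hyperparameters are hypothetical choices made for illustration.

```python
import numpy as np


def softplus(z):
    # Numerically stable softplus; an assumed choice for the nonlinearity g.
    return np.logaddexp(0.0, z)


def softplus_grad(z):
    # Derivative of softplus (the logistic sigmoid), written in a stable form.
    return 0.5 * (1.0 + np.tanh(0.5 * z))


def solve_ab(h, r):
    # Least-squares fit of r ~ a + b * h for a fixed hidden activation h.
    A = np.column_stack([np.ones_like(h), h])
    (a, b), *_ = np.linalg.lstsq(A, r, rcond=None)
    return a, b


def fit_weak_learner(X, r, rng, steps=200, lr=0.05, lam=1e-3):
    # Fit one weak learner a + b * g(w.x) to the current residuals r.
    n, p = X.shape
    w = rng.normal(scale=0.5, size=p)
    for _ in range(steps):
        z = X @ w
        a, b = solve_ab(softplus(z), r)
        err = a + b * softplus(z) - r
        # Gradient of the mean squared error plus Ridge penalty w.r.t. w.
        grad_w = 2.0 * X.T @ (err * b * softplus_grad(z)) / n + 2.0 * lam * w
        w -= lr * grad_w
    a, b = solve_ab(softplus(X @ w), r)
    return a, b, w


def fit_boosted_embedding(X, y, m=8, seed=0):
    # Sequentially fit m weak learners, each on the residual of the previous sum.
    rng = np.random.default_rng(seed)
    learners, pred = [], np.zeros(len(y))
    for _ in range(m):
        a, b, w = fit_weak_learner(X, y - pred, rng)
        pred = pred + a + b * softplus(X @ w)
        learners.append((a, b, w))
    return learners


def embed(X, learners):
    # Embedding coordinates: the hidden activations g(w_j.x), one per weak learner.
    W = np.column_stack([w for _, _, w in learners])
    return softplus(X @ W)


def predict(X, learners):
    # Boosted prediction: the sum of all weak learner outputs.
    return sum(a + b * softplus(X @ w) for a, b, w in learners)
```

Under these assumptions, `embed(X, learners)` returns an array with one column per weak learner, which could then be passed to a simple linear model as features. Because each $(a_j, b_j)$ is solved by least squares on the current residuals, the training error of this sketch cannot increase from one stage to the next.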

Paper Structure (29 sections, 4 theorems, 19 equations, 8 figures, 5 tables)

Introduction
Related Work
Theory and Methodology
Definition of Our Model
Learning the Model Parameters
Embedding of Data Points
Computational Complexity
If There Is no Non-Linearity
Relation to Other Boosting Algorithms
Numerical Experiments
Datasets and Algorithms
Scaling
Regression and Classification
Supervised Learning Features
Out-of-Distribution Detection
...and 14 more sections

Key Result

Lemma

The optimization problem for $f_1$ ($j=1$) in Eq. eq:argmin reduces to ordinary least squares linear regression with $b_1=1$ for the quadratic loss, and to standard logistic regression for the logistic loss, both with Ridge regularization, if there is no nonlinearity, i.e., $g(z)=z$. The parameters of
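The quadratic-loss case of this statement can be checked numerically. The sketch below is an illustration under stated assumptions, not the paper's exact formulation: it takes a linear first learner $f_1(x) = w^\top x$ (that is, $g(z)=z$, $b_1=1$, and no intercept for brevity) with the objective $\sum_i (y_i - w^\top x_i)^2 + \lambda \lVert w \rVert^2$, and compares gradient descent on that objective with the standard closed-form Ridge solution $(X^\top X + \lambda I)^{-1} X^\top y$. The function names and the value of $\lambda$ are hypothetical.

```python
import numpy as np


def ridge_closed_form(X, y, lam):
    # Closed-form minimizer of ||y - Xw||^2 + lam * ||w||^2.
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)


def linear_learner_gradient_descent(X, y, lam, steps=500):
    # Gradient descent on the same objective for a linear learner f(x) = w.x.
    w = np.zeros(X.shape[1])
    # Step size 1/L, where L = 2 * (sigma_max(X)^2 + lam) bounds the curvature.
    lr = 1.0 / (2.0 * (np.linalg.norm(X, 2) ** 2 + lam))
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ w - y) + 2.0 * lam * w
        w -= lr * grad
    return w


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.normal(size=200)

w_closed = ridge_closed_form(X, y, lam=1.0)
w_gd = linear_learner_gradient_descent(X, y, lam=1.0)
```

With the nonlinearity removed, the two weight vectors should agree up to numerical tolerance, which is the sense in which the first weak learner behaves like Ridge-regularized linear regression in this simplified setting.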

Figures (8)

Figure 1: Embeddings as features for OLS regression (qm9-10k, $R^2$, (a)) and logistic regression (higgs-10k, accuracy, (b)). The baselines are OLS regression (a) and logistic regression (b) trained on the original data. Unlike pca and lol, the gbmap transformation is not restricted by the data dimensionality $p$ (the number of covariates) and can transform the data to an arbitrary number of dimensions. The higgs-10k and qm9-10k datasets have $p<32$; hence, the lines for pca and lol end at $m=16$.
Figure 2: Plots of the drift indicator against the loss (left) and ROC curves (right) for the gbmap (left column) and euclid (right column) drifters on the cpu-small dataset. The horizontal line denotes our chosen concept drift threshold, while the vertical line indicates the drift indicator threshold that leads to the maximal $F_1$ score (not used in the analysis), as in oikarinenDetectingVirtualConcept2021. The blue spheres are the data from the in-distribution set a2, and the orange crosses are the data from the out-of-distribution set b. The gbmap drifter detects drift with a high AUC of $0.91$, while the euclid drifter has an AUC of $0.84$.
Figure 3: Local feature importance in terms of local linear regression coefficients in the decision of gbmap for the 50th training data point in the airquality dataset. We have omitted the intercept term from the plot.
Figure 4: gbmap drifter for regression datasets
Figure 5: euclid drifter for regression datasets
...and 3 more figures

Theorems & Definitions (8)

Lemma
Proof
Lemma
Proof
Lemma
Proof
Lemma
Proof