
LCEN: A Nonlinear, Interpretable Feature Selection and Machine Learning Algorithm

Pedro Seber, Richard D. Braatz

TL;DR

LCEN addresses the need for nonlinear, interpretable, and sparse feature selection. It integrates a LASSO-based expansion, two clip steps, and elastic-net fitting to produce sparse, accurate models, capable of rediscovering physical laws from data. Across artificial and real datasets, LCEN demonstrates robustness to noise, multicollinearity, and data scarcity, often matching or surpassing dense nonlinear methods while maintaining interpretability and faster runtimes than comparable thresholded EN approaches. The approach shows practical value for critical domains and offers clear avenues for extension to classification and physics-guided modeling.
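The LASSO, clip, elastic-net, clip sequence named above can be illustrated with a rough NumPy sketch. This is not the authors' implementation: the power-based `expand` helper, the hand-written coordinate-descent solver, the fixed hyperparameters (`alpha`, `l1_ratio`, `cutoff`), and all function names are assumptions made for illustration, and the actual algorithm may order or tune these steps differently.

```python
import numpy as np

def expand(X, degree=2):
    """Hypothetical nonlinear expansion: stack element-wise powers 1..degree."""
    return np.hstack([X ** d for d in range(1, degree + 1)])

def elastic_net_cd(X, y, alpha, l1_ratio, n_iter=200):
    """Coordinate descent for
    (1/2n)||y - Xw||^2 + alpha*l1_ratio*||w||_1 + 0.5*alpha*(1-l1_ratio)*||w||^2.
    Setting l1_ratio=1 gives the LASSO."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.astype(float).copy()           # y - Xw with w = 0
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            resid += X[:, j] * w[j]          # remove feature j's contribution
            rho = X[:, j] @ resid / n
            soft = np.sign(rho) * max(abs(rho) - alpha * l1_ratio, 0.0)
            w[j] = soft / (col_sq[j] + alpha * (1.0 - l1_ratio))
            resid -= X[:, j] * w[j]          # add the updated contribution back
    return w

def lasso_clip_en(X, y, alpha=0.05, l1_ratio=0.5, cutoff=0.05):
    """Sketch of a LASSO -> clip -> elastic net -> clip pipeline.
    Returns coefficients for the standardized columns of X."""
    sd = X.std(axis=0)
    sd = np.where(sd == 0, 1.0, sd)          # guard against constant columns
    Xs = (X - X.mean(axis=0)) / sd
    yc = y - y.mean()
    # Step 1: LASSO over all candidate features.
    w_lasso = elastic_net_cd(Xs, yc, alpha, 1.0)
    # Step 2: first clip, dropping features with small LASSO coefficients.
    keep = np.abs(w_lasso) >= cutoff
    # Step 3: elastic net refit on the surviving features only.
    w = np.zeros(X.shape[1])
    w[keep] = elastic_net_cd(Xs[:, keep], yc, alpha, l1_ratio)
    # Step 4: second clip on the elastic-net coefficients.
    w[np.abs(w) < cutoff] = 0.0
    return w
```

Calling `lasso_clip_en(expand(X, 2), y)` returns one coefficient per expanded feature, with zeros marking features removed by either clip step. In practice the hyperparameters would be chosen by cross-validation rather than fixed as they are here.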

Abstract

Interpretable models can have advantages over black-box models, and interpretability is essential for the application of machine learning in critical settings, such as aviation or medicine. This article introduces the LASSO-Clip-EN (LCEN) algorithm for nonlinear, interpretable feature selection and machine learning modeling. Across a wide variety of artificial and empirical datasets, LCEN constructed sparse models that were frequently more accurate than those of other methods, including sparse, nonlinear methods. LCEN was empirically observed to be robust against many issues typically present in datasets and modeling, including noise, multicollinearity, and data scarcity. As a feature selection algorithm, LCEN matched or surpassed the thresholded elastic net but was, on average, 10.3-fold faster in our experiments. LCEN for feature selection can also rediscover multiple physical laws from empirical data. As a machine learning algorithm, when tested on processes with no known physical laws, LCEN achieved better results than many other dense and sparse methods, and was comparable to or better than ANNs on multiple datasets.

Paper Structure

This paper contains 20 sections, 10 figures, 17 tables, and 1 algorithm.

Figures (10)

  • Figure 1: Test set median MSE for the "4th-degree, univariate polynomial" dataset. ALVEN results (left, reproduced from Sun and Braatz (2021) with permission) show that the error increases monotonically with noise and that the degree 4 "unbiased model" is the best at low noise levels but is displaced by the degree 2 "biased model" at higher noise levels. In contrast, LCEN results (right) show that the median errors converge at higher noise levels. Furthermore, the LCEN median errors are typically over 60% smaller than the ALVEN median errors, and the degree 4 "unbiased model" is always the best model regardless of the noise level. The "noise level" and "Noise variance $\sigma^2$" terms are equivalent in this figure. Fig. \ref{SPA_comparison_interquartile} contains interquartile ranges for the LCEN model's test MSEs.
  • Figure A1: Plots of the Matthews Correlation Coefficients (MCCs) for models tested on the "Artificial Linear" dataset with 0% noise and 25% additional false features, as written in each subfigure's title.
  • Figure A2: Plots of the Matthews Correlation Coefficients (MCCs) for models tested on the "Artificial Linear" dataset with 0% noise and 50% additional false features, as written in each subfigure's title.
  • Figure A3: Plots of the Matthews Correlation Coefficients (MCCs) for models tested on the "Artificial Linear" dataset with 0% noise and 75% additional false features, as written in each subfigure's title.
  • Figure A4: Plots of the Matthews Correlation Coefficients (MCCs) for models tested on the "Artificial Linear" dataset with 0% noise and 100% additional false features, as written in each subfigure's title.
  • ...and 5 more figures
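
The Matthews Correlation Coefficient reported in Figures A1 to A4 can be computed for a feature selection result by treating each candidate feature as one binary decision (selected or not selected) and comparing it with the known set of true features. The sketch below shows one such computation. The function name, its arguments, and the convention of returning 0 when the denominator vanishes are assumptions made for illustration, and the paper may compute the score differently.

```python
import math

def feature_selection_mcc(true_support, selected_support, n_features):
    """MCC of a selected feature set against the true feature set.

    true_support / selected_support: iterables of feature indices.
    n_features: total number of candidate features (true plus false).
    """
    true_set, sel_set = set(true_support), set(selected_support)
    tp = len(true_set & sel_set)           # true features that were selected
    fp = len(sel_set - true_set)           # false features that were selected
    fn = len(true_set - sel_set)           # true features that were missed
    tn = n_features - tp - fp - fn         # false features correctly rejected
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0                         # assumed convention for degenerate cases
    return (tp * tn - fp * fn) / denom
```

A perfect selection scores 1, a selection that is exactly inverted scores -1, and a selection no better than chance scores about 0. This makes the metric convenient when the number of false features varies, as it does across these figures.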