Hybrid (Penalized Regression and MLP) Models for Outcome Prediction in HDLSS Health Data

Mithra D K

TL;DR

This work tackles diabetes prediction in high-dimensional, low-sample-size (HDLSS) health data by pairing penalized regression with a compact MLP. It develops a hybrid framework that uses stable, sparse linear feature selection to constrain a neural network, addressing the overfitting and instability common in high-dimensional settings. In a two-stage evaluation on NHANES data, the refined pipeline achieves higher recall and F1 while keeping AUC close to that of strong linear baselines, and its feature importances are interpretable and dominated by cardiometabolic indicators. The approach offers a practical blueprint for combining linear stability with nonlinear modeling in real-world, high-dimensional health datasets.
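The two-stage idea summarized above can be sketched with scikit-learn. This is a minimal illustration under stated assumptions, not the paper's pipeline: the data are synthetic (generated with `make_classification`, not NHANES), and the penalty strength `C=0.2`, the cap of 20 selected features, and the 16-unit hidden layer are hypothetical values chosen only to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
import numpy as np

# Synthetic stand-in for an HDLSS table: 200 rows, 500 columns (assumption).
X, y = make_classification(
    n_samples=200, n_features=500, n_informative=10, n_redundant=0,
    n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

hybrid = Pipeline([
    ("scale", StandardScaler()),
    # Stage 1: rank features by |coefficient| of an L1-penalized logistic
    # regression and keep the top 20 (threshold=-inf disables the cutoff,
    # so max_features alone decides how many survive).
    ("select", SelectFromModel(
        LogisticRegression(penalty="l1", solver="liblinear", C=0.2),
        threshold=-np.inf, max_features=20,
    )),
    # Stage 2: a small MLP that only ever sees the selected features.
    ("mlp", MLPClassifier(hidden_layer_sizes=(16,), alpha=1e-2,
                          max_iter=2000, random_state=0)),
])

# Fitting the whole pipeline on the training split keeps feature selection
# from seeing the test rows.
hybrid.fit(X_train, y_train)

n_selected = int(hybrid.named_steps["select"].get_support().sum())
proba = hybrid.predict_proba(X_test)[:, 1]
pred = hybrid.predict(X_test)
metrics = {
    "auc": roc_auc_score(y_test, proba),
    "recall": recall_score(y_test, pred),
    "f1": f1_score(y_test, pred),
}
print(n_selected, metrics)
```

Placing the selector inside the `Pipeline` means it is refit on each training split during cross-validation, which matters in HDLSS settings where selecting features on the full data can inflate held-out scores.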

Abstract

I present an application of established machine learning techniques to NHANES health survey data for predicting diabetes status. I compare baseline models (logistic regression, random forest, XGBoost) with a hybrid approach that uses an XGBoost feature encoder and a lightweight multilayer perceptron (MLP) head. Experiments show the hybrid model attains improved AUC and balanced accuracy compared to baselines on the processed NHANES subset. I release code and reproducible scripts to encourage replication.

Paper Structure

This paper contains 20 sections, 2 figures, and 2 tables.

Figures (2)

  • Figure 1: ROC curves for the primary models in the refined pipeline.
  • Figure 2: Permutation importance for the final hybrid model.