Harnessing XGBoost for Robust Biomarker Selection of Obsessive-Compulsive Disorder (OCD) from Adolescent Brain Cognitive Development (ABCD) data

Xinyu Shen, Qimin Zhang, Huili Zheng, Weiwei Qi

TL;DR

This paper tackles biomarker discovery for obsessive-compulsive disorder (OCD) using high-dimensional, highly correlated neuroimaging features from the ABCD cohort. It uses a simulation framework that mimics ABCD-like multicollinearity and non-linear feature effects to benchmark logistic regression, elastic-net, random forest, and XGBoost. XGBoost emerges as the most robust approach, delivering strong performance in simulation and recovering all predictive features among the top five; applied to real ABCD data it identifies brain-network connectivity biomarkers involving the visual system. The paper also details XGBoost's objective and additive ensemble formulation, illustrating why regularization and out-of-core capabilities improve handling of large, correlated feature sets. These results support using XGBoost for high-dimensional neuroimaging biomarker discovery in adolescents and inform interpretations of OCD-related brain connectivity.
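For orientation, the objective and additive ensemble mentioned above are usually written as follows in the XGBoost literature. This is the standard textbook form, given here as a reference sketch; the paper's own three equations may use different notation or indexing.

```latex
% Additive ensemble: the prediction for sample i is a sum of K regression trees
\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \qquad f_k \in \mathcal{F}

% Regularized objective: training loss plus a complexity penalty per tree
\mathcal{L} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega(f_k)

% Penalty on one tree with T leaves and leaf-weight vector w
\Omega(f) = \gamma T + \tfrac{1}{2} \lambda \lVert w \rVert^{2}
```

Here l is a differentiable loss (logistic loss for a binary outcome), and the penalty weights γ and λ discourage deep trees and large leaf weights. That penalty is the regularization the summary credits with helping on large sets of correlated features.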

Abstract

This study evaluates supervised machine learning models for analyzing highly correlated neural signaling data from the Adolescent Brain Cognitive Development (ABCD) Study, with a focus on predicting obsessive-compulsive disorder (OCD) scales. We simulated a dataset that mimics the correlation structures commonly found in imaging data and evaluated logistic regression, elastic net, random forest, and XGBoost on their ability to handle multicollinearity and accurately identify predictive features. Our study aims to guide the selection of machine learning methods for neuroimaging data, highlighting models that best capture the underlying signal despite high feature correlation and that prioritize clinically relevant features associated with OCD.
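The simulation idea in the abstract can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the authors' code. The dimensions (10,000 rows, 40 features, 5 non-linear signal features) follow the Figure 1 caption. The equicorrelated structure, the correlation level `rho`, the squared-term signal, and the names `simulate` and `pearson` are hypothetical choices made here.

```python
import math
import random


def simulate(n_rows=10_000, n_features=40, n_signal=5, rho=0.8, seed=0):
    """Return (X, y): correlated Gaussian features and a binary outcome.

    Assumption: every pair of features has correlation `rho`, induced by a
    single shared latent factor. Assumption: the outcome depends only on the
    first `n_signal` features, through squared (non-linear) terms.
    """
    rng = random.Random(seed)
    load, noise = math.sqrt(rho), math.sqrt(1.0 - rho)
    X, y = [], []
    for _ in range(n_rows):
        shared = rng.gauss(0.0, 1.0)  # latent factor common to all features
        row = [load * shared + noise * rng.gauss(0.0, 1.0)
               for _ in range(n_features)]
        # Each feature has unit variance, so v*v - 1 is centered at zero.
        logit = sum(v * v - 1.0 for v in row[:n_signal])
        prob = 1.0 / (1.0 + math.exp(-logit))
        X.append(row)
        y.append(1 if rng.random() < prob else 0)
    return X, y


def pearson(a, b):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


if __name__ == "__main__":
    X, y = simulate()
    r = pearson([row[0] for row in X], [row[1] for row in X])
    print(f"rows={len(X)}, features={len(X[0])}, corr(f0, f1)={r:.2f}")
    print(f"positive rate={sum(y) / len(y):.2f}")
```

Data generated this way could be passed to each of the four classifiers. A method that copes well with multicollinearity would be expected to rank features 0 through 4 highest, which is the recovery criterion the summary describes.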

Paper Structure

This paper contains 9 sections, 3 equations, and 4 figures.

Figures (4)

  • Figure 1: Simulation flow chart. We simulated 10,000 rows with 40 features, 5 of which have a non-linear relationship to the outcome.
  • Figure 2: Simulation results
  • Figure 3: XGBoost as the best classifier: XGBoost applied to the training data
  • Figure 4: Feature importance from XGBoost on ABCD data