Harnessing XGBoost for Robust Biomarker Selection of Obsessive-Compulsive Disorder (OCD) from Adolescent Brain Cognitive Development (ABCD) data
Xinyu Shen, Qimin Zhang, Huili Zheng, Weiwei Qi
TL;DR
This paper tackles biomarker discovery for obsessive-compulsive disorder (OCD) using high-dimensional, highly correlated neuroimaging features from the ABCD cohort. It uses a simulation framework that mimics ABCD-like multicollinearity and non-linear feature effects to benchmark logistic regression, elastic-net, random forest, and XGBoost. XGBoost emerges as the most robust approach, delivering strong performance in simulation and recovering all predictive features among the top five; applied to real ABCD data it identifies brain-network connectivity biomarkers involving the visual system. The paper also details XGBoost's objective and additive ensemble formulation, illustrating why regularization and out-of-core capabilities improve handling of large, correlated feature sets. These results support using XGBoost for high-dimensional neuroimaging biomarker discovery in adolescents and inform interpretations of OCD-related brain connectivity.
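For orientation, the objective and additive ensemble mentioned above are usually written as follows. This is the standard textbook form of the XGBoost formulation, given as a sketch; the symbols ($K$ trees, $T$ leaves per tree, leaf weights $w$, penalties $\gamma$ and $\lambda$) are assumptions here and may not match the notation used in the body of the paper.

```latex
% Additive ensemble: the prediction is a sum of K regression trees
\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \qquad f_k \in \mathcal{F}

% Regularized objective: training loss plus a complexity penalty per tree
\mathcal{L} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega(f_k),
\qquad \Omega(f) = \gamma T + \tfrac{1}{2} \lambda \lVert w \rVert^{2}
```

The penalty $\Omega$ discourages additional leaves and large leaf weights, which is one reason a regularized ensemble can remain stable when many features carry overlapping information.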
Abstract
This study evaluates the performance of several supervised machine learning models on highly correlated neural signaling data from the Adolescent Brain Cognitive Development (ABCD) Study, with a focus on predicting obsessive-compulsive disorder (OCD) scales. We simulated a dataset that mimics the correlation structures commonly found in imaging data and evaluated logistic regression, elastic net, random forest, and XGBoost on their ability to handle multicollinearity and to identify the truly predictive features. Our study aims to guide the selection of machine learning methods for neuroimaging data by highlighting the models that best recover the underlying signal when features are highly correlated and that prioritize clinically relevant features associated with OCD.
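The kind of simulation described above can be sketched in a few lines. The example below is a minimal illustration rather than the authors' actual design: the block structure, the within-block correlation `rho`, the choice of two predictive features, and the effect sizes are all assumptions made for the sketch. It uses only the Python standard library so that it runs without extra dependencies.

```python
import math
import random


def simulate(n_samples=500, n_blocks=4, block_size=5, rho=0.8, seed=0):
    """Simulate block-correlated features and a binary outcome.

    Features in the same block share one latent factor, so each feature has
    unit variance and within-block pairs have correlation close to rho.
    All settings are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n_samples):
        row = []
        for _ in range(n_blocks):
            latent = rng.gauss(0.0, 1.0)  # shared signal for this block
            for _ in range(block_size):
                noise = rng.gauss(0.0, 1.0)
                row.append(math.sqrt(rho) * latent + math.sqrt(1.0 - rho) * noise)
        # Hypothetical outcome model: one linear effect (feature 0) and one
        # non-linear, squared effect (first feature of the second block).
        logit = 1.5 * row[0] + 1.0 * (row[block_size] ** 2 - 1.0)
        prob = 1.0 / (1.0 + math.exp(-logit))
        y.append(1 if rng.random() < prob else 0)
        X.append(row)
    return X, y


def column(X, j):
    """Return feature j as a list of values."""
    return [row[j] for row in X]


def pearson(a, b):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


X, y = simulate()
print("within-block correlation:", round(pearson(column(X, 0), column(X, 1)), 2))
print("between-block correlation:", round(pearson(column(X, 0), column(X, 5)), 2))
print("outcome prevalence:", sum(y) / len(y))
```

A dataset like this makes the benchmarking question concrete: features 1 to 4 are strongly correlated with the predictive feature 0 but have no direct effect on the outcome, so a method that ranks them highly is being misled by multicollinearity.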
