A Systematic Bias of Machine Learning Regression Models and Its Correction: an Application to Imaging-based Brain Age Prediction

Hwiyoung Lee, Shuo Chen

TL;DR

This paper proposes a general constrained optimization approach designed to correct the systematic prediction bias of machine learning regression in neuroimaging-based brain age calculation, yielding unbiased predictions of brain age.

Abstract

Machine learning models for continuous outcomes often yield systematically biased predictions, particularly for values that deviate substantially from the mean. Specifically, predictions for large-valued outcomes tend to be negatively biased (underestimating actual values), while those for small-valued outcomes tend to be positively biased (overestimating actual values). We refer to this linear, central-tendency-warped bias as the "systematic bias of machine learning regression". In this paper, we first demonstrate that this systematic prediction bias persists across various machine learning regression models, and then examine its theoretical underpinnings. To address this issue, we propose a general constrained optimization approach designed to correct this bias and develop computationally efficient implementation algorithms. Simulation results indicate that our correction method effectively eliminates the bias from the predicted outcomes. We apply the proposed approach to the prediction of brain age using neuroimaging data. In comparison to competing machine learning regression models, our method effectively addresses the longstanding issue of "systematic bias of machine learning regression" in neuroimaging-based brain age calculation, yielding unbiased predictions of brain age.

Paper Structure (9 sections, 2 theorems, 16 equations, 3 figures, 3 tables, 2 algorithms)

This paper contains 9 sections, 2 theorems, 16 equations, 3 figures, 3 tables, 2 algorithms.

Key Result

Proposition 1

Consider the outcome $Y$ with a mean of zero and variance of $\sigma^2$, $\ddot{\bold{y}}_i$ as the unbiased prediction of $\bold{y}_i$ ($i = 1, \ldots, n$), and the systematically biased machine learning regression prediction $\tilde{\bold{y}}_i = \ddot{\bold{y}}_i - c\,\ddot{\bold{y}}_i$, where $0<c<1$. Then for

Figures (3)

  • Figure 1: The systematic bias of machine learning regression models. The simulated training and testing sets each include $n=1,000$ observations of $p=200$ features. Six machine learning methods (shown in grey panels) are used: Kernel Ridge Regression (KRR), LASSO Regression, XGBoost, Random Forest, Neural Network, and Support Vector Regression (SVR). In each scatter plot, each dot represents a pair of predicted outcome $\widehat{\bold{y}}_i$ and true outcome ${\bold{y}}_i$, and the solid line is the regression line of $\widehat{\bold{y}}_i$ on ${\bold{y}}_i$. The dotted line is the identity line (slope 1), on which $\widehat{\bold{y}}_i$ equals ${\bold{y}}_i$ exactly. The machine learning regression is biased when the solid line deviates from the dotted line. All six machine learning models in this simulation analysis demonstrate a systematic bias.
  • Figure 2: Using the simulation setup described in the Introduction, we conducted 500 replications and display the regression line between the predicted and true response for each replication; each line is shown in gray. The black dashed line is the reference line with a slope of 1 (i.e., $\widehat{\bold{y}}=\bold{y}$), serving as a benchmark for ideal predictions. The methods with constraints (shown in orange panels) correct the prediction bias, whereas the unconstrained versions display a clear systematic bias in predictions.
  • Figure 3: This scatter plot, representing one result from 100 replications, shows the observed response $\bold{y}_i$ ($x$-axis) and the residuals $\widehat{\bold{y}}_i-\bold{y}_i$ ($y$-axis) from various methods on the testing set. The dashed reference line marks a residual of zero. The residuals from the proposed methods (KRR-UP and LASSO-UP) in the orange panels are randomly distributed, regardless of the value of the response variable. In contrast, the other machine learning regression models in the gray panels exhibit a pattern in their errors, with positive errors for low response values and negative errors for high response values.
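
The bias pattern described in the Figure 1 and Figure 3 captions can be reproduced in a few lines. The following is a minimal sketch, not the paper's simulation: only $n=1,000$ and $p=200$ come from the Figure 1 caption, while the linear data-generating model, the noise level, the ridge penalty `lam`, and all variable names are assumptions made for illustration. Plain ridge regression stands in for the six methods in the figure.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n, p = 1000, 200                # sample size and feature count from the Figure 1 caption

# Hypothetical data-generating model (an assumption, not the paper's setup):
# a linear signal plus unit-variance Gaussian noise.
beta = rng.normal(size=p) / np.sqrt(p)
X_train = rng.normal(size=(n, p))
X_test = rng.normal(size=(n, p))
y_train = X_train @ beta + rng.normal(size=n)
y_test = X_test @ beta + rng.normal(size=n)

# Closed-form ridge regression; lam is an arbitrary illustrative penalty.
lam = 10.0
beta_hat = np.linalg.solve(X_train.T @ X_train + lam * np.eye(p),
                           X_train.T @ y_train)
y_pred = X_test @ beta_hat

# Slope of the least-squares line of predictions on true outcomes
# (the analogue of the solid line in Figure 1).
slope = np.polyfit(y_test, y_pred, 1)[0]

# Slope of the residuals against true outcomes
# (the analogue of the error pattern in the gray panels of Figure 3).
resid_slope = np.polyfit(y_test, y_pred - y_test, 1)[0]

print(f"slope of prediction on truth: {slope:.3f}")
print(f"slope of residual on truth:   {resid_slope:.3f}")
```

A prediction-on-truth slope below 1 means large outcomes are underestimated and small outcomes overestimated on average. The residual slope equals that slope minus 1, so it is negative whenever the first slope is below 1.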

Theorems & Definitions (2)

  • Proposition 1
  • Theorem 1
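
The shrinkage form in Proposition 1 already implies the direction of the bias described in the abstract. The short calculation below uses the proposition's notation and assumes that "unbiased" means $\mathbb{E}[\ddot{\bold{y}}_i \mid \bold{y}_i] = \bold{y}_i$. That reading is an interpretation made for illustration, and the result is not the conclusion of the proposition, which is truncated above.

$$
\mathbb{E}\big[\tilde{\bold{y}}_i - \bold{y}_i \mid \bold{y}_i\big]
= (1-c)\,\mathbb{E}\big[\ddot{\bold{y}}_i \mid \bold{y}_i\big] - \bold{y}_i
= -c\,\bold{y}_i, \qquad 0<c<1.
$$

Since $Y$ has mean zero, outcomes above the mean ($\bold{y}_i>0$) are underestimated on average and outcomes below the mean ($\bold{y}_i<0$) are overestimated, with an expected error that grows linearly in $|\bold{y}_i|$.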