
Statistical Agnostic Regression: a machine learning method to validate regression models

Juan M Gorriz, J. Ramirez, F. Segovia, F. J. Martinez-Murcia, C. Jiménez-Mesa, J. Suckling

TL;DR

This work tackles the lack of formal statistical significance in ML-based regression by introducing Statistical Agnostic Regression (SAR), a non-parametric test grounded in concentration inequalities that assesses the evidence for a linear relationship with a confidence level of at least $1-\eta$. SAR combines a PAC-Bayesian dropout-based risk bound with a worst-case analysis, yielding a critical threshold $\gamma$ and enabling permutation-based p-values to decide on linearity, while also providing extensions to handle heteroscedasticity via the Breusch-Pagan test. Through Gaussian, non-Gaussian, heteroscedastic, and real datasets (e.g., Cancer and ADNI), the authors show SAR often aligns with OLS in well-behaved settings and offers more robust control of false positives than standard CV-based ML validation. The framework supports combining SAR with classical tests to strengthen inference and emphasizes cautious interpretation in non-ideal data, advancing reliable regression analysis in data-rich scientific applications.
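
To make the phrase "confidence level of at least $1-\eta$" concrete, the sketch below computes a worst-case risk from a concentration inequality. It is an illustration only: it uses the classical two-sided Hoeffding inequality for a fixed predictor with a bounded loss, not the PAC-Bayesian dropout-based bound that SAR relies on, and the function names `hoeffding_margin` and `worst_case_risk` are hypothetical.

```python
import math


def hoeffding_margin(n, eta, loss_range=1.0):
    """Deviation t such that, with probability >= 1 - eta,
    |true risk - empirical risk| <= t.

    Assumes a FIXED predictor (not chosen using the same sample), n i.i.d.
    samples, and a loss confined to an interval of width loss_range.
    This is plain Hoeffding, used here as a stand-in for the paper's bound.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 < eta < 1.0:
        raise ValueError("eta must lie strictly between 0 and 1")
    return loss_range * math.sqrt(math.log(2.0 / eta) / (2.0 * n))


def worst_case_risk(empirical_risk, n, eta, loss_range=1.0):
    """Upper end of the confidence interval for the true risk."""
    return empirical_risk + hoeffding_margin(n, eta, loss_range)


if __name__ == "__main__":
    # The margin shrinks roughly as 1/sqrt(n) as the sample grows.
    for n in (50, 200, 800):
        print(n, worst_case_risk(0.10, n, eta=0.05))
```

A test in this spirit would compare such a worst-case risk against a critical threshold; how SAR derives its threshold $\gamma$ is specified in the paper and is not reproduced here.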

Abstract

Regression analysis is a central topic in statistical modeling, aimed at estimating the relationships between a dependent variable, commonly referred to as the response variable, and one or more independent variables, i.e., explanatory variables. Linear regression is by far the most popular method for performing this task in various fields of research, such as data integration and predictive modeling when combining information from multiple sources. Classical methods for solving linear regression problems, such as Ordinary Least Squares (OLS), Ridge, or Lasso regressions, often form the foundation for more advanced machine learning (ML) techniques, which have been successfully applied, though without a formal definition of statistical significance. At most, permutation analyses or analyses based on empirical measures (e.g., residuals or accuracy) have been conducted, leveraging the greater sensitivity of ML estimations for detection. In this paper, we introduce Statistical Agnostic Regression (SAR) for evaluating the statistical significance of ML-based linear regression models. This is achieved by analyzing concentration inequalities of the actual risk (expected loss) and considering the worst-case scenario. To this end, we define a threshold that ensures there is sufficient evidence, with a probability of at least $1-\eta$, to conclude the existence of a linear relationship in the population between the explanatory (feature) and the response (label) variables. Simulations demonstrate the ability of the proposed agnostic (non-parametric) test to provide an analysis of variance similar to the classical multivariate $F$-test for the slope parameter, without relying on the underlying assumptions of classical methods. Moreover, the residuals computed from this method represent a trade-off between those obtained from ML approaches and the classical OLS.
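
The permutation analyses mentioned above can be sketched as follows. This is a minimal, generic permutation test on the empirical risk (mean squared error) of a one-variable least-squares fit, not the SAR procedure itself; the helper names `fit_line`, `empirical_mse`, and `permutation_p_value` are hypothetical, and the choice of 500 permutations is an arbitrary assumption.

```python
import random


def fit_line(x, y):
    """Ordinary least-squares slope and intercept for one explanatory variable."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def empirical_mse(x, y):
    """Empirical risk (mean squared error) of the fitted line on its own sample."""
    slope, intercept = fit_line(x, y)
    return sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y)) / len(x)


def permutation_p_value(x, y, n_perm=500, seed=0):
    """Share of label shufflings whose fit is at least as good as the observed one.

    Shuffling y breaks any link between x and y, so the shuffled risks
    approximate the distribution of the empirical risk under "no relationship".
    """
    rng = random.Random(seed)
    observed = empirical_mse(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if empirical_mse(x, shuffled) <= observed:
            hits += 1
    # The +1 terms count the observed labelling itself, so p is never exactly 0.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    rng = random.Random(1)
    x = [rng.gauss(0.0, 1.0) for _ in range(60)]
    y = [2.0 * xi + rng.gauss(0.0, 1.0) for xi in x]
    print("p-value:", permutation_p_value(x, y))
```

A small p-value indicates that the observed fit is better than almost all fits to shuffled labels. Such a test is purely empirical; the abstract contrasts it with SAR, which instead bounds the actual risk in the worst case.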

Paper Structure (24 sections, 18 equations, 18 figures, 1 table)

Figures (18)

  • Figure 1: In this 2D example of regression fitting, we explore effects ranging from large to very small. Observe the flatness of the linear functions in subfigure \ref{fig:1b}. In the regression examples of subfigures \ref{fig:1a} and \ref{fig:1b}, we present several representations: y vs. x; the result of regression; y vs. $\hat{y}$; and $\mathcal{L}$ vs. (y, $\hat{y}$). In figures \ref{fig:1c} and \ref{fig:1e}, we plot the empirical losses for all the methods using uncorrelated data within a mesh grid and compare them with the theoretical value. In the middle figure, we demonstrate how the theoretical losses under $H_0$ for all the tested methods envelop the empirical losses, except for the case of uncorrelated data where the correlation level is equal to zero.
  • Figure 2: Data transformed by rotation and scaling with a non-diagonal covariance matrix, assuming a Gaussian distribution.
  • Figure 3: Gaussian data transformed by rotation and scaling, along with cluster pruning. Colors simply indicate the applied transformations to the data and identify the clusters that were removed.
  • Figure 4: Dataset with heteroscedasticity and increasing sample size in figure \ref{fig:5a}, and SAR and F tests on linearity in figure \ref{fig:5b}. The assumptions needed to perform the F-test are not fulfilled, as shown in the Q-Q plots in figure \ref{fig:5c} and the residuals vs. explanatory variable plot in figure \ref{fig:5c}.
  • Figure 5: ADNI dataset with multiple predictors (6) in figure \ref{fig:ADNIa}. The first column represents the MMSE as the observable variable. In figure \ref{fig:ADNIb} we show the Q-Q plot following the analysis given in the reference provided in the text. The assumptions needed to perform the F-test are not fulfilled, as shown in the Q-Q plots in figure \ref{fig:6b}.
  • ...and 13 more figures