Valid Inference for Machine Learning Model Parameters

Neil Dey; Jonathan P. Williams

TL;DR

This work tackles uncertainty quantification for machine learning model parameters, specifically the risk minimizer $\theta_0$, using only training data and weak distributional assumptions. Its inferential framework relies on a uniform convergence property and shows that the set of $\varepsilon$-almost empirical risk minimizers (ERMs), $\widehat{\Theta}_S^\varepsilon$, is a finite-sample, $1-\alpha$ level confidence set for $\theta_0$, with an extension to noncompact parameter spaces through neighborhoods $\Theta_0^\delta$. By viewing these confidence sets as random sets, the authors adopt imprecise-probability concepts and develop belief and plausibility functions that enable region-level inference; bootstrapping provides practical approximations to the distribution of these sets and yields asymptotically valid plausibilities and p-values for hypotheses about $\theta_0$. A comparison with Generalized Inferential Models shows that the proposed approach delivers finite-sample guarantees and explicit power behavior while requiring weaker modeling assumptions. Overall, the framework offers principled, region-specific hypothesis testing and tuning-parameter inference for ML models without relying on strong population-level distributional information, and it applies to any model that satisfies uniform convergence.
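
To make the $\varepsilon$-almost ERM set concrete, here is a minimal Python sketch under stated assumptions. It uses a finite grid of $K$ candidate parameters with 0-1 loss, for which Hoeffding's inequality plus a union bound gives the uniform convergence function $f(\varepsilon, \alpha) = \log(2K/\alpha)/(2\varepsilon^2)$. That choice of $f$, the toy threshold-classifier data, and the helper names `almost_erm_set` and `hoeffding_eps` are illustrative assumptions, not taken from the paper.

```python
import math
import numpy as np

def almost_erm_set(losses, eps):
    """Indices of candidates whose empirical risk is within eps of the minimum.

    losses: (m, K) array, losses[i, k] = loss of candidate k on example i.
    """
    emp_risk = losses.mean(axis=0)
    return np.flatnonzero(emp_risk <= emp_risk.min() + eps)

def hoeffding_eps(m, n_candidates, alpha):
    """Smallest eps with m >= f(eps/2, alpha) for a finite class, loss in [0, 1].

    ASSUMPTION: f(e, a) = log(2K/a) / (2 e^2) (Hoeffding + union bound).
    """
    return 2.0 * math.sqrt(math.log(2.0 * n_candidates / alpha) / (2.0 * m))

# Toy data (hypothetical): threshold classifiers h_t(x) = 1[x >= t],
# true threshold 0.5, labels flipped with probability 0.1.
rng = np.random.default_rng(0)
m = 2000
x = rng.uniform(0, 1, size=m)
y = (x >= 0.5).astype(int)
y = np.where(rng.uniform(size=m) < 0.1, 1 - y, y)

thresholds = np.linspace(0, 1, 21)                      # candidate parameters
preds = (x[:, None] >= thresholds[None, :]).astype(int)  # shape (m, K)
losses = (preds != y[:, None]).astype(float)             # 0-1 loss per example

eps = hoeffding_eps(m, len(thresholds), alpha=0.05)
conf_set = thresholds[almost_erm_set(losses, eps)]
print(f"eps = {eps:.3f}; thresholds in the confidence set: {conf_set}")
```

Under these assumptions the returned thresholds form a 95% confidence set for the best threshold on the grid; richer model classes would need their own uniform convergence function.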

Abstract

The parameters of a machine learning model are typically learned by minimizing a loss function on a set of training data. However, this can come with the risk of overtraining; in order for the model to generalize well, it is of great importance that we are able to find the optimal parameter for the model on the entire population -- not only on the given training sample. In this paper, we construct valid confidence sets for this optimal parameter of a machine learning model, which can be generated using only the training data without any knowledge of the population. We then show that studying the distribution of this confidence set allows us to assign a notion of confidence to arbitrary regions of the parameter space, and we demonstrate that this distribution can be well-approximated using bootstrapping techniques.
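
The bootstrap idea in the last sentence can be sketched as follows. The sketch treats the confidence set as a random set and uses the standard random-set definitions: the belief in a region $A$ is the probability that the set is contained in $A$, and the plausibility is the probability that the set intersects $A$. The paper's exact definitions and resampling scheme may differ; the function name `bootstrap_bel_pl`, the fixed `eps=0.05`, and the toy data are hypothetical choices made only for illustration.

```python
import numpy as np

def almost_erm_mask(losses, eps):
    """Boolean mask of candidates whose empirical risk is within eps of the minimum."""
    emp_risk = losses.mean(axis=0)
    return emp_risk <= emp_risk.min() + eps

def bootstrap_bel_pl(losses, eps, region, n_boot=500, seed=0):
    """Bootstrap estimates of (belief, plausibility) for a region of candidates.

    losses: (m, K) per-example losses; region: length-K boolean mask.
    ASSUMPTION: belief = P(set is a subset of region),
                plausibility = P(set intersects region).
    """
    rng = np.random.default_rng(seed)
    m = losses.shape[0]
    contained = 0
    hits = 0
    for _ in range(n_boot):
        idx = rng.integers(0, m, size=m)          # resample rows with replacement
        mask = almost_erm_mask(losses[idx], eps)  # never empty: the minimizer is in it
        contained += bool(np.all(region[mask]))   # resampled set inside the region
        hits += bool(np.any(region[mask]))        # resampled set meets the region
    return contained / n_boot, hits / n_boot

# Toy data (hypothetical): noisy threshold classification with true threshold 0.5.
rng = np.random.default_rng(1)
m = 500
x = rng.uniform(0, 1, size=m)
y = (x >= 0.5).astype(int)
y = np.where(rng.uniform(size=m) < 0.1, 1 - y, y)
thresholds = np.linspace(0, 1, 21)
preds = (x[:, None] >= thresholds[None, :]).astype(int)
losses = (preds != y[:, None]).astype(float)

region = np.abs(thresholds - 0.5) <= 0.2          # hypothesis: threshold in [0.3, 0.7]
bel, pl = bootstrap_bel_pl(losses, eps=0.05, region=region)
print(f"belief ~ {bel:.3f}, plausibility ~ {pl:.3f}")
```

Because each resampled set is nonempty, the belief estimate can never exceed the plausibility estimate, which matches the ordering expected of a belief and plausibility pair.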

Paper Structure (10 sections, 10 theorems, 96 equations, 6 figures)

Introduction
The Supervised Learning Problem
An Inferential Framework for Machine Learning
Examples
Efficient Assignment of Confidence
Random Sets and Imprecise Probability
Validity of ML Models
Bootstrapping Belief and Plausibility
Comparison to Generalized Inferential Models
Concluding Remarks and Future Work

Key Result

Theorem 8

Let $(\mathcal{H}, L)$ have uniform convergence function $f$. Suppose that the risk minimizer $\theta_0$ exists. Then $\widehat{\Theta}_S^\varepsilon$ is a $1-\alpha$ level confidence set for $\theta_0$ if $m\geq f(\varepsilon/2, \alpha)$.
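
The factor $\varepsilon/2$ can be motivated by a short argument, sketched here under two assumptions that this summary does not spell out: that $m \geq f(\varepsilon', \alpha)$ guarantees $\sup_{\theta} |L_S(\theta) - L_{\mathcal{D}}(\theta)| \leq \varepsilon'$ with probability at least $1-\alpha$, and that $\widehat{\Theta}_S^\varepsilon = \{\theta : L_S(\theta) \leq \inf_{\vartheta} L_S(\vartheta) + \varepsilon\}$. The symbols $L_S$ (empirical risk) and $L_{\mathcal{D}}$ (population risk) are notation introduced for this sketch. On the event where uniform convergence holds with $\varepsilon' = \varepsilon/2$, every $\theta \in \Theta$ satisfies

$$
L_S(\theta_0) \;\leq\; L_{\mathcal{D}}(\theta_0) + \frac{\varepsilon}{2} \;\leq\; L_{\mathcal{D}}(\theta) + \frac{\varepsilon}{2} \;\leq\; L_S(\theta) + \varepsilon ,
$$

where the middle inequality uses that $\theta_0$ minimizes $L_{\mathcal{D}}$. Taking the infimum over $\theta$ gives $L_S(\theta_0) \leq \inf_{\theta} L_S(\theta) + \varepsilon$, so $\theta_0 \in \widehat{\Theta}_S^\varepsilon$ with probability at least $1-\alpha$. One $\varepsilon/2$ is spent moving from empirical to population risk at $\theta_0$, and the other moving back at an arbitrary $\theta$.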

Figures (6)

Figure 1: To quantify our uncertainty in the ERM $\widehat{\theta}_S$, we look at the size of the $\varepsilon$-neighborhood $\widehat{\Theta}_S^\varepsilon$ around $\widehat{\theta}_S$ that is intended to cover the risk minimizer $\theta_0$.
Figure 2: Illustrations of the desired behavior of $\widehat{\Theta}_S^\varepsilon$ when the parameter space $\Theta$ is not compact. In (a), the risk minimizer $\theta_0$ and the ERM $\widehat{\theta}_S$ both exist; $\widehat{\Theta}_S^\varepsilon$ thus includes $\theta_0$. In (b), neither $\theta_0$ nor $\widehat{\theta}_S$ exists in the parameter space; $\widehat{\Theta}_S^\varepsilon$ instead contains the closed ball $\Theta_0^\delta$.
Figure 3: The polynomials induced by the risk minimizer $\theta_0$ and the ERM $\widehat{\theta}_S$ at sample sizes $m=3$, $m=4$, and $m=15$.
Figure 4: When inverting $\widehat{\Theta}_S^{\varepsilon(m, \alpha)}$ to determine our confidence in arbitrary regions of the parameter space, both regions $A$ and $B$ are assigned the same confidence $1-\alpha$. However, since $B \subseteq A$, it is clear that we should have $\operatorname{Conf}(B) \leq \operatorname{Conf}(A)$.
Figure 5: Estimates for the plausibility of the set $\{\beta \,:\, \lVert\beta\rVert_1 \leq t'\}$ at significance level $0.05$ for different tuning parameters $t'$. A vertical line is plotted at the $t'$ that maintains plausibility at least $0.95$. Plausibilities were estimated via Monte Carlo simulation with $10000$ replicates.
...and 1 more figure

Theorems & Definitions (30)

Definition 1
Definition 2
Definition 3
Definition 4
Definition 5
Definition 6
Definition 7
Theorem 8
Theorem 9
Theorem 10
...and 20 more
