Optimal and Provable Calibration in High-Dimensional Binary Classification: Angular Calibration and Platt Scaling

Yufan Li, Pragya Sur

TL;DR

This work develops a provably well-calibrated prediction framework for high-dimensional binary classification with Gaussian features. It introduces angular calibration, which interpolates between the informative logit and Gaussian noise according to the angle between the estimated and true weight vectors, and proves both calibration and Bregman-optimality in the proportional regime where $n/d\to c \in (0,\infty)$. It further shows that Platt scaling converges to the angular predictor under suitable conditions, giving a principled high-dimensional guarantee for a widely used method. The alignment angle can be estimated consistently from observable quantities, which makes the approach practical. Numerical experiments support the theory, showing calibration improvements and robustness in simulations and semi-real tasks; extensions to non-Gaussian designs are discussed as future work.

Abstract

We study the fundamental problem of calibrating a linear binary classifier of the form $\sigma(\hat{w}^\top x)$, where the feature vector $x$ is Gaussian, $\sigma$ is a link function, and $\hat{w}$ is an estimator of the true linear weight $w_\star$. By interpolating with a noninformative $\textit{chance classifier}$, we construct a well-calibrated predictor whose interpolation weight depends on the angle $\angle(\hat{w}, w_\star)$ between the estimator $\hat{w}$ and the true linear weight $w_\star$. We establish that this angular calibration approach is provably well-calibrated in a high-dimensional regime where the number of samples and features both diverge at a comparable rate. The angle $\angle(\hat{w}, w_\star)$ can be consistently estimated. Furthermore, the resulting predictor is uniquely $\textit{Bregman-optimal}$, minimizing the Bregman divergence to the true label distribution within a suitable class of calibrated predictors. Our work is the first to provide a calibration strategy that satisfies both calibration and optimality properties provably in high dimensions. Additionally, we identify conditions under which a classical Platt-scaling predictor converges to our Bregman-optimal calibrated solution. Thus, Platt scaling also inherits these desirable properties provably in high dimensions.
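
One way to make "interpolating with a chance classifier" concrete is Gaussian conditioning. The following is a sketch under stated assumptions, not a quotation of the paper's Definition 3.1: suppose $x \sim N(0,\Sigma)$ is independent of $\hat{w}$, write $\|w\|_\Sigma = \sqrt{w^\top \Sigma w}$, and let $\cos\theta = \hat{w}^\top \Sigma w_\star / (\|\hat{w}\|_\Sigma \|w_\star\|_\Sigma)$. Then, given $\hat{w}^\top x$, the true logit $w_\star^\top x$ is Gaussian with mean $\|w_\star\|_\Sigma \cos\theta \cdot \hat{w}^\top x / \|\hat{w}\|_\Sigma$ and variance $\|w_\star\|_\Sigma^2 \sin^2\theta$, so the conditional probability of a positive label is an average of $\sigma$ over Gaussian noise. The function names, the oracle inputs `w_star_norm` and `theta`, and the trapezoid quadrature are all illustrative choices; in practice these quantities would have to be estimated.

```python
import math

def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def gaussian_mean(g, n_grid=2001, z_max=8.0):
    """Approximate E[g(Z)] for Z ~ N(0, 1) with a trapezoid rule on [-z_max, z_max]."""
    h = 2.0 * z_max / (n_grid - 1)
    total = 0.0
    for i in range(n_grid):
        z = -z_max + i * h
        end_weight = 0.5 if i in (0, n_grid - 1) else 1.0
        total += end_weight * g(z) * math.exp(-0.5 * z * z)
    return total * h / math.sqrt(2.0 * math.pi)

def angular_predictor(logit, w_hat_norm, w_star_norm, theta, link=sigmoid):
    """E_Z[ link( w_star_norm * (cos(theta) * logit / w_hat_norm + sin(theta) * Z) ) ].

    logit       : the observed value of w_hat . x
    w_hat_norm  : Sigma-norm of w_hat (assumed known here)
    w_star_norm : Sigma-norm of w_star (an oracle input in this sketch)
    theta       : angle between w_hat and w_star (an oracle input in this sketch)
    """
    signal = math.cos(theta) * logit / w_hat_norm
    noise_scale = math.sin(theta)
    return gaussian_mean(lambda z: link(w_star_norm * (signal + noise_scale * z)))
```

At $\theta = 0$ the noise term vanishes and the sketch returns the rescaled logit pushed through the link; at $\theta = \pi/2$ the logit carries no information and, for the sigmoid link, the output is the constant $1/2$, which is the chance classifier.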

Paper Structure

This paper contains 20 sections, 12 theorems, 82 equations, 5 figures, 2 tables, 1 algorithm.

Key Result

Theorem 3.2

Assume the link function $\sigma$ is continuous. Then the predictor $\hat{f}_{\mathrm{ang}}$ of Definition 3.1 is well-calibrated as $d,n \rightarrow \infty$ with $n/d \rightarrow c \in (0,\infty)$. That is, for any $p$ in the range of $\sigma$, the conditional probability of a positive label given that the predictor, evaluated with $\hat{\theta}$, outputs $p$ converges to $p$ in probability, where $\hat{\theta}$ is a consistent estimator of $\theta_\star$ (cf. the paper's consistency proposition).
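
The calibration property in Theorem 3.2 says that among inputs receiving prediction $p$, the fraction of positive labels should approach $p$. On a finite sample this can only be checked approximately, and a common diagnostic is to bin the predictions and compare each bin's mean prediction with its empirical positive rate. The sketch below is that generic diagnostic, not a procedure taken from the paper; the function names and the equal-width binning are assumptions made for illustration.

```python
def reliability_table(probs, labels, n_bins=10):
    """Group predictions into equal-width bins on [0, 1].

    Returns one (mean predicted probability, fraction of positive labels, count)
    triple per non-empty bin.
    """
    bins = [[0.0, 0, 0] for _ in range(n_bins)]  # [sum of p, sum of y, count]
    for p, y in zip(probs, labels):
        k = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[k][0] += p
        bins[k][1] += y
        bins[k][2] += 1
    return [(sp / c, sy / c, c) for sp, sy, c in bins if c > 0]

def expected_calibration_error(probs, labels, n_bins=10):
    """Count-weighted average of |positive rate - mean prediction| over bins."""
    rows = reliability_table(probs, labels, n_bins)
    n = sum(c for _, _, c in rows)
    return sum(c / n * abs(freq - mean_p) for mean_p, freq, c in rows)
```

A well-calibrated predictor yields rows whose first two entries nearly coincide, so the summary error is close to zero up to sampling noise and binning effects.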

Figures (5)

  • Figure 1: Platt scaling of a logistic ridge predictor converges to the angular calibration predictor as the holdout set size increases. The plot is generated with Gaussian data with covariance $\Sigma=\frac{1}{d} \bar{\Sigma}$, where $\bar{\Sigma}_{kl}=0.5^{|k-l|}, \forall k,l \in \{1,...,d\}$, and a sigmoid link function, in a data-deficient setting where $n=1000, d=2000$. See the simulation section for more details.
  • Figure 2: Reliability plots for angular calibration and Platt scaling of a logistic ridge predictor. The left panel uses a small holdout set for Platt scaling with $n_{\mathrm{ho}}=100$; the right panel uses a large holdout set with $n_{\mathrm{ho}}=2000$. The plot is generated with Gaussian data with covariance $\Sigma=\frac{1}{d} \bar{\Sigma}$, where $\bar{\Sigma}_{kl}=0.5^{|k-l|}, \forall k,l \in \{1,...,d\}$, and a sigmoid link function, in a data-deficient setting where $n=1000, d=2000$. See the simulation section for more details.
  • Figure 3: Reproduces Figure 1 (third column) and Figure 2 (first two columns) for Rademacher entries. Upper row: we rerun the simulations of the simulation section with sub-Gaussian designs $W\Sigma^{1/2}$, where the $W_{ij}$ are sampled i.i.d. from the Rademacher distribution, taking values $+1,-1$ with equal probability. Bottom row: we replace the sigmoid link function of the simulation section with a clipped ReLU link function $\sigma(x)=\mathrm{clip}(3x+0.5)$, where $\mathrm{clip}(x)=x,\forall x\in [0,1]$, $\mathrm{clip}(x)=0, \forall x<0$ and $\mathrm{clip}(x)=1,\forall x>1$.
  • Figure 4: Reproduces Figure 1 (third column) and Figure 2 (first two columns) for uniform entries. Upper row: we rerun the simulations of the simulation section with non-Gaussian designs $W\Sigma^{1/2}$, where the $W_{ij}$ are sampled i.i.d. from the uniform distribution on the interval $[-\sqrt{12}/2, \sqrt{12}/2]$. Bottom row: we replace the sigmoid link function of the simulation section with a clipped ReLU link function $\sigma(x)=\mathrm{clip}(3x+0.5)$, where $\mathrm{clip}(x)=x,\forall x\in [0,1]$, $\mathrm{clip}(x)=0, \forall x<0$ and $\mathrm{clip}(x)=1,\forall x>1$.
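
The Gaussian design and the clipped link described in the captions can be sketched in a few lines. This is an illustration under assumptions, not the authors' simulation code: it covers only the Gaussian case, and it exploits the fact that $\bar{\Sigma}_{kl}=0.5^{|k-l|}$ is the covariance of a stationary AR(1) sequence, so a feature vector can be drawn by a recursion instead of factorizing a $d \times d$ matrix. The function names and the `rng` parameter are hypothetical.

```python
import math
import random

def clip_link(t):
    """Clipped link from the captions: sigma(t) = clip(3t + 0.5), clipped to [0, 1]."""
    return min(1.0, max(0.0, 3.0 * t + 0.5))

def sample_feature(d, rho=0.5, rng=random):
    """Draw x ~ N(0, Sigma) with Sigma = (1/d) * Sigma_bar, Sigma_bar[k][l] = rho**|k-l|.

    A stationary AR(1) recursion has exactly this covariance, so no matrix
    square root is needed in the Gaussian case.
    """
    x = [rng.gauss(0.0, 1.0)]
    innovation_sd = math.sqrt(1.0 - rho * rho)
    for _ in range(d - 1):
        x.append(rho * x[-1] + innovation_sd * rng.gauss(0.0, 1.0))
    scale = 1.0 / math.sqrt(d)  # the 1/d factor in Sigma
    return [scale * v for v in x]

def sample_label(x, w_star, link, rng=random):
    """Draw y ~ Bernoulli(link(w_star . x))."""
    p = link(sum(wi * xi for wi, xi in zip(w_star, x)))
    return 1 if rng.random() < p else 0
```

The Rademacher and uniform designs $W\Sigma^{1/2}$ of Figures 3 and 4 are not covered by the recursion, since they require an explicit square root of $\Sigma$ applied to non-Gaussian entries.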

Theorems & Definitions (21)

  • Definition 3.1
  • Theorem 3.2
  • Definition 4.1: Bregman Loss Functions
  • Theorem 4.2: Optimality of angular predictor
  • Theorem 5.1
  • Proposition 6.1
  • Corollary 6.2
  • Theorem A.1
  • Proof of \ref{mainThmPop}
  • Proof of \ref{mainThm}
  • ...and 11 more
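
Definition 4.1 is listed above only by title. For orientation, the standard Bregman divergence on probabilities is recalled here; the paper's own definition of Bregman loss functions may differ in notation or generality, so this should be read as background rather than as the paper's statement. For a strictly convex, differentiable $\phi$ on $[0,1]$,

$$D_\phi(p, q) = \phi(p) - \phi(q) - \phi'(q)\,(p - q), \qquad p, q \in [0,1].$$

Two familiar special cases: $\phi(t) = t^2$ gives the squared error $D_\phi(p,q) = (p-q)^2$, and the negative entropy $\phi(t) = t\log t + (1-t)\log(1-t)$ gives the Kullback-Leibler divergence between Bernoulli distributions,

$$D_\phi(p, q) = p \log\frac{p}{q} + (1-p)\log\frac{1-p}{1-q}.$$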