
Probabilistic Calibration by Design for Neural Network Regression

Victor Dheur, Souhaib Ben Taieb

TL;DR

This work introduces Quantile Recalibration Training, a novel end-to-end training procedure that integrates post-hoc calibration directly into the training process without additional parameters, and presents a unified algorithm that includes this method and other post-hoc and regularization methods as particular cases.

Abstract

Generating calibrated and sharp neural network predictive distributions for regression problems is essential for optimal decision-making in many real-world applications. To address the miscalibration of neural networks, various methods have been proposed, including post-hoc methods that adjust predictions after training and regularization methods that act during training. Although post-hoc methods have improved calibration more than regularization methods, the post-hoc step is completely independent of model training. We introduce a novel end-to-end model training procedure called Quantile Recalibration Training, which integrates post-hoc calibration directly into the training process without additional parameters. We also present a unified algorithm that includes our method and other post-hoc and regularization methods as particular cases. We demonstrate the performance of our method in a large-scale experiment involving 57 tabular regression datasets, showing improved predictive accuracy while maintaining calibration. We also conduct an ablation study to evaluate the significance of the different components of our proposed method, as well as an in-depth analysis of the impact of the base model and different hyperparameters on predictive accuracy.
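
To make the notion of a post-hoc step concrete, the following is a minimal sketch of generic quantile recalibration, the kind of after-training adjustment the abstract contrasts with regularization. It is not the authors' Quantile Recalibration Training procedure. The sketch assumes a hypothetical base model with a logistic predictive CDF (`base_cdf`), synthetic Gaussian data, and an empirical-CDF recalibration map; all function and variable names are illustrative and do not come from the paper.

```python
import numpy as np


def base_cdf(y, mu, scale):
    """CDF of a logistic predictive distribution (hypothetical stand-in for the base model)."""
    return 1.0 / (1.0 + np.exp(-(y - mu) / scale))


def fit_recalibration_map(pit_cal):
    """Return R(p), the empirical CDF of the calibration PIT values."""
    sorted_pit = np.sort(pit_cal)
    n = sorted_pit.size

    def recal_map(p):
        # Fraction of calibration PIT values that are <= p.
        return np.searchsorted(sorted_pit, p, side="right") / n

    return recal_map


def calibration_error(pit, n_levels=99):
    """Mean absolute gap between the empirical CDF of the PIT values and the uniform CDF."""
    levels = np.linspace(0.01, 0.99, n_levels)
    empirical = np.array([(pit <= p).mean() for p in levels])
    return float(np.abs(empirical - levels).mean())


rng = np.random.default_rng(0)

# Assumed synthetic setup: targets are N(0, 1), but the base model is overdispersed.
y_cal = rng.normal(0.0, 1.0, size=2000)
y_test = rng.normal(0.0, 1.0, size=2000)
mu, scale = 0.0, 2.0

# Probability integral transform (PIT): the predictive CDF evaluated at the observed target.
pit_cal = base_cdf(y_cal, mu, scale)
pit_test = base_cdf(y_test, mu, scale)

# Post-hoc step: fit the map on held-out calibration data, then compose it with the base CDF.
recal_map = fit_recalibration_map(pit_cal)
pit_test_recal = recal_map(pit_test)

err_before = calibration_error(pit_test)
err_after = calibration_error(pit_test_recal)
print(f"calibration error before: {err_before:.3f}, after: {err_after:.3f}")
```

In this sketch the recalibration map is fitted only after the base model is fixed, which is the independence from training that the abstract points out. The empirical CDF used here is a step function, so an end-to-end variant would presumably need a differentiable map; that is an assumption about the design, not a statement of the paper's implementation.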

Paper Structure

This paper contains 42 sections, 16 equations, 28 figures, 3 tables, 2 algorithms.

Figures (28)

  • Figure 1: Comparison of QRT and BASE according to different metrics computed on the validation dataset. The first three columns show the decomposition of the NLL of QRT, where $\alpha = 1$ for QRT and $\alpha = 0$ for BASE. Each row represents one dataset and each column one metric. The training curves are averaged over 5 runs and the shaded area corresponds to one standard error. The vertical bars represent the epoch selected by early stopping (the one that minimizes the validation NLL), averaged over the 5 runs. The horizontal bars represent the value of the metric at the selected epoch, averaged over the 5 runs.
  • Figure 2: Difference in test NLL between two post-hoc methods (QRTC and QRC) and BASE, where negative values indicate an improvement compared to BASE, averaged over 5 runs with error bars corresponding to one standard error. We observe that QRTC achieves a lower NLL than BASE and QRC on most datasets. Note that, for BASE, $F_\theta$ is trained with a larger dataset that includes the calibration data of QRTC and QRC. The experimental setup is described in \ref{sec:experiments}.
  • Figure 3: Comparison of QRTC, QRC, QREGC and BASE, as detailed in \ref{sec:experiments}.
  • Figure 4: Comparison of QRTC, QRGC, QRIC, QRLC and BASE, as detailed in \ref{sec:ablation_study}.
  • Figure 5: Same setup as the main experiments (\ref{fig:some/without_discrete} in the main text), except that the underlying neural network produces a single Gaussian instead of a mixture of 3 Gaussians.
  • ...and 23 more figures