Distilling Calibration via Conformalized Credal Inference

Jiayi Huang, Sangwoo Park, Nicola Paoletti, Osvaldo Simeone

TL;DR

This work tackles reliable decision-making for edge AI under tight resource constraints by distilling calibration information from a high-capacity cloud model. It introduces Conformalized Distillation for Credal Inference (CD-CI), which forms credal sets $\Gamma(x)$ around the prediction of a small edge model using a divergence threshold computed offline, so that the cloud model’s predictive distribution $p^*(\cdot|x)$ lies within $\Gamma(x)$ with probability $1-\epsilon$. A single predictive distribution is then obtained from the credal set via an intersection-probability construction, offering a robust alternative to standard low-complexity Bayesian post-processing. On CIFAR-10 and SNLI, the approach improves the expected calibration error (ECE) with negligible accuracy loss. The method combines conformal prediction with imprecise probabilities and targets edge deployments whose computational budgets rule out full Bayesian ensembling, aligning edge-model outputs with the reliability of the cloud model while remaining efficient.

Abstract

Deploying artificial intelligence (AI) models on edge devices involves a delicate balance between meeting stringent complexity constraints, such as limited memory and energy resources, and ensuring reliable performance in sensitive decision-making tasks. One way to enhance reliability is through uncertainty quantification via Bayesian inference. This approach, however, typically necessitates maintaining and running multiple models in an ensemble, which may exceed the computational limits of edge devices. This paper introduces a low-complexity methodology to address this challenge by distilling calibration information from a more complex model. In an offline phase, predictive probabilities generated by a high-complexity cloud-based model are leveraged to determine a threshold based on the typical divergence between the cloud and edge models. At run time, this threshold is used to construct credal sets -- ranges of predictive probabilities that are guaranteed, with a user-selected confidence level, to include the predictions of the cloud model. The credal sets are obtained through thresholding of a divergence measure in the simplex of predictive probabilities. Experiments on visual and language tasks demonstrate that the proposed approach, termed Conformalized Distillation for Credal Inference (CD-CI), significantly improves calibration performance compared to low-complexity Bayesian methods, such as Laplace approximation, making it a practical and efficient solution for edge AI deployments.
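
The offline and run-time phases described in the abstract can be illustrated with a minimal sketch. It assumes a standard split-conformal quantile over per-example divergences and uses the KL divergence from the cloud distribution to the edge distribution; the direction of the divergence, the function names, and the quantile rule are assumptions made for illustration and may differ from the paper's exact construction.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) along the last axis; inputs are probability vectors."""
    p = np.clip(p, eps, 1.0)  # guard against log(0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1)

def calibrate_threshold(cloud_probs, edge_probs, epsilon):
    """Offline phase (assumed split-conformal rule).

    cloud_probs, edge_probs: arrays of shape (n, num_classes) holding the
    two models' predictive distributions on n calibration inputs.
    Returns the ceil((n + 1) * (1 - epsilon))-th smallest divergence.
    """
    scores = kl_divergence(cloud_probs, edge_probs)
    n = scores.shape[0]
    k = int(np.ceil((n + 1) * (1.0 - epsilon)))
    if k > n:
        # Too few calibration points for this epsilon: the set is the full simplex.
        return np.inf
    return np.sort(scores)[k - 1]

def in_credal_set(q, edge_prob, threshold):
    """Run-time phase: is candidate distribution q inside the credal set
    formed around the edge prediction edge_prob?"""
    return bool(kl_divergence(q, edge_prob) <= threshold)
```

At run time only the edge model and the scalar threshold are needed, which is what keeps the scheme low-complexity: the cloud model is queried only on the calibration inputs during the offline phase.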

Paper Structure (22 sections, 17 equations, 14 figures, 2 algorithms)

Figures (14)

  • Figure 1: Given an input $x$, the predictive distribution ideally coincides with that of a large-scale cloud-based model, $p^*(\cdot|x)$. In the setting studied in this work, a small-scale edge-based model produces a predictive distribution $p(\cdot|x)$ that deviates from the reference distribution $p^*(\cdot|x)$ and is thus uncalibrated. The proposed conformalized credal inference scheme post-processes the edge model output $p(\cdot|x)$ via a simple thresholding mechanism to produce a subset $\Gamma(x)$ of the simplex of predictive distributions that is guaranteed to contain the reference distribution $p^*(\cdot|x)$ with probability $1-\epsilon$. A final calibrated predictive distribution can be obtained via ensembling or other combining mechanisms.
  • Figure 2: Test input $x$, reference distribution $p^*(\cdot | x)$ from the large-scale model, and credal sets produced by CD-CI for small-scale models of different accuracy on the CIFAR-10 data set with classes $\{\text{airplane, automobile, bird}\}$, using the KL divergence in (\ref{eq:credal_set}) with target coverage rate $1-\epsilon = 0.9$. The large-model distribution $p^*(\cdot | x)$ is marked as a red point in the simplex.
  • Figure 3: Coverage and inefficiency versus the small-scale model's accuracy on the CIFAR-10 data set with classes $\{\text{airplane, automobile, bird}\}$, using the KL divergence in (\ref{eq:credal_set}) with target coverage rate $1-\epsilon = 0.9$. The accuracy of the large-scale model, a ResNet-18 network, is $95.58\%$, and the accuracy of the small-scale model, Mini-VGG-8, is controlled by training for different numbers of iterations.
  • Figure 4: ECE versus target coverage rate $1-\epsilon$ for different values of $\alpha$ for the $\alpha$-divergence used in (\ref{eq:credal_set}) on the CIFAR-10 data set with classes $\{\text{airplane, automobile, bird}\}$. The dashed lines report the ECE of the large-scale model predictive distribution $p^*(\cdot|x)$, of the small-scale model $p(\cdot|x)$, and of the Laplace approximation method $q^{\text{La}}(\cdot|x)$ in (\ref{eq:laplace_prob}).
  • Figure 5: Accuracy versus target coverage rate $1-\epsilon$ for different values of $\alpha$ for the $\alpha$-divergence used in (\ref{eq:credal_set}) on the CIFAR-10 data set with classes $\{\text{airplane, automobile, bird}\}$. The dashed lines report the accuracy of the large-scale model predictive distribution $p^*(\cdot|x)$ and of the small-scale model $p(\cdot|x)$. Applying the Laplace approximation as a post-processing step does not change the accuracy.
  • ...and 9 more figures
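
Figures 4 and 5 compare methods in terms of ECE and accuracy. As a point of reference, the sketch below computes the commonly used binned estimate of the expected calibration error from top-class confidences. The equal-width binning and the default of 15 bins are assumptions for illustration; the paper's exact evaluation protocol is not stated in this summary.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """Binned ECE: weighted average of |accuracy - confidence| over bins.

    probs:  array of shape (n, num_classes) with predictive distributions.
    labels: array of shape (n,) with integer class labels.
    n_bins: number of equal-width confidence bins (assumed value).
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    confidences = probs.max(axis=1)      # top-class probability
    predictions = probs.argmax(axis=1)   # predicted class
    correct = (predictions == labels).astype(float)

    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap   # weight by fraction of samples in bin
    return ece
```

A model whose confidence matches its empirical accuracy in every bin scores zero, while a model that is always fully confident and always wrong scores one.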