Exact characterization of ε-Safe Decision Regions for exponential family distributions and Multi Cost SVM approximation

Alberto Carlevaro, Teodoro Alamo, Fabrizio Dabbene, Maurizio Mongelli

TL;DR

This work defines $\varepsilon$-Safe Decision Regions ($\Phi_\varepsilon$) to provide probabilistic safety guarantees for binary classifiers. It proves an exact, data-driven boundary characterization for exponential-family distributions, with $\Phi_\varepsilon = \{\boldsymbol{x} : \Gamma(\boldsymbol{x}) \le \rho(p_S,\varepsilon)\}$ and $\rho(p_S,\varepsilon) = \ln \frac{p_S}{1-p_S} + \ln \frac{\varepsilon}{1-\varepsilon}$, while $\Gamma(\boldsymbol{x})$ depends only on the data through log-densities. To handle non-exponential or unbalanced data, it introduces Multi Cost SVM (MC-SVM), an ensemble SVM framework that yields a $p_S$-robust decision boundary, and provides multiple strategies to design the offset $b$ to enforce the desired safety level, including bias adjustment, adjustable classifiers, and conformal-prediction-based calibration. The Gaussian special case clarifies boundary geometry (hyperplane/ellipsoid/quadric) and the paper provides a practical, reproducible code path for empirical validation. Overall, the approach advances reliable AI by linking rigorous SDR theory with scalable, data-driven approximations suitable for imbalanced and real-world datasets.
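To make the threshold rule concrete, here is a minimal Python sketch of the membership test $\Gamma(\boldsymbol{x}) \le \rho(p_S,\varepsilon)$ for two Gaussian class densities. It is an illustration under stated assumptions, not the paper's code: it assumes $\Gamma(\boldsymbol{x}) = \ln f(\boldsymbol{x}|U) - \ln f(\boldsymbol{x}|S)$, which is consistent with the formula for $\rho$ above, and the means, covariances, and function names are hypothetical.

```python
import numpy as np

# Hypothetical class parameters, chosen only for illustration.
MEAN_S = np.array([0.0, 0.0])   # safe class mean
MEAN_U = np.array([3.0, 3.0])   # unsafe class mean
COV_S = np.eye(2)               # safe class covariance
COV_U = np.eye(2)               # unsafe class covariance


def rho(p_s, eps):
    """Threshold rho(p_S, eps) = ln(p_S / (1 - p_S)) + ln(eps / (1 - eps))."""
    return np.log(p_s / (1.0 - p_s)) + np.log(eps / (1.0 - eps))


def gaussian_logpdf(x, mean, cov):
    """Log-density of a multivariate Gaussian, written out with NumPy only."""
    diff = np.asarray(x, dtype=float) - np.asarray(mean, dtype=float)
    d = diff.size
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)


def gamma(x, mean_s=MEAN_S, cov_s=COV_S, mean_u=MEAN_U, cov_u=COV_U):
    """ASSUMPTION: Gamma(x) = ln f(x|U) - ln f(x|S)."""
    return gaussian_logpdf(x, mean_u, cov_u) - gaussian_logpdf(x, mean_s, cov_s)


def in_sdr(x, p_s, eps, mean_s=MEAN_S, cov_s=COV_S, mean_u=MEAN_U, cov_u=COV_U):
    """True when x satisfies Gamma(x) <= rho(p_S, eps)."""
    return gamma(x, mean_s, cov_s, mean_u, cov_u) <= rho(p_s, eps)


def posterior_unsafe(x, p_s, mean_s=MEAN_S, cov_s=COV_S, mean_u=MEAN_U, cov_u=COV_U):
    """Posterior probability of the unsafe class via Bayes' rule (cross-check)."""
    log_s = np.log(p_s) + gaussian_logpdf(x, mean_s, cov_s)
    log_u = np.log(1.0 - p_s) + gaussian_logpdf(x, mean_u, cov_u)
    return 1.0 / (1.0 + np.exp(log_s - log_u))


if __name__ == "__main__":
    for point in ([0.0, 0.0], [1.5, 1.5], [3.0, 3.0]):
        print(point, in_sdr(point, p_s=0.5, eps=0.05), posterior_unsafe(point, p_s=0.5))
```

With equal covariances, as in these hypothetical parameters, the quadratic terms cancel and $\Gamma$ is affine in $\boldsymbol{x}$, so the boundary is a hyperplane; different covariances would give a quadric instead.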

Abstract

Probabilistic guarantees on the prediction of data-driven classifiers are necessary to define models that can be considered reliable. This is a key requirement for modern machine learning, in which the quality of a system is measured in terms of trustworthiness, clearly dividing what is safe from what is unsafe. The spirit of this paper is exactly in this direction. First, we introduce a formal definition of ε-Safe Decision Region, a subset of the input space in which the prediction of a target (safe) class is probabilistically guaranteed. Second, we prove that, when data come from exponential family distributions, the form of such a region is analytically determined and controllable by design parameters, i.e. the probability of sampling the target class and the confidence on the prediction. However, data do not always follow an exponential family distribution. Motivated by this limitation, we developed Multi Cost SVM, an SVM-based algorithm that approximates the safe region and is also able to handle unbalanced data. The research is complemented by experiments and code available for reproducibility.

Paper Structure

This paper contains 17 sections, 2 theorems, 62 equations, 3 figures.

Key Result

Proposition 2

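Assume that $f(\boldsymbol{x}|S)$ and $f(\boldsymbol{x}|U)$ are of the exponential form (eq:exponential), and that the points $\boldsymbol{x}$ obey density (eq:f_x) for given $p_S\in (0,1)$. Then, for given risk level $\varepsilon\in(0,1)$, the $\varepsilon$-SDR can be written as $\Phi_\varepsilon = \{\boldsymbol{x} : \Gamma(\boldsymbol{x}) \le \rho(p_S,\varepsilon)\}$, where $\rho(p_S,\varepsilon) = \ln \frac{p_S}{1-p_S} + \ln \frac{\varepsilon}{1-\varepsilon}$ and $\Gamma(\boldsymbol{x})$ depends on the data only through the log-densities.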
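The threshold $\rho(p_S,\varepsilon) = \ln \frac{p_S}{1-p_S} + \ln \frac{\varepsilon}{1-\varepsilon}$ quoted in the TL;DR can be recovered from Bayes' rule. The sketch below rests on two assumptions that are not quoted from the paper: that the $\varepsilon$-SDR collects the points whose posterior probability of being unsafe is at most $\varepsilon$, and that $\Gamma(\boldsymbol{x}) = \ln f(\boldsymbol{x}|U) - \ln f(\boldsymbol{x}|S)$. Both are consistent with that formula for $\rho$.

```latex
\begin{align*}
\Pr\{U \mid \boldsymbol{x}\}
  = \frac{(1-p_S)\, f(\boldsymbol{x}|U)}
         {p_S\, f(\boldsymbol{x}|S) + (1-p_S)\, f(\boldsymbol{x}|U)}
  &\le \varepsilon \\
\Longleftrightarrow\quad
(1-\varepsilon)(1-p_S)\, f(\boldsymbol{x}|U)
  &\le \varepsilon\, p_S\, f(\boldsymbol{x}|S) \\
\Longleftrightarrow\quad
\underbrace{\ln f(\boldsymbol{x}|U) - \ln f(\boldsymbol{x}|S)}_{\Gamma(\boldsymbol{x})}
  &\le \underbrace{\ln\frac{p_S}{1-p_S} + \ln\frac{\varepsilon}{1-\varepsilon}}_{\rho(p_S,\varepsilon)}
\end{align*}
```

Setting $\varepsilon = 0.5$ makes the second term of $\rho$ vanish, and setting $p_S = 0.5$ makes the first term vanish, which matches the values $\rho_\varepsilon = 0$ and $\rho_{p_S} = 0$ reported in the caption of Figure 1.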

Figures (3)

  • Figure 1: $\varepsilon$-SDR for Gaussian distribution. In the first row, the $\varepsilon$-SDRs at fixed $\varepsilon = 0.5 \; (\rho_\varepsilon = 0)$ are plotted as the probability of sampling safe points varies. In the second row, the sampling probability is instead fixed at $p_S = 0.5 \; (\rho_{p_S} = 0)$ and the confidence varies. Below them, the function $\Gamma(\boldsymbol{x})$ is plotted together with its level sets, which correspond to different decision boundaries, i.e. different $\varepsilon$-SDRs.
  • Figure 2: Behavior of the classification hyperplane as $\tau$ varies. The more values of $\tau$ are used, the more similar the resulting $\boldsymbol{w}$ become. The process "saturates" at $\#\mathcal{T} = 10$.
  • Figure 3: Classification performance (for different kernels) after applying algorithm \ref{eq: opt_prbl_primal} for the choice of an independent $\boldsymbol{w}$ and algorithm \ref{eq:epsilonSVM} for the choice of $b$ such that the false positive ratio is bounded by $\varepsilon = 0.05$. Figure \ref{fig:exact} is the exact Gaussian $\varepsilon$-SDR.
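The caption of Figure 3 mentions choosing the offset $b$ so that the false positive ratio stays below $\varepsilon = 0.05$. The sketch below illustrates that idea with a plain empirical-quantile rule on held-out unsafe points. It is not the paper's algorithm, whose details are not given here. It assumes a hypothetical convention in which a point is predicted safe when its score exceeds a threshold `t`, so that a false positive is an unsafe point predicted safe; the function names and the synthetic scores are likewise illustrative.

```python
import numpy as np


def calibrate_threshold(unsafe_scores, eps):
    """Pick a threshold t so that at most a fraction eps of the unsafe
    calibration scores lie strictly above t.

    ASSUMPTION: a point is predicted safe when its score is > t.
    """
    scores = np.sort(np.asarray(unsafe_scores, dtype=float))[::-1]  # descending
    n = scores.size
    k = int(np.floor(eps * n))  # number of false positives we tolerate
    # The (k+1)-th largest unsafe score: at most k scores are strictly above it.
    return scores[k]


def false_positive_ratio(unsafe_scores, t):
    """Fraction of unsafe points that would be predicted safe with threshold t."""
    return float(np.mean(np.asarray(unsafe_scores, dtype=float) > t))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-in for classifier scores on unsafe calibration points.
    calib_scores = rng.normal(loc=-1.0, scale=1.0, size=200)
    t = calibrate_threshold(calib_scores, eps=0.05)
    print("threshold:", t)
    print("empirical false positive ratio:", false_positive_ratio(calib_scores, t))
```

The bound holds by construction only on the calibration set used to pick `t`; a guarantee on unseen data needs an additional argument, such as the conformal-prediction-based calibration mentioned in the TL;DR.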

Theorems & Definitions (8)

  • Definition 1: $\varepsilon$-Safe Decision Region
  • Proposition 2: Scaling form of $\varepsilon$-SDR
  • Example 1: $\varepsilon$-SDR form for Gaussian distribution
  • Remark 3: On $p_S$ and data unbalance
  • Proposition 4: Dual form of MC-SVM
  • Remark 5: Computational cost of MC-SVM
  • Example 2
  • Example 3