Uncertainty-Aware Data-Efficient AI: An Information-Theoretic Perspective

Osvaldo Simeone, Yaniv Romano

TL;DR

This article surveys uncertainty-aware AI under data scarcity, foregrounding epistemic uncertainty as the principal bottleneck and framing it through an information-theoretic lens. It synthesizes two main strategies: quantifying uncertainty via generalized Bayesian and martingale posteriors, and reducing reliance on large labeled datasets through conformal methods and synthetic-data augmentation (PPI and SPI/GESPI). The work links information-theoretic generalization bounds to practical uncertainty quantification, and presents finite-sample guarantees for prediction sets via conformal prediction and its risk-controlling extensions, while showing how synthetic data can improve both training and calibration. Collectively, these approaches enable more reliable, context-specific AI systems in robotics, telecommunications, and healthcare, with a roadmap for future work on conditional coverage, distribution shifts, and large-scale multimodal settings.

Abstract

In context-specific applications such as robotics, telecommunications, and healthcare, artificial intelligence systems often face the challenge of limited training data. This scarcity introduces epistemic uncertainty, i.e., reducible uncertainty stemming from incomplete knowledge of the underlying data distribution, which fundamentally limits predictive performance. This review paper examines formal methodologies that address data-limited regimes through two complementary approaches: quantifying epistemic uncertainty and mitigating data scarcity via synthetic data augmentation. We begin by reviewing generalized Bayesian learning frameworks that characterize epistemic uncertainty through generalized posteriors in the model parameter space, as well as "post-Bayes" learning frameworks. We continue by presenting information-theoretic generalization bounds that formalize the relationship between training data quantity and predictive uncertainty, providing a theoretical justification for generalized Bayesian learning. Moving beyond methods with asymptotic statistical validity, we survey uncertainty quantification methods that provide finite-sample statistical guarantees, including conformal prediction and conformal risk control. Finally, we examine recent advances in data efficiency by combining limited labeled data with abundant model predictions or synthetic data. Throughout, we take an information-theoretic perspective, highlighting the role of information measures in quantifying the impact of data scarcity.

Paper Structure

This paper contains 16 sections, 31 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Key concepts on uncertainty-aware data-efficient AI reviewed in this article: (a) Bayesian learning, and post-Bayes variants thereof, describe uncertainty in the model parameter space by relying on the specification of a prior distribution and a likelihood/loss function, or possibly a predictive distribution; (b) Generalization bounds provide insights into the gap between population loss (i.e., risk) and training loss as a function of the training dataset size; (c) Conformal prediction, and variants thereof, provide means to calibrate prediction sets (e.g., error bars) so that they contain the true output with a user-specified coverage probability; and (d) Synthetic data can be leveraged to both enhance the trained predictor and improve the quality of the prediction sets.
  • Figure 2: Summary of information-theoretic relationships reviewed in this paper.
  • Figure 3: Prediction-powered inference (PPI) [angelopoulos2023ppi, angelopoulos2024ppiplus, zrnic2024cross], and the closely related doubly robust self-training approach [DRjordan, sifaou2024semi], integrate synthetic data to enhance model training, while generalized synthetic-powered predictive inference (GESPI) [bashari2025synthetic, bashari2025statistical] addresses model calibration.
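
To make the calibration idea in Figure 1(c) concrete, the following is a minimal sketch of split conformal prediction for regression. It is not taken from the paper: the predictor `predict`, the toy data, the absolute-residual score, and the helper names `conformal_quantile` and `prediction_interval` are all assumptions chosen for illustration. Under exchangeability of calibration and test points, the resulting interval contains the true output with probability at least 1 - alpha, marginally over the calibration and test data.

```python
import numpy as np


def conformal_quantile(cal_scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(cal_scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        # Too few calibration points for this alpha: only the trivial
        # (infinite) interval can meet the coverage target.
        return np.inf
    return np.sort(cal_scores)[k - 1]


def prediction_interval(y_pred, q):
    """Symmetric interval around a point prediction with half-width q."""
    return y_pred - q, y_pred + q


# Toy setup (hypothetical): a fixed, pre-trained predictor and noisy labels.
rng = np.random.default_rng(0)


def predict(x):
    return 2.0 * x  # stand-in for any pre-trained model


x_cal = rng.uniform(0.0, 1.0, size=200)
y_cal = 2.0 * x_cal + rng.normal(0.0, 0.1, size=200)

# Nonconformity score: absolute residual on held-out calibration data.
scores = np.abs(y_cal - predict(x_cal))
q = conformal_quantile(scores, alpha=0.1)

# Prediction interval for a new input, targeting 90% marginal coverage.
lo, hi = prediction_interval(predict(0.5), q)
```

The calibration set must be held out from training for the guarantee to apply. The guarantee is marginal rather than conditional on the input, which is one reason conditional coverage is listed among the open directions above.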