Data Models With Two Manifestations of Imprecision

Christian Fröhlich, Robert C. Williamson

TL;DR

The paper addresses the limitation of assuming i.i.d. data by introducing data models in which data are generated from a set of probability measures, thereby capturing two forms of imprecision: aggregate (ir)regularity and local (ir)regularity. It develops non-stationary, locally precise (NSLP) and stationary, locally imprecise (SLI) models, derives a main theorem linking cluster points of relative frequencies to the convex hull of the measure set $\mathcal{M}$, and situates these models within the imprecise-probability and generalized-LLN literature. It provides detailed comparisons to existing frameworks (notably Walley–Fine) and discusses estimation challenges, including a negative result for aggregate irregularity and practical strategies for local irregularity via selection rules. The work lays a foundation for principled imprecise scoring rules and calibration tailored to these data models, with applications to dataset shift, multi-source learning, and fairness contexts where subpopulation heterogeneity matters.
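The relationship between cluster points of relative frequencies and the convex hull of the measure set can be illustrated with a small simulation. The sketch below is not the paper's construction: it assumes binary outcomes, a set of two Bernoulli measures with parameters 0.2 and 0.8, and block lengths that double each time. The function names `nslp_sequence` and `running_frequencies` are hypothetical and chosen here for illustration only.

```python
import random


def nslp_sequence(p_low, p_high, n_blocks, seed=0):
    """Binary sequence drawn block-wise from two alternating Bernoulli measures.

    Assumption for illustration: block k has length 2**k, so each new block
    is about as long as everything before it and can pull the running
    frequency back toward its own parameter.
    """
    rng = random.Random(seed)
    data = []
    for k in range(n_blocks):
        p = p_low if k % 2 == 0 else p_high
        data.extend(1 if rng.random() < p else 0 for _ in range(2 ** k))
    return data


def running_frequencies(data):
    """Relative frequency of outcome 1 after each observation."""
    freqs, total = [], 0
    for n, x in enumerate(data, start=1):
        total += x
        freqs.append(total / n)
    return freqs


data = nslp_sequence(0.2, 0.8, n_blocks=16)
freqs = running_frequencies(data)

# Look only at the second half of the sequence (the final block).
tail = freqs[len(freqs) // 2:]
print("lowest running frequency in the tail: ", min(tail))
print("highest running frequency in the tail:", max(tail))
```

In this toy setting the running frequency does not settle on a single value: it keeps swinging between roughly 0.4 and 0.6, while staying inside the interval [0.2, 0.8] spanned by the two parameters. That behaviour is in the spirit of aggregate irregularity as summarized above, although the precise statement should be taken from the paper's main theorem.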

Abstract

Motivated by recently emerging problems in machine learning and statistics, we propose data models which relax the familiar i.i.d. assumption. In essence, we seek to understand what it means for data to come from a set of probability measures. We show that our frequentist data models, parameterized by such sets, manifest two aspects of imprecision. We characterize the intricate interplay of these manifestations, aggregate (ir)regularity and local (ir)regularity, where a much richer set of behaviours compared to an i.i.d. model is possible. In doing so we shed new light on the relationship between non-stationary, locally precise and stationary, locally imprecise data models. We discuss possible applications of these data models in machine learning and how the set of probabilities can be estimated. For the estimation of aggregate irregularity, we provide a negative result but argue that it does not warrant pessimism. Understanding these frequentist aspects of imprecise probabilities paves the way for deriving generalizations of proper scoring rules and calibration to the imprecise case, which can then contribute to tackling practical problems.
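The interplay between the two manifestations can be made concrete with a toy case in which the overall relative frequency looks perfectly regular, yet a selection rule exposes different behaviour along subsequences. The sketch below is an assumption-laden illustration, not the paper's formal definition: outcomes are binary, the generating measure alternates with the parity of the index, and the selection rule depends only on the index rather than on past outcomes. The names `alternating_sequence`, `select`, and `frequency` are hypothetical.

```python
import random


def alternating_sequence(p_odd, p_even, n, seed=0):
    """Binary sequence whose Bernoulli parameter alternates with index parity."""
    rng = random.Random(seed)
    return [
        1 if rng.random() < (p_odd if i % 2 == 1 else p_even) else 0
        for i in range(1, n + 1)
    ]


def select(data, rule):
    """Keep the outcome at (1-based) position i whenever rule(i) is true.

    Simplifying assumption: the rule sees only the index, not past outcomes.
    """
    return [x for i, x in enumerate(data, start=1) if rule(i)]


def frequency(data):
    """Relative frequency of outcome 1."""
    return sum(data) / len(data)


seq = alternating_sequence(p_odd=0.2, p_even=0.8, n=20000)

overall = frequency(seq)
on_odd = frequency(select(seq, lambda i: i % 2 == 1))
on_even = frequency(select(seq, lambda i: i % 2 == 0))

print("overall frequency:         ", overall)
print("frequency on odd indices:  ", on_odd)
print("frequency on even indices: ", on_even)
```

Here the overall frequency stays close to 0.5, as a single fair coin would produce, while the two selected subsequences sit near 0.2 and 0.8. An i.i.d. model cannot show such a gap between selected subsequences, which is one way to read the abstract's claim of richer behaviour.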

Paper Structure (19 sections, 36 theorems, 97 equations)

Key Result

Proposition 3.3

Let $\omega^\infty=(\omega_1,\omega_2,\ldots)$ be a data sequence and $\ell \in L^\infty$ any gamble. Then it holds:

Theorems & Definitions (73)

  • Definition 2.1: The i.i.d. model
  • Definition 2.2
  • Definition 2.3
  • Example 3.1
  • Example 3.2
  • Proposition 3.3 (cites ivanenko2017expected, frohlich2024strictly)
  • Theorem 3.4
  • Definition 3.5
  • Definition 3.6
  • Corollary 3.7
  • ...and 63 more