Conformal online model aggregation

Matteo Gasparin, Aaditya Ramdas

TL;DR

The paper tackles model selection in conformal prediction by proposing COMA, an online wrapper that aggregates multiple conformal prediction sets by weighted majority vote, with data-dependent weights updated via AdaHedge. When each input set has miscoverage at most α, COMA guarantees miscoverage at most 2α under a negative-correlation assumption between errors and weights, and regret bounds compare its performance to that of the best expert. COMA extends to distribution-shift scenarios by coupling with adaptive conformal inference, in decentralized and centralized variants that maintain valid coverage while adapting to changing data. Experiments in both iid and non-iid settings show that COMA often yields significantly smaller prediction sets without sacrificing coverage, which suits it to distributed systems and drift-prone applications.
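
The factor of 2 in the miscoverage bound can be traced to Markov's inequality. The following is a sketch rather than the paper's proof, and it rests on assumptions not stated on this page: the majority vote set is taken to be $\mathcal{C}_M^{(t)} = \{y : \sum_{k=1}^K w_k^{(t)} \mathbb{1}\{y \in \mathcal{C}_k^{(t)}\} > 1/2\}$, $\phi_k^{(t)} = \mathbb{1}\{Y^{(t)} \notin \mathcal{C}_k^{(t)}\}$ denotes the miscoverage indicator of set $k$ with $\mathbb{E}[\phi_k^{(t)}] \leq \alpha$, and the negative-correlation assumption is read as $\sum_{k=1}^K \mathrm{Cov}(w_k^{(t)}, \phi_k^{(t)}) \leq 0$. Because the weights sum to one, $Y^{(t)} \notin \mathcal{C}_M^{(t)}$ is the same event as $\sum_k w_k^{(t)} \phi_k^{(t)} \geq 1/2$, so

$$
\begin{aligned}
\mathbb{P}\big(Y^{(t)} \notin \mathcal{C}_M^{(t)}\big)
&= \mathbb{P}\Big(\sum_{k=1}^K w_k^{(t)} \phi_k^{(t)} \geq \tfrac{1}{2}\Big)
\leq 2 \sum_{k=1}^K \mathbb{E}\big[w_k^{(t)} \phi_k^{(t)}\big] \\
&= 2 \sum_{k=1}^K \Big( \mathbb{E}\big[w_k^{(t)}\big]\, \mathbb{E}\big[\phi_k^{(t)}\big] + \mathrm{Cov}\big(w_k^{(t)}, \phi_k^{(t)}\big) \Big)
\leq 2\alpha \sum_{k=1}^K \mathbb{E}\big[w_k^{(t)}\big] = 2\alpha .
\end{aligned}
$$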

Abstract

Conformal prediction equips machine learning models with a reasonable notion of uncertainty quantification without making strong distributional assumptions. It wraps around any prediction model and converts point predictions into set predictions with a predefined marginal coverage guarantee. However, conformal prediction only works if we fix the underlying machine learning model in advance. A relatively unaddressed issue in conformal prediction is that of model selection and/or aggregation: given a set of prediction models, which one should we conformalize? This paper suggests that instead of performing model selection, it can be prudent and practical to perform conformal set aggregation in an online, adaptive fashion. We propose a wrapper that takes in several conformal prediction sets (themselves wrapped around black-box prediction models), and outputs a single adaptively-combined prediction set. Our method, called conformal online model aggregation (COMA), is based on combining the prediction sets from several algorithms by weighted voting, and can be thought of as a sort of online stacking of the underlying conformal sets. As long as the input sets have (distribution-free) coverage guarantees, COMA retains coverage guarantees, under a negative correlation assumption between errors and weights. We verify that the assumption holds empirically in all settings considered. COMA is well-suited for decentralized or distributed settings, where different users may have different models, and are only willing to share their prediction sets for a new test point in a black-box fashion. As we demonstrate, it is also well-suited to settings with distribution drift and shift, where model selection can be imprudent.
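
The aggregation step described in the abstract can be illustrated with a minimal Python sketch. Several parts are assumptions rather than the authors' implementation: the input sets are intervals evaluated on a finite grid, the loss of each expert is the length of its interval, and the weights follow plain Hedge with a fixed learning rate `eta` instead of the AdaHedge schedule the paper uses. The function names `weighted_majority_vote` and `hedge_update` are hypothetical.

```python
import numpy as np

def weighted_majority_vote(intervals, weights, grid):
    """Boolean mask over `grid`: True where the weighted vote exceeds 1/2.

    intervals: (K, 2) array of [lower, upper] endpoints, one row per expert.
    weights:   (K,) nonnegative weights summing to 1.
    """
    lower = intervals[:, 0][:, None]
    upper = intervals[:, 1][:, None]
    # inside[k, j] is 1.0 when grid[j] lies in expert k's interval.
    inside = ((grid[None, :] >= lower) & (grid[None, :] <= upper)).astype(float)
    votes = weights @ inside
    return votes > 0.5

def hedge_update(cum_losses, new_losses, eta):
    """Plain Hedge (fixed eta, NOT AdaHedge): weights proportional to exp(-eta * cumulative loss)."""
    cum_losses = cum_losses + new_losses
    # Subtract the minimum before exponentiating for numerical stability.
    unnormalized = np.exp(-eta * (cum_losses - cum_losses.min()))
    return cum_losses, unnormalized / unnormalized.sum()

# Illustrative inputs (made up for this sketch).
intervals = np.array([[0.0, 2.0], [1.0, 3.0], [1.5, 4.0]])
weights = np.array([0.6, 0.3, 0.1])
grid = np.linspace(0.0, 4.0, 9)

mask = weighted_majority_vote(intervals, weights, grid)

# Assumed loss: interval length. Next-round weights use only losses observed so far.
losses = intervals[:, 1] - intervals[:, 0]
cum_losses, new_weights = hedge_update(np.zeros(3), losses, eta=1.0)
```

In this toy run the first expert holds more than half the weight, so the aggregated set coincides with its interval on the grid; the Hedge step then shifts weight away from the third expert, whose interval is the longest.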

Paper Structure

This paper contains 29 sections, 5 theorems, 36 equations, 11 figures, 9 tables, 3 algorithms.

Key Result

Theorem 3.2

Let $\mathcal{C}_1^{(t)}, \dots, \mathcal{C}_K^{(t)}$ be $K \geq 2$ different conformal prediction sets for $Y^{(t)}$, let $W^{(t)}$ be a random vector in $\Delta_{K-1}$ depending only on $\{Z^{(i)}\}_{i=1}^{t-1}$, and consider the weighted majority vote set $\mathcal{C}_M^{(t)}$ in eq:cm. It always

Figures (11)

  • Figure 1: Graphical summary of the COMA method. Conformal prediction wraps around the individual prediction algorithms, while COMA operates as a higher-level wrapper that aggregates these sets
  • Figure 2: Hedge loss ($h_t$) obtained during various iterations with either a constant or adaptive learning rate scheme. The case $\eta = 0$ coincides with the standard (non-weighted) majority vote. The case with $K=4$ algorithms is shown in the left plot, while the case with $K=5$ algorithms is shown in the right plot. The series have been smoothed using a moving average $(5, \frac{1}{5})$. In both cases, COMA with adaptive $\eta$ quickly achieves the smallest loss
  • Figure 3: Weights assumed by the regression algorithms during different iterations. The case with $K=4$ algorithms is shown in the top row, while the case with $K=5$ algorithms is shown in the bottom row. After a few iterations, the strategy with adaptive $\eta$ assigns full weight to the random forest. The COMA method with fixed $\eta$ is more conservative in assigning weights
  • Figure 4: Weights assumed by the classification algorithms during different iterations. After a few iterations, the strategy with adaptive $\eta$ puts unit mass on LDA, the best-performing algorithm
  • Figure 5: Empirical total elementwise correlation, computed as the empirical sum over time $(T)$ of the covariances between $\{w_{k}^{(t)}\}_{t=1}^T$ and $\{\phi_{k}^{(t)}\}_{t=1}^T$. Both series remain near zero over all iterations
  • ...and 6 more figures

Theorems & Definitions (12)

  • Theorem 3.2
  • proof
  • Remark 3.3
  • Remark 3.4
  • Remark 3.5
  • Lemma 3.6
  • proof
  • Lemma 4.1
  • proof
  • Theorem 4.2
  • ...and 2 more