Conformal online model aggregation

Matteo Gasparin; Aaditya Ramdas

TL;DR

The paper tackles model selection in conformal prediction by proposing COMA, an online wrapper that aggregates multiple conformal prediction sets by weighted voting, with data-dependent weights updated via AdaHedge. It establishes a 2α miscoverage guarantee under a negative-correlation assumption between errors and weights, and provides regret bounds relative to the best expert. COMA is extended to distribution-shift scenarios by coupling it with adaptive conformal inference, in decentralized and centralized variants that maintain valid coverage while adapting to changing data. Empirical results in both iid and non-iid settings show that COMA often yields significantly smaller prediction sets without sacrificing coverage, which makes it well suited to decentralized settings and to applications prone to distribution drift.
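To make the weight update concrete, here is a minimal sketch of exponential-weights (Hedge) updating with a fixed learning rate. It is not the AdaHedge scheme the paper uses, which tunes the learning rate adaptively; the fixed `eta`, the function names, and the use of generic per-round losses (for instance a normalized set size) are assumptions made for illustration only.

```python
import numpy as np


def hedge_weights(cum_losses, eta):
    """Exponential weights on the simplex computed from cumulative losses."""
    cum_losses = np.asarray(cum_losses, dtype=float)
    # Subtracting the minimum improves numerical stability and cancels
    # out after normalisation.
    w = np.exp(-eta * (cum_losses - cum_losses.min()))
    return w / w.sum()


def run_hedge(loss_matrix, eta):
    """Return a (T, K) array of weights for a (T, K) array of losses.

    Row t of the output depends only on losses from rounds before t,
    so the weights used at time t are fixed before the new point arrives.
    """
    loss_matrix = np.asarray(loss_matrix, dtype=float)
    T, K = loss_matrix.shape
    cum = np.zeros(K)
    weights = np.empty((T, K))
    for t in range(T):
        weights[t] = hedge_weights(cum, eta)  # uses past losses only
        cum += loss_matrix[t]                 # then observe round t
    return weights


# Hypothetical losses for K = 3 experts over T = 4 rounds.
example_losses = np.array([
    [0.9, 0.2, 0.5],
    [0.8, 0.1, 0.6],
    [0.7, 0.3, 0.5],
    [0.9, 0.2, 0.4],
])
example_weights = run_hedge(example_losses, eta=2.0)
```

With `eta = 0` every round uses uniform weights, which matches the remark in the Figure 2 caption that $\eta = 0$ coincides with the standard (non-weighted) majority vote.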

Abstract

Conformal prediction equips machine learning models with a reasonable notion of uncertainty quantification without making strong distributional assumptions. It wraps around any prediction model and converts point predictions into set predictions with a predefined marginal coverage guarantee. However, conformal prediction only works if we fix the underlying machine learning model in advance. A relatively unaddressed issue in conformal prediction is that of model selection and/or aggregation: given a set of prediction models, which one should we conformalize? This paper suggests that instead of performing model selection, it can be prudent and practical to perform conformal set aggregation in an online, adaptive fashion. We propose a wrapper that takes in several conformal prediction sets (themselves wrapped around black-box prediction models), and outputs a single adaptively-combined prediction set. Our method, called conformal online model aggregation (COMA), is based on combining the prediction sets from several algorithms by weighted voting, and can be thought of as a sort of online stacking of the underlying conformal sets. As long as the input sets have (distribution-free) coverage guarantees, COMA retains coverage guarantees, under a negative correlation assumption between errors and weights. We verify that the assumption holds empirically in all settings considered. COMA is well-suited for decentralized or distributed settings, where different users may have different models, and are only willing to share their prediction sets for a new test point in a black-box fashion. As we demonstrate, it is also well-suited to settings with distribution drift and shift, where model selection can be imprudent.
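The weighted-voting step can be sketched as follows for a classification problem with a finite label space. This is an illustration under stated assumptions, not the authors' implementation: a label is kept when the total weight of the input sets containing it strictly exceeds a threshold of 1/2, and the function name, the `threshold` parameter, and the toy sets are all hypothetical.

```python
import numpy as np


def weighted_majority_set(sets, weights, labels, threshold=0.5):
    """Merge K prediction sets by weighted vote over a finite label space.

    sets:      list of K Python sets of labels
    weights:   K non-negative weights summing to one
    labels:    iterable of all candidate labels
    threshold: assumed voting threshold; a label is kept when its
               total weight is strictly greater than this value
    """
    weights = np.asarray(weights, dtype=float)
    merged = set()
    for y in labels:
        # Total weight of the input sets that contain label y.
        vote = sum(w for w, s in zip(weights, sets) if y in s)
        if vote > threshold:
            merged.add(y)
    return merged


# Toy example with K = 3 input sets over the labels {0, 1, 2}.
input_sets = [{0, 1}, {1}, {1, 2}]
merged_a = weighted_majority_set(input_sets, [0.5, 0.3, 0.2], range(3))
merged_b = weighted_majority_set(input_sets, [0.6, 0.2, 0.2], range(3))
```

In this toy example, moving weight onto the first set changes the merged set: label 0 is excluded when the first set carries weight 0.5 and included when it carries weight 0.6.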

Paper Structure (29 sections, 5 theorems, 36 equations, 11 figures, 9 tables, 3 algorithms)

Introduction
Related work
Summary of contributions
Paper outline
Problem setup
Weighted majority with data-dependent weights
The COMA meta-algorithm
Loss function definition
Employing AdaHedge within COMA
COMA under distribution shift
Adaptive conformal inference methods
Decentralized COMA under distribution shift
Adaptive conformal inference directly applied on dynamic merging
Experimental results
Experiments in iid setting
...and 14 more sections

Key Result

Theorem 3.2

Let $\mathcal{C}_1^{(t)}, \dots, \mathcal{C}_K^{(t)}$ be $K \geq 2$ different conformal prediction sets for $Y^{(t)}$, let $W^{(t)}$ be a random vector in $\Delta_{K-1}$ depending only on $\{Z^{(i)}\}_{i=1}^{t-1}$, and consider the weighted majority vote set $\mathcal{C}_M^{(t)}$ in eq:cm. It always

Figures (11)

Figure 1: Graphical summary of the COMA method. Conformal prediction wraps around the individual prediction algorithms, while COMA operates as a higher-level wrapper that aggregates these sets
Figure 2: Hedge loss ($h_t$) obtained during various iterations with either a constant or adaptive learning rate scheme. The case $\eta = 0$ coincides with the standard (non-weighted) majority vote. The case with $K=4$ algorithms is shown in the left plot, while the case with $K=5$ algorithms is shown in the right plot. The series have been smoothed using a moving average $(5, \frac{1}{5})$. In both cases, COMA with adaptive $\eta$ quickly achieves the smallest loss
Figure 3: Weights assumed by the regression algorithms during different iterations. The case with $K=4$ algorithms is shown in the top row, while the case with $K=5$ algorithms is shown in the bottom row. After a few iterations, the strategy with adaptive $\eta$ assigns full weight to the random forest. The COMA method with fixed $\eta$ is more conservative in assigning weights
Figure 4: Weights assumed by the classification algorithms during different iterations. After a few iterations, the strategy with adaptive $\eta$ puts unit mass on LDA, the best-performing algorithm
Figure 5: Empirical total elementwise correlation, computed as the empirical sum over time $(T)$ of the covariances between $\{w_{k}^{(t)}\}_{t=1}^T$ and $\{\phi_{k}^{(t)}\}_{t=1}^T$. Both series remain near zero over all iterations
...and 6 more figures

Theorems & Definitions (12)

Theorem 3.2
proof
Remark 3.3
Remark 3.4
Remark 3.5
Lemma 3.6
proof
Lemma 4.1
proof
Theorem 4.2
...and 2 more
