Weighted Aggregation of Conformity Scores for Classification

Rui Luo; Zhixin Zhou

TL;DR

This paper extends conformal prediction for multiclass classification by aggregating multiple non-conformity score functions through optimal weight learning. By formulating weighted scores and exploring four data-splitting regimes (VFCP, EFCP, DLCP, DLCP+), it provides finite-sample validity guarantees and near-oracle efficiency, grounded in VC theory with a confirmed VC-dimension upper bound of $d+1$ for the relevant subgraph classes. Theoretical results show that, under reasonable assumptions, coverage remains at $1-\alpha$ while the expected prediction-set size approaches the oracle benchmark as data grow, with explicit bounds for each split strategy. Empirically, the approach yields consistently smaller, valid prediction sets compared to single-score baselines across CIFAR-10/100, and it demonstrates substantial gains when combining models, supporting the practical utility of score-function and model weighting in conformal prediction.

Abstract

Conformal prediction is a powerful framework for constructing prediction sets with valid coverage guarantees in multi-class classification. However, existing methods often rely on a single score function, which can limit their efficiency and informativeness. We propose a novel approach that combines multiple score functions to improve the performance of conformal predictors by identifying optimal weights that minimize prediction set size. Our theoretical analysis establishes a connection between the weighted score functions and subgraph classes of functions studied in Vapnik-Chervonenkis theory, providing a rigorous mathematical basis for understanding the effectiveness of the proposed method. Experiments demonstrate that our approach consistently outperforms single-score conformal predictors while maintaining valid coverage, offering a principled and data-driven way to enhance the efficiency and practicality of conformal prediction in classification tasks.
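
The idea of weighting several score functions and then calibrating can be sketched in a few lines. The following is a minimal illustration on synthetic data, not the authors' implementation: the two base scores (one minus the class probability, and an APS-style cumulative probability), the grid search over a two-component weight, the four-way equal split, and all function names are assumptions made for this example. The weight is chosen on two held-out folds to minimize average set size, and the threshold is then recalibrated on a third fold that played no part in the weight selection.

```python
import numpy as np

def lac_score(probs):
    # Assumed base score 1: one minus the predicted class probability.
    return 1.0 - probs

def aps_score(probs):
    # Assumed base score 2: cumulative probability of all classes ranked
    # at or above the candidate class (deterministic APS-style score).
    order = np.argsort(-probs, axis=1)
    cum = np.cumsum(np.take_along_axis(probs, order, axis=1), axis=1)
    out = np.empty_like(probs)
    np.put_along_axis(out, order, cum, axis=1)
    return out

def weighted_scores(stack, w):
    # stack has shape (K, n, C); the result has shape (n, C).
    return np.tensordot(w, stack, axes=1)

def conformal_threshold(cal_scores, alpha):
    # Split-conformal quantile: the ceil((n+1)(1-alpha))-th smallest score.
    n = len(cal_scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return np.inf if k > n else np.sort(cal_scores)[k - 1]

def true_label_scores(scores, labels):
    return scores[np.arange(len(labels)), labels]

def select_weight(stack, labels, idx1, idx2, alpha, grid):
    # Hypothetical selection rule: threshold on fold 1, set size on fold 2.
    best_w, best_size = None, np.inf
    for t in grid:
        w = np.array([t, 1.0 - t])
        s = weighted_scores(stack, w)
        q = conformal_threshold(true_label_scores(s[idx1], labels[idx1]), alpha)
        size = (s[idx2] <= q).sum(axis=1).mean()
        if size < best_size:
            best_w, best_size = w, size
    return best_w

# Synthetic classifier outputs (an assumption; no real model or dataset).
rng = np.random.default_rng(0)
n, C, alpha = 4000, 10, 0.1
labels = rng.integers(0, C, size=n)
logits = rng.normal(size=(n, C))
logits[np.arange(n), labels] += 2.0
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

stack = np.stack([lac_score(probs), aps_score(probs)])  # shape (2, n, C)
idx1, idx2, idx3, idx_test = np.split(rng.permutation(n), 4)

w_hat = select_weight(stack, labels, idx1, idx2, alpha, np.linspace(0, 1, 11))
s = weighted_scores(stack, w_hat)
q_hat = conformal_threshold(true_label_scores(s[idx3], labels[idx3]), alpha)

sets = s[idx_test] <= q_hat  # boolean mask: one prediction set per test row
coverage = sets[np.arange(len(idx_test)), labels[idx_test]].mean()
avg_size = sets.sum(axis=1).mean()
print(f"w_hat={w_hat}, coverage={coverage:.3f}, avg_size={avg_size:.2f}")
```

Because the calibration fold is disjoint from the folds used to pick the weight, the usual split-conformal argument applies to the final threshold. Reusing the same data for both steps would trade that guarantee for potentially smaller sets, which is the kind of trade-off the paper's different splitting regimes address.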

Paper Structure (37 sections, 10 theorems, 48 equations, 6 figures, 1 table, 4 algorithms)

Introduction
Methodology
Conformal Prediction for Classification
Various Score Functions for Classifications
Averaging Score Functions
The Optimal Weight and the Threshold
Data Splitting
Theoretical Analysis
Overview of the Results
Results by Vapnik–Chervonenkis Theory
Consistency of VFCP
Consistency of EFCP
Consistency of DLCP
Consistency of DLCP+
Conclusion of Theoretical Results
...and 22 more sections

Key Result

Lemma 1

Suppose the samples in $\mathcal{I}$ are i.i.d., then

Figures (6)

Figure 1: This example illustrates a framework for data splitting into $\mathcal{I}_1, \mathcal{I}_2, \mathcal{I}_3$, and $\mathcal{I}_\text{test}$. Algorithm \ref{alg:weight} presents the complete procedure. Briefly, $\mathcal{I}_1$ and $\mathcal{I}_2$ are used in Steps 1-2 to select the optimal weight $\widehat{\mathbf{w}}$, while $\mathcal{I}_3$ is used in Step 3 as the calibration set for $\mathcal{I}_\text{test}$ predictions. We present four options: VFCP, EFCP, DLCP, and DLCP+. Their coverage and size properties are discussed theoretically in Section \ref{sec:theory} and empirically in Section \ref{sec:experiment}.
Figure 2: Boxplot comparison of different score functions at a significance level of $\alpha=0.01$ on CIFAR-100. Our weighted combination method achieves the guaranteed coverage of 99% while maintaining the smallest prediction set size.
Figure 3: Comparison of size vs. coverage for various score functions and our proposed method across $\alpha$ values (0.01–0.05). Our weighted combination method (red) consistently outperforms the other baseline methods by achieving the desired coverage rate with smaller prediction set sizes.
Figure 4: Across various score functions, our weighted combination of models outperformed any individual model and achieved optimal size on the CIFAR-10 dataset across $\alpha$ values (0.01–0.05).
Figure 5: Across various score functions, our weighted combination of models outperformed any individual model and achieved optimal size on the CIFAR-100 dataset across $\alpha$ values (0.01–0.05).
...and 1 more figure

Theorems & Definitions (16)

Lemma 1
Remark
Theorem 1
Theorem 2
Theorem 3
Theorem 4
Proposition 1
proof
Lemma 2
proof
...and 6 more
