Tailoring Mixup to Data for Calibration

Quentin Bouniot; Pavlo Mozharovskyi; Florence d'Alché-Buc

TL;DR

This work addresses calibration gaps in Mixup-based data augmentation by introducing Similarity Kernel Mixup (SK Mixup), which warps interpolation through a similarity-driven kernel to mix similar samples more strongly while attenuating mixing for distant pairs. The method links the likelihood of label noise to manifold distance via a Wasserstein-based bound, and uses a warping function $\omega_{\tau}$ to realize Beta$(\tau,\tau)$–distributed coefficients in a computationally efficient way. A Gaussian similarity kernel computes pairwise warping parameters from batch-distance statistics, enabling distance-aware mixing in both classification (embedding-distance) and regression (label-distance) settings. Extensive experiments across image classification and regression tasks show improved calibration (lower ECE/AECE, UCE/ENCE) with competitive or better accuracy, plus substantial efficiency gains over state-of-the-art calibration-driven Mixup methods. The findings suggest SK Mixup offers a practical, scalable augmentation strategy that enhances model reliability, including under distribution shifts and OOD conditions, and can be combined with RegMixup for further gains.
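The distance-aware mixing described above can be illustrated with a minimal NumPy sketch. It rests on several assumptions that are not taken from the paper: the kernel form, the normalization by the mean partner distance, and the names `pairwise_tau`, `distance_aware_mixup`, `tau_max`, `tau_std` and `eps` are all hypothetical. It also samples coefficients directly from Beta$(\tau,\tau)$ instead of using the warping function $\omega_{\tau}$.

```python
import numpy as np

def pairwise_tau(feats, perm, tau_max=1.0, tau_std=0.5):
    """Per-pair Beta parameter from a Gaussian kernel on normalized distances.

    tau_max and tau_std are hypothetical knobs chosen for this sketch.
    """
    flat = feats.reshape(len(feats), -1)
    # Distance between each sample and its mixing partner.
    dist = np.linalg.norm(flat - flat[perm], axis=1)
    # Normalize by the mean partner distance in the batch (a batch statistic).
    mean = dist.mean()
    d_norm = dist / mean if mean > 0 else dist
    # Gaussian kernel: tau equals tau_max for identical pairs and decays with distance.
    return tau_max * np.exp(-d_norm**2 / (2.0 * tau_std**2))

def distance_aware_mixup(x, y, rng, tau_max=1.0, tau_std=0.5, eps=1e-3):
    """Mix each sample with a random partner using lambda ~ Beta(tau, tau)."""
    perm = rng.permutation(len(x))
    # Clamp tau away from zero so the Beta parameters stay valid.
    tau = np.maximum(pairwise_tau(x, perm, tau_max, tau_std), eps)
    # Small tau pushes lambda toward 0 or 1, i.e. weak mixing for distant pairs.
    lam = rng.beta(tau, tau)
    lam_x = lam.reshape(-1, *([1] * (x.ndim - 1)))
    lam_y = lam.reshape(-1, *([1] * (y.ndim - 1)))
    x_mix = lam_x * x + (1.0 - lam_x) * x[perm]
    y_mix = lam_y * y + (1.0 - lam_y) * y[perm]
    return x_mix, y_mix, lam
```

Here `x` would hold inputs (or embeddings) and `y` one-hot labels for classification or continuous targets for regression. For regression, the TL;DR indicates that distances are computed on labels rather than inputs, which this sketch does not do.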

Abstract

Among all data augmentation techniques proposed so far, linear interpolation of training samples, also called Mixup, has been found to be effective for a wide range of applications. Along with improved predictive performance, Mixup is also a good technique for improving calibration. However, mixing data carelessly can lead to manifold mismatch, i.e., synthetic data lying outside the original class manifolds, which can deteriorate calibration. In this work, we show that the likelihood of assigning a wrong label with Mixup increases with the distance between the data to mix. We therefore propose to dynamically change the underlying distributions of interpolation coefficients depending on the similarity between the samples to mix, and define a flexible framework to do so without losing diversity. We provide extensive experiments for classification and regression tasks, showing that our proposed method improves the predictive performance and calibration of models, while being much more efficient than existing calibration-driven Mixup methods.
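For readers unfamiliar with how calibration is scored in classification, the following is a minimal sketch of the standard Expected Calibration Error (ECE) with equal-width confidence bins. The function name `expected_calibration_error` and the default of 15 bins are assumptions made for illustration; the paper's exact evaluation protocol may differ.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """ECE: bin-weighted average of |accuracy - confidence| over confidence bins."""
    conf = probs.max(axis=1)            # confidence of the predicted class
    pred = probs.argmax(axis=1)         # predicted class
    correct = (pred == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - conf[in_bin].mean())
            # Weight each bin's gap by the fraction of samples it contains.
            ece += in_bin.mean() * gap
    return float(ece)
```

A perfectly calibrated model has an ECE of 0; larger values mean that confidence and accuracy disagree more.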

Paper Structure (52 sections, 6 theorems, 22 equations, 4 figures, 17 tables, 1 algorithm)

Introduction
Related Work
Data augmentation based on mixing data
Calibration in classification and regression
Similarity Kernel Mixup
Manifold Mismatch
Warped Mixup
Similarity Kernel
Experiments
Protocols
Classification
Regression
Efficiency Comparison
Conclusion
Broader Impact
...and 37 more sections

Key Result

Theorem 3.1

For any pair of manifolds $\mathcal{M}_i, \mathcal{M}_j$, there exist ${\mathbf{x}}_k, {\mathbf{x}}_l \in \mathcal{M}_i \cup \mathcal{M}_j$ and $\lambda_1, \lambda_2 \in [0,1]$ with $\lambda_1 > \lambda_2$ such that:

Figures (4)

Figure 1: (a) Probability that the predicted label of a mixed sample corresponds to the label of either of the two points used for mixing, depending on the distance between the two points. (b) Performance (accuracy in %, higher is better, bottom) and calibration (ECE, lower is better, top) comparison with ResNet34 on the CIFAR10 (left) and CIFAR100 (right) datasets. We compare results when mixing only pairs whose distance is Higher (in green) or Lower (in orange) than the median of all pairwise distances within each batch.
Figure 2: (a) Density of interpolation coefficients $\omega_\tau(\lambda)$ after warping with the similarity kernel, depending on the distance between pairs. (b) Probability that the predicted label of a mixed sample with SK Mixup corresponds to the label of either of the two points used for mixing.
Figure 3: Decision frontiers and data used during training (circles) and testing (stars) for (a) ERM, (b) Mixup, and (c) our Similarity Kernel Mixup, on Moons toy dataset.
Figure 4: Decision frontiers and data used during training (circles) and testing (stars) for (a) ERM, (b) Mixup, and (c) our Similarity Kernel Mixup, on Circles toy dataset.
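
The median split behind Figure 1(b) can be sketched as follows. This is a hedged illustration, not the authors' code: the function name `median_split_masks`, the Euclidean distance on flattened inputs, and the use of a permutation to choose mixing partners are assumptions.

```python
import numpy as np

def median_split_masks(feats, perm):
    """Flag pairs whose distance is below / above the batch median distance."""
    flat = feats.reshape(len(feats), -1)
    # All pairwise Euclidean distances within the batch.
    diff = flat[:, None, :] - flat[None, :, :]
    all_dist = np.linalg.norm(diff, axis=-1)
    # Median over distinct pairs (upper triangle, diagonal excluded).
    iu = np.triu_indices(len(flat), k=1)
    median = np.median(all_dist[iu])
    # Distance between each sample and its assigned mixing partner.
    partner_dist = all_dist[np.arange(len(flat)), perm]
    lower = partner_dist < median
    higher = partner_dist > median
    return lower, higher, median
```

Under these assumptions, the "Lower" setting would mix only the pairs selected by `lower` and leave the rest unmixed, and the "Higher" setting would do the same with `higher`.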

Theorems & Definitions (12)

Theorem 3.1
Theorem 3.2
Proposition 3.3
proof
proof
Definition F.1
Lemma F.2
Lemma F.3
Definition F.4
Lemma F.5
...and 2 more
