Margin and Consistency Supervision for Calibrated and Robust Vision Models

Salim Khazem

TL;DR

MaCS is a simple, architecture-agnostic regularization framework that jointly enforces logit-space separation and local prediction stability, consistently improving calibration and robustness to common corruptions while preserving or improving top-1 accuracy.

Abstract

Deep vision classifiers often achieve high accuracy while remaining poorly calibrated and fragile under small distribution shifts. We present Margin and Consistency Supervision (MaCS), a simple, architecture-agnostic regularization framework that jointly enforces logit-space separation and local prediction stability. MaCS augments cross-entropy with (i) a hinge-squared margin penalty that enforces a target logit gap between the correct class and the strongest competitor, and (ii) a consistency regularizer that minimizes the KL divergence between predictions on clean inputs and mildly perturbed views. We provide a unifying theoretical analysis showing that increasing the classification margin while reducing local sensitivity, formalized via a Lipschitz-type stability proxy, yields improved generalization guarantees and a provable robustness-radius bound that scales with the margin-to-sensitivity ratio. Across several image classification benchmarks and backbones spanning CNNs and Vision Transformers, MaCS consistently improves calibration (lower ECE and NLL) and robustness to common corruptions while preserving or improving top-1 accuracy. Our approach requires no additional data and no architectural changes, and adds negligible inference overhead, making it an effective drop-in replacement for standard training objectives.
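The three-term objective described in the abstract can be sketched for a single example as follows. This is a minimal, framework-free illustration rather than the authors' implementation: the function names, the weights `lambda_margin` and `lambda_cons`, the default `target_margin`, and the KL direction (clean predictions as the reference distribution) are all assumptions made here for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def logit_margin(logits, label):
    """Gap between the correct-class logit and the strongest competitor."""
    competitor = max(z for k, z in enumerate(logits) if k != label)
    return logits[label] - competitor


def macs_style_loss(clean_logits, perturbed_logits, label,
                    target_margin=1.0, lambda_margin=0.1, lambda_cons=1.0):
    """Cross-entropy + hinge-squared margin penalty + KL consistency.

    The weights and target margin are hypothetical defaults, not values
    taken from the paper.
    """
    lp_clean = log_softmax(clean_logits)
    lp_pert = log_softmax(perturbed_logits)

    # (0) Standard cross-entropy on the clean input.
    cross_entropy = -lp_clean[label]

    # (i) Hinge-squared penalty: zero once the margin reaches the target.
    gap = logit_margin(clean_logits, label)
    margin_penalty = max(0.0, target_margin - gap) ** 2

    # (ii) KL(p_clean || p_perturbed); the direction is an assumption.
    consistency_kl = sum(
        math.exp(lc) * (lc - lq) for lc, lq in zip(lp_clean, lp_pert)
    )

    total = (cross_entropy
             + lambda_margin * margin_penalty
             + lambda_cons * consistency_kl)
    return {
        "total": total,
        "cross_entropy": cross_entropy,
        "margin_penalty": margin_penalty,
        "consistency_kl": consistency_kl,
    }


# Example: logits for a clean image and a mildly perturbed view of it.
terms = macs_style_loss([2.0, 1.5, -1.0], [1.8, 1.6, -0.9], label=0)
print(terms)
```

In an actual training loop the same three terms would be computed on batched tensors in an autodiff framework and averaged over the batch. Both added terms act only at training time, which is consistent with the abstract's claim of negligible inference overhead.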

Paper Structure

This paper contains 22 sections, 3 theorems, 16 equations, 6 figures, 11 tables, and 1 algorithm.

Key Result

Theorem 4.2

Let $f: \mathbb{R}^d \to \mathbb{R}^K$ be a neural network with spectral complexity $R_f$, and let $\mathcal{D}$ be a distribution over $\mathbb{R}^d \times [K]$. For any margin $\gamma > 0$, with probability at least $1 - \delta$ over an i.i.d. training set $S$ of size $n$ drawn from $\mathcal{D}$: where $\hat{L}_\gamma(S)$ is the fraction of training samples with margin less than $\gamma$, $B$ i

Figures (6)

  • Figure 1: Per-dataset model curves across methods. Each line corresponds to a model and traces accuracy across training objectives, highlighting method-specific gains.
  • Figure 2: Overview of MaCS training. The model processes both clean input $x$ and perturbed input $\tilde{x} = T(x)$. The total loss combines cross-entropy, a margin penalty encouraging $\gamma(x) \geq \Delta$, and a KL-based consistency term enforcing prediction stability.
  • Figure 3: Accuracy improvement of MaCS over baseline (cross-entropy) across all dataset--model configurations. Each line connects the baseline (left) to MaCS (right) accuracy. MaCS improves over baseline in the large majority of settings, with the largest gains on CIFAR and Food-101.
  • Figure 4: Negative log-likelihood comparison for ResNet-50 on CIFAR-10/100. MaCS improves NLL relative to baseline but does not always outperform the strongest calibration baselines.
  • Figure 5: Corruption robustness on CIFAR-10-C and CIFAR-100-C (ResNet-50). MaCS consistently outperforms all baselines including Mixup.
  • ...and 1 more figure
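The calibration metrics named in the abstract and in Figure 4 (ECE and NLL) can be computed from predicted class probabilities as sketched below. This uses the common equal-width-bin definition of expected calibration error. The bin count of 10, the half-open binning convention, and the function names are assumptions, since the summary does not state the paper's evaluation settings.

```python
import math


def negative_log_likelihood(probs, labels, eps=1e-12):
    """Mean negative log-probability assigned to the true class."""
    total = sum(-math.log(max(p[y], eps)) for p, y in zip(probs, labels))
    return total / len(labels)


def expected_calibration_error(probs, labels, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - confidence| per bin.

    n_bins=10 is a hypothetical default, not a setting from the paper.
    """
    conf_sum = [0.0] * n_bins     # summed confidence per bin
    correct_sum = [0.0] * n_bins  # number of correct predictions per bin
    for p, y in zip(probs, labels):
        confidence = max(p)
        prediction = p.index(confidence)
        # Bin b covers [b / n_bins, (b + 1) / n_bins); 1.0 goes to the last bin.
        b = min(int(confidence * n_bins), n_bins - 1)
        conf_sum[b] += confidence
        correct_sum[b] += 1.0 if prediction == y else 0.0
    n = len(labels)
    # sum_b (n_b / n) * |acc_b - conf_b| simplifies to the expression below.
    return sum(abs(c - a) for c, a in zip(conf_sum, correct_sum)) / n


# Example: four two-class predictions with their true labels.
probs = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.7, 0.3]]
labels = [0, 1, 1, 0]
print(expected_calibration_error(probs, labels))
print(negative_log_likelihood(probs, labels))
```

Lower is better for both metrics. NLL penalizes confident mistakes on each sample, while ECE measures how far average confidence drifts from empirical accuracy within each confidence bin.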

Theorems & Definitions (9)

  • Definition 4.1: Spectral Complexity
  • Theorem 4.2: Margin-Based Generalization (Bartlett et al., 2017)
  • Definition 4.3: Local Sensitivity
  • Remark 4.4: Consistency Controls Sensitivity
  • Theorem 4.5: Margin-Stability Robustness Radius
  • proof
  • Corollary 4.6: Radius Under Lipschitz Logits
  • proof
  • Remark 4.7: On the Theory-Practice Gap
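
The idea behind a robustness radius that scales with the margin-to-sensitivity ratio (Theorem 4.5 and Corollary 4.6) can be illustrated with a standard argument. The sketch below assumes every logit $f_k$ is $L$-Lipschitz with respect to a norm $\|\cdot\|$. This assumption, the symbol $L$, and the resulting constant are illustrative choices made here; the paper's exact statement and constants may differ.

```latex
% Assumption (illustrative): each logit f_k is L-Lipschitz w.r.t. \|\cdot\|.
% Margin at x with true label y:
\gamma(x) = f_y(x) - \max_{j \neq y} f_j(x) > 0 .

% For any competitor j \neq y and any perturbation \delta:
f_y(x+\delta) - f_j(x+\delta)
  \;\geq\; \bigl(f_y(x) - L\|\delta\|\bigr) - \bigl(f_j(x) + L\|\delta\|\bigr)
  \;\geq\; \gamma(x) - 2L\|\delta\| .

% Hence the predicted class is unchanged whenever
\|\delta\| \;<\; \frac{\gamma(x)}{2L} .
```

Under these assumptions, enlarging the margin $\gamma(x)$ (the role of the margin penalty) or shrinking the sensitivity $L$ (the role of the consistency term) both enlarge the certified radius, which matches the margin-to-sensitivity scaling described in the abstract.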