Calibration Attention: Learning Reliability-Aware Representations for Vision Transformers

Wenhao Liang, Wei Emma Zhang, Lin Yue, Miao Xu, Mingyu Guo, Olaf Maennel, Weitong Chen

TL;DR

Treating calibration as a representation-level problem is a practical and effective direction for trustworthy uncertainty estimation in transformers.

Abstract

Most calibration methods operate at the logit level, implicitly assuming that miscalibration can be corrected without changing the underlying representation. We challenge this assumption and propose \textbf{Calibration Attention (CalAttn)}, a \emph{representation-aware} calibration module for vision transformers that couples instance-wise temperature scaling to transformer token geometry under a proper scoring objective. CalAttn predicts a sample-specific temperature from the \texttt{[CLS]} token and backpropagates calibration gradients into the backbone, thereby reshaping the uncertainty structure of the representation rather than post-hoc adjusting confidence. This yields \emph{token-conditioned uncertainty modulation} with negligible overhead (\(<0.1\%\) additional parameters). Across multiple datasets with ViT/DeiT/Swin backbones, CalAttn consistently improves calibration while preserving accuracy, achieving relative ECE reductions of \(3.7\%\) to \(77.7\%\) over strong baselines across diverse training objectives. Our results indicate that treating calibration as a representation-level problem is a practical and effective direction for trustworthy uncertainty estimation in transformers. Code: [https://github.com/EagleAdelaide/CalibrationAttention-CalAttn-](https://github.com/EagleAdelaide/CalibrationAttention-CalAttn-)
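The forward pass described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the linear head followed by a softplus, the `eps` floor, and the names `predict_temperature` and `calibrated_probs` are assumptions chosen only to produce a strictly positive, per-sample scale from the [CLS] embedding. Dividing the logits by that scale follows the description of the method; the end-to-end training that sends calibration gradients into the backbone is not shown.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x)); the result is always positive."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def softmax(logits):
    """Softmax with max-subtraction for numerical stability."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_temperature(cls_token, weights, bias, eps=1e-3):
    """Hypothetical calibration head: linear map of the [CLS] embedding,
    then softplus plus a small floor so the scale stays strictly positive."""
    pre_activation = sum(w * z for w, z in zip(weights, cls_token)) + bias
    return softplus(pre_activation) + eps


def calibrated_probs(logits, cls_token, weights, bias):
    """Divide the class logits by the per-sample scale before the softmax."""
    s = predict_temperature(cls_token, weights, bias)
    return softmax([v / s for v in logits])


if __name__ == "__main__":
    # Toy values, made up for illustration only.
    logits = [2.0, 0.5, -1.0]
    cls_token = [0.3, -0.2, 0.8]
    weights, bias = [1.0, 0.5, 2.0], 0.0
    print("scale:", predict_temperature(cls_token, weights, bias))
    print("uncalibrated:", softmax(logits))
    print("calibrated:  ", calibrated_probs(logits, cls_token, weights, bias))
```

Because the scale is a positive scalar, dividing by it never changes the predicted class; a scale above 1 softens the distribution and a scale below 1 sharpens it. In this sketch only the confidence, not the argmax, depends on the [CLS] embedding.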

Paper Structure

This paper contains 73 sections, 28 equations, 7 figures, 18 tables, 1 algorithm.

Figures (7)

  • Figure 1: Vision Transformer with Calibration Attention. A lightweight calibration head reads the final [CLS] token and predicts a strictly positive, per-sample temperature $s(\mathbf z)$. The class logits produced by the standard classifier head are divided by this scale prior to the softmax, enabling representation-conditioned cooling or sharpening of predictive confidence. CalAttn learns reliability-aware representations by conditioning confidence on the [CLS] embedding, replacing global post-hoc temperature scaling.
  • Figure 2: [CLS] $\ell_2$-norm versus maximum softmax probability on CIFAR-10/100. Blue dots are samples; red curves show bin means. Correlation is moderate and non-deterministic, motivating learning a representation-aware calibration function.
  • Figure 3: ECE (%) on CIFAR-10/100 across ViT-224, DeiT-S, and Swin-S with relative changes ($\Delta$ %, “$+$” increase, “$-$” decrease).
  • Figure 4: Temporal dynamics of CalAttn on ViT-224 (CIFAR-10). Top: mean effective temperature $\tau(x)=1/s(x)$ (red) and its coefficient of variation $\mathrm{CV}_{\tau}$ (blue) over training. The dashed line marks the optimal global temperature $T^\star$ obtained by post-hoc temperature scaling. Bottom: kernel density estimates of $\tau(x)$ and $\lVert z_{\mathrm{CLS}}\rVert_2$ at three epochs.
  • Figure 5: Reliability diagrams before temperature scaling (CIFAR-10/100, 300 epochs).
  • ...and 2 more figures
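
Several of the captions above report ECE and relative changes in ECE. As a reading aid, the standard binned estimator of expected calibration error can be sketched as below. The choice of 15 equal-width bins and the helper names `expected_calibration_error` and `relative_change_percent` are assumptions for illustration; the captions listed here do not state the paper's binning.

```python
def expected_calibration_error(confidences, correct, n_bins=15):
    """Binned ECE: sample-weighted mean of |accuracy - mean confidence| per bin.

    confidences: max softmax probability per sample, each in [0, 1].
    correct:     1 if the prediction was right, 0 otherwise.
    n_bins:      number of equal-width confidence bins (assumed value).
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Half-open bins [k/n_bins, (k+1)/n_bins); conf == 1.0 goes to the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


def relative_change_percent(baseline, new):
    """Signed relative change in percent; negative means the metric decreased."""
    return 100.0 * (new - baseline) / baseline


if __name__ == "__main__":
    # Toy data: four predictions at 0.9 confidence, three of them correct.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0]))
```

In the toy call, all four samples fall in one bin with mean confidence 0.9 and accuracy 0.75, so the estimate is the gap between those two numbers. A lower value means confidence tracks accuracy more closely.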