Diffusion-Inspired Reconfiguration of Transformers for Uncertainty Calibration

Manh Cuong Dao; Quang Hung Pham; Phi Le Nguyen; Thao Nguyen Truong; Bryan Kian Hsiang Low; Trong Nghia Hoang

TL;DR

The paper tackles uncertainty calibration in large pretrained transformers by introducing a diffusion-inspired reconfiguration that treats the sequence of feature transformations as a probabilistic path. It distills this path into a unified reverse-time diffusion process with a single spatiotemporal transition kernel, training via a KL-based objective and a performance-guided loss to preserve accuracy. The proposed DIRECTOR framework reparameterizes transformer blocks into neuralized Gaussian transitions, enabling principled uncertainty propagation with improved calibration, robustness to distribution shifts, and better OOD detection, while achieving memory efficiency due to a smaller unified transition model. Empirical results across vision and language benchmarks (e.g., CIFAR-10/100, IMDB, CoLA, CIFAR-10-C) show state-of-the-art calibration and strong predictive performance, often outperforming GP-reparameterized baselines and vanilla transformers, with statistical significance in many settings. This diffusion-based reconfiguration offers a scalable, interpretable route to integrate probabilistic reasoning into foundation models, enhancing reliability in safety-critical applications.

Abstract

Uncertainty calibration in pre-trained transformers is critical for their reliable deployment in risk-sensitive applications. Yet, most existing pre-trained transformers do not have a principled mechanism for uncertainty propagation through their feature transformation stack. In this work, we propose a diffusion-inspired reconfiguration of transformers in which each feature transformation block is modeled as a probabilistic mapping. Composing these probabilistic mappings reveals a probability path that mimics the structure of a diffusion process, transporting data mass from the input distribution to the pre-trained feature distribution. This probability path can then be recompiled on a diffusion process with a unified transition model to enable principled propagation of representation uncertainty throughout the pre-trained model's architecture while maintaining its original predictive performance. Empirical results across a variety of vision and language benchmarks demonstrate that our method achieves superior calibration and predictive accuracy compared to existing uncertainty-aware transformers.
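One way to read the construction described above, offered as a hedged sketch rather than the paper's exact formulation. Features are indexed in reverse time, consistent with Figure 1, where the first layer is $\mathbf{X}_6$ and the last layer is $\mathbf{X}_0$. The symbols $T$, $p_t$, $q_\theta$, $\boldsymbol{\mu}$, $\boldsymbol{\Sigma}$, $\lambda$, and $\mathcal{L}_{\text{perf}}$, as well as the direction of the KL term, are assumptions made for illustration. Each pre-trained block is treated as a Gaussian mapping:

$$p_t(\mathbf{X}_{t-1} \mid \mathbf{X}_t) = \mathcal{N}\big(\mathbf{X}_{t-1};\, \boldsymbol{\mu}_t(\mathbf{X}_t),\, \boldsymbol{\Sigma}_t(\mathbf{X}_t)\big), \quad t = T, \dots, 1.$$

Composing the blocks gives a probability path from the input to the final features:

$$p(\mathbf{X}_0 \mid \mathbf{X}_T) = \int \prod_{t=1}^{T} p_t(\mathbf{X}_{t-1} \mid \mathbf{X}_t)\, \mathrm{d}\mathbf{X}_{1:T-1}.$$

A single transition model $q_\theta(\mathbf{X}_{t-1} \mid \mathbf{X}_t, t)$, shared across all steps, could then be fitted to this path with a KL term plus a performance-guided term:

$$\min_\theta \; \sum_{t=1}^{T} \mathbb{E}_{\mathbf{X}_t}\Big[\mathrm{KL}\big(p_t(\cdot \mid \mathbf{X}_t)\,\|\, q_\theta(\cdot \mid \mathbf{X}_t, t)\big)\Big] + \lambda\, \mathcal{L}_{\text{perf}}(\theta).$$

Under this reading, uncertainty at the output comes from propagating the Gaussian transitions of $q_\theta$ step by step, and the memory saving comes from replacing $T$ block-specific mappings with one shared model.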

Paper Structure (30 sections, 34 equations, 4 figures, 18 tables)

Introduction
Diffusion-Inspired Reconfiguration of Transformers
Reconfiguring Pre-Trained Transformer as Probability Path
Distilling Transformer-based Probability Path on Diffusion Model
Experiments
Experiment Settings
Results and Discussion
In-Distribution Classification
Distribution Shift Robustness
Out-of-Distribution Detection
Related Work
Conclusion
Appendix / supplemental material
Multi-Head Self-Attention
Self-Attention as Gaussian Process Inference
...and 15 more sections

Figures (4)

Figure 1: Comparison of accuracy (ACC$\uparrow$) and uncertainty calibration (ECE$\downarrow$) across pretrained models (ViT, Transformer), the GP-reparameterized method KEP (Chen et al., 2024) applied to either the last attention block (KEP-last) or all attention blocks (KEP-All), and our method (DIRECTOR) on the (a) CIFAR-10 and (b) CoLA datasets. Panel (c) compares the correlation between features at the first layer ($\mathbf{X}_6$) and those at deeper layers ($\mathbf{X}_4$, $\mathbf{X}_2$, and $\mathbf{X}_0$ at the last layer) for DIRECTOR and KEP-All on the CIFAR-10 dataset.
Figure 2: Restructuring a pre-trained transformer such that each block outputs a Gaussian distribution over its intermediate features, effectively aligning its architecture with a probabilistic path.
Figure 3: Calibration comparison of pre-trained models with their corresponding diffusion-based reconfigurations produced by DIRECTOR on CIFAR-10-C over 5 severity levels of corruption. The notation $\texttt{S-k}$ denotes severity level $k$. DIRECTOR achieves competitive accuracy and outperforms pre-trained models in most calibration metrics.
Figure 4: Comparison of training curves for the KEP-7/7 and ViT models. Panels (a) and (b) show the loss curves, while (c) and (d) show the validation accuracy over training steps.
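
The figures above report expected calibration error (ECE). As a minimal sketch of how this metric is commonly computed, the function below implements the standard equal-width-binned ECE in plain Python. The function name, the default of 15 bins, and the input layout (one top-class confidence and one correctness flag per prediction) are assumptions for illustration; the source does not state which ECE variant or bin count the paper uses.

```python
def expected_calibration_error(confidences, correct, n_bins=15):
    """Equal-width binned ECE: sum over bins of weight * |accuracy - mean confidence|.

    confidences: top-class probabilities in [0, 1], one per prediction.
    correct: 1/0 (or True/False) flags, one per prediction.
    n_bins: number of equal-width bins (15 is an assumed default).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    # Per-bin accumulators: count, sum of confidences, number of correct predictions.
    counts = [0] * n_bins
    conf_sums = [0.0] * n_bins
    hit_sums = [0.0] * n_bins

    for c, y in zip(confidences, correct):
        # Map a confidence in [0, 1] to a bin index; c == 1.0 falls in the last bin.
        b = min(int(c * n_bins), n_bins - 1)
        counts[b] += 1
        conf_sums[b] += c
        hit_sums[b] += float(y)

    ece = 0.0
    for k in range(n_bins):
        if counts[k] == 0:
            continue  # empty bins contribute nothing
        accuracy = hit_sums[k] / counts[k]
        mean_conf = conf_sums[k] / counts[k]
        ece += (counts[k] / n) * abs(accuracy - mean_conf)
    return ece
```

For example, four predictions made with confidence 0.9 of which three are correct fall into a single bin with accuracy 0.75, so the function returns a gap of roughly 0.15. Lower values indicate better calibration, which is why the captions mark ECE with a downward arrow.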
