Diffusion-Inspired Reconfiguration of Transformers for Uncertainty Calibration

Manh Cuong Dao, Quang Hung Pham, Phi Le Nguyen, Thao Nguyen Truong, Bryan Kian Hsiang Low, Trong Nghia Hoang

TL;DR

The paper tackles uncertainty calibration in large pretrained transformers by introducing a diffusion-inspired reconfiguration that treats the sequence of feature transformations as a probabilistic path. It distills this path into a unified reverse-time diffusion process with a single spatiotemporal transition kernel, training via a KL-based objective and a performance-guided loss to preserve accuracy. The proposed DIRECTOR framework reparameterizes transformer blocks into neuralized Gaussian transitions, enabling principled uncertainty propagation with improved calibration, robustness to distribution shifts, and better OOD detection, while achieving memory efficiency due to a smaller unified transition model. Empirical results across vision and language benchmarks (e.g., CIFAR-10/100, IMDB, CoLA, CIFAR-10-C) show state-of-the-art calibration and strong predictive performance, often outperforming GP-reparameterized baselines and vanilla transformers, with statistical significance in many settings. This diffusion-based reconfiguration offers a scalable, interpretable route to integrate probabilistic reasoning into foundation models, enhancing reliability in safety-critical applications.
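For reference, a KL-based objective between Gaussian transitions has a closed form. The expression below is the standard KL divergence between two diagonal Gaussians. The diagonal covariance, the dimension $d$, and the symbols $\mu_1, \sigma_1, \mu_2, \sigma_2$ are assumptions made for illustration; this summary does not state which pair of distributions the paper compares or how their covariances are parameterized.

```latex
\mathrm{KL}\!\left(\mathcal{N}(\mu_1, \operatorname{diag}(\sigma_1^2)) \,\|\, \mathcal{N}(\mu_2, \operatorname{diag}(\sigma_2^2))\right)
= \sum_{i=1}^{d} \left[ \log\frac{\sigma_{2,i}}{\sigma_{1,i}}
+ \frac{\sigma_{1,i}^2 + (\mu_{1,i} - \mu_{2,i})^2}{2\,\sigma_{2,i}^2} - \frac{1}{2} \right]
```

The divergence is zero exactly when the two means and the two variances coincide, which is what makes it usable as a matching loss between a block-wise transition and a unified one.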

Abstract

Uncertainty calibration in pre-trained transformers is critical for their reliable deployment in risk-sensitive applications. Yet, most existing pre-trained transformers do not have a principled mechanism for uncertainty propagation through their feature transformation stack. In this work, we propose a diffusion-inspired reconfiguration of transformers in which each feature transformation block is modeled as a probabilistic mapping. Composing these probabilistic mappings reveals a probability path that mimics the structure of a diffusion process, transporting data mass from the input distribution to the pre-trained feature distribution. This probability path can then be recompiled on a diffusion process with a unified transition model to enable principled propagation of representation uncertainty throughout the pre-trained model's architecture while maintaining its original predictive performance. Empirical results across a variety of vision and language benchmarks demonstrate that our method achieves superior calibration and predictive accuracy compared to existing uncertainty-aware transformers.
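The idea of composing per-block Gaussian mappings with one shared transition model can be illustrated with a toy sketch. Everything in it is hypothetical: the function names, the parameter layout (`W`, `u`, `log_sigma`), and the residual `tanh` form of the mean are illustrative choices, not the paper's architecture. The sketch only shows how sampling through a chain of Gaussian transitions, indexed in reverse time from $\mathbf{X}_T$ down to $\mathbf{X}_0$, yields a mean and a variance for the final features.

```python
import numpy as np

def shared_transition(x, t_frac, params):
    """Hypothetical unified kernel: mean and std of p(x_{t-1} | x_t, t).

    One set of parameters is reused at every step; the step index enters
    only through the scalar t_frac in (0, 1].
    """
    mean = x + np.tanh(x @ params["W"] + t_frac * params["u"])
    std = np.exp(params["log_sigma"]) * np.ones_like(x)
    return mean, std

def sample_path(x_T, num_steps, params, rng):
    """Draw one sample of x_0 by running the chain x_T -> ... -> x_0."""
    x = x_T
    for t in range(num_steps, 0, -1):
        mean, std = shared_transition(x, t / num_steps, params)
        x = mean + std * rng.standard_normal(x.shape)
    return x

def predictive_moments(x_T, num_steps, params, n_samples=64, seed=0):
    """Monte Carlo estimate of the mean and variance of the final features."""
    rng = np.random.default_rng(seed)
    samples = np.stack(
        [sample_path(x_T, num_steps, params, rng) for _ in range(n_samples)]
    )
    return samples.mean(axis=0), samples.var(axis=0)

if __name__ == "__main__":
    d = 3
    params = {"W": 0.1 * np.eye(d), "u": np.zeros(d), "log_sigma": -2.0}
    mean, var = predictive_moments(np.ones(d), num_steps=6, params=params)
    print("feature mean:", mean)
    print("feature variance:", var)
```

The per-feature variance returned by `predictive_moments` is the quantity a downstream classifier head could use to temper its confidence; how the paper actually consumes this uncertainty is not described in this summary.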

Paper Structure (30 sections, 34 equations, 4 figures, 18 tables)

This paper contains 30 sections, 34 equations, 4 figures, 18 tables.

Figures (4)

  • Figure 1: Comparison of accuracy (ACC$\uparrow$) and uncertainty calibration (ECE$\downarrow$) across pretrained models (ViT, Transformer), the GP-reparameterized method KEP [chen2024self] applied to either the last attention block (KEP-last) or all attention blocks (KEP-All), and our method (DIRECTOR) on the (a) CIFAR-10 and (b) CoLA datasets. Panel (c) compares the correlation between features at the first layer ($\mathbf{X}_6$) and those at deeper layers ($\mathbf{X}_4$, $\mathbf{X}_2$, and $\mathbf{X}_0$ at the last layer) for DIRECTOR and KEP-All on the CIFAR-10 dataset.
  • Figure 2: Restructuring a pre-trained transformer such that each block outputs a Gaussian distribution over its intermediate features, effectively aligning its architecture with a probabilistic path.
  • Figure 3: Calibration comparison of pre-trained models with their corresponding diffusion-based reconfigurations produced by DIRECTOR on CIFAR-10-C over 5 severity levels of corruption. The notation $\texttt{S-k}$ represents the severity level $k$. DIRECTOR achieves competitive accuracy and outperforms pre-trained models in most calibration metrics.
  • Figure 4: Comparison of training curves between KEP-7/7 and ViT models. Panels (a) and (b) show the loss curves, while (c) and (d) show the validation accuracy over training steps.
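
The captions above report ECE (expected calibration error) as the calibration metric. As a reference point, here is a minimal sketch of the common equal-width binned estimator: the weighted average, over confidence bins, of the gap between accuracy and mean confidence. The function name and the choice of 10 bins are assumptions; the binning scheme used in the paper is not stated in this summary.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width binned ECE.

    confidences: predicted probability of the chosen class, in [0, 1]
    correct:     1 if the prediction was right, 0 otherwise
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Bin b holds confidences in (edges[b], edges[b + 1]]; 0.0 falls in bin 0.
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return float(ece)

if __name__ == "__main__":
    # Four predictions at 90% confidence, three of them correct.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0]))
```

In the usage example, accuracy is 0.75 against a mean confidence of 0.9, so the model is overconfident by about 0.15; a perfectly calibrated model would score 0.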