Diffusion-Inspired Reconfiguration of Transformers for Uncertainty Calibration
Manh Cuong Dao, Quang Hung Pham, Phi Le Nguyen, Thao Nguyen Truong, Bryan Kian Hsiang Low, Trong Nghia Hoang
TL;DR
The paper tackles uncertainty calibration in large pre-trained transformers by introducing a diffusion-inspired reconfiguration that treats the sequence of feature transformations as a probability path. It distills this path into a unified reverse-time diffusion process with a single spatiotemporal transition kernel, trained with a KL-based objective plus a performance-guided loss that preserves accuracy. The proposed DIRECTOR framework reparameterizes transformer blocks as neuralized Gaussian transitions, which enables principled uncertainty propagation and yields improved calibration, robustness to distribution shift, and better OOD detection, while remaining memory-efficient thanks to a smaller unified transition model. Empirical results on vision and language benchmarks (e.g., CIFAR-10/100, IMDB, CoLA, CIFAR-10-C) show state-of-the-art calibration and strong predictive performance, often outperforming GP-reparameterized baselines and vanilla transformers, with statistically significant gains in many settings. This diffusion-based reconfiguration offers a scalable, interpretable route to integrating probabilistic reasoning into foundation models and improving their reliability in safety-critical applications.
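As a rough illustration of what a KL-based objective between Gaussian transitions can look like, the sketch below computes the closed-form KL divergence between two diagonal Gaussians. This is the generic textbook formula, not necessarily the paper's exact loss; the function name `kl_diag_gaussians` and the restriction to diagonal covariances are assumptions made for illustration.

```python
import numpy as np

def kl_diag_gaussians(mu_p, var_p, mu_q, var_q):
    """KL( N(mu_p, diag(var_p)) || N(mu_q, diag(var_q)) ), summed over dimensions.

    Hypothetical helper: diagonal covariances are an assumption of this sketch.
    """
    mu_p, var_p, mu_q, var_q = map(np.asarray, (mu_p, var_p, mu_q, var_q))
    return 0.5 * np.sum(
        np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0
    )

# Identical Gaussians have zero divergence.
kl_same = kl_diag_gaussians([0.0, 1.0], [1.0, 2.0], [0.0, 1.0], [1.0, 2.0])
# Unit-variance Gaussians whose means differ by 1 have KL = 0.5.
kl_shift = kl_diag_gaussians([0.0], [1.0], [1.0], [1.0])
```

In a training loop of this kind, such a term would typically be summed over the transitions being matched and combined with a task loss; how the paper weights the two is not stated in this summary.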
Abstract
Uncertainty calibration in pre-trained transformers is critical for their reliable deployment in risk-sensitive applications. Yet, most existing pre-trained transformers do not have a principled mechanism for uncertainty propagation through their feature transformation stack. In this work, we propose a diffusion-inspired reconfiguration of transformers in which each feature transformation block is modeled as a probabilistic mapping. Composing these probabilistic mappings reveals a probability path that mimics the structure of a diffusion process, transporting data mass from the input distribution to the pre-trained feature distribution. This probability path can then be recompiled on a diffusion process with a unified transition model to enable principled propagation of representation uncertainty throughout the pre-trained model's architecture while maintaining its original predictive performance. Empirical results across a variety of vision and language benchmarks demonstrate that our method achieves superior calibration and predictive accuracy compared to existing uncertainty-aware transformers.
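To make the idea of composing per-block probabilistic mappings more concrete, here is a minimal NumPy sketch; it is not the authors' implementation. It assumes that every block is replaced by one shared Gaussian transition whose mean depends on the current features and a normalized step index, and it propagates uncertainty by Monte Carlo sampling. The names `shared_transition` and `propagate`, the toy random weights, the additive time conditioning, and the fixed noise scale are all hypothetical choices made for illustration.

```python
import numpy as np

def shared_transition(h, t, W, b, log_sigma):
    """Hypothetical unified kernel: mean and std of p(h_next | h, t)."""
    # Time enters as a simple additive scalar (an assumption of this sketch).
    mean = h + np.tanh(h @ W + b + t)          # residual-style update
    std = np.exp(log_sigma) * np.ones_like(h)  # fixed isotropic noise scale
    return mean, std

def propagate(x, n_steps, params, n_samples, rng):
    """Sample feature trajectories through n_steps applications of one kernel."""
    W, b, log_sigma = params
    h = np.repeat(x[None, :], n_samples, axis=0)  # shape (n_samples, d)
    for k in range(n_steps):
        t = (k + 1) / n_steps                     # normalized step index
        mean, std = shared_transition(h, t, W, b, log_sigma)
        h = mean + std * rng.standard_normal(h.shape)
    # Empirical mean and variance of the final features across samples.
    return h.mean(axis=0), h.var(axis=0)

rng = np.random.default_rng(0)
d = 4
params = (0.1 * rng.standard_normal((d, d)), np.zeros(d), np.log(0.05))
x = rng.standard_normal(d)
feat_mean, feat_var = propagate(x, n_steps=6, params=params, n_samples=256, rng=rng)
```

The per-dimension variance `feat_var` is the kind of representation uncertainty that a downstream prediction head could consume. Reusing a single kernel across steps, rather than one set of weights per block, mirrors the unified transition model described above.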
