Revisiting Kernel Attention with Correlated Gaussian Process Representation

Long Minh Bui, Tho Tran Huu, Duy Dinh, Tan Minh Nguyen, Trong Nghia Hoang

TL;DR

This work tackles uncertainty calibration in transformer attention by removing the symmetry constraint imposed by prior GP-based methods. It introduces the Correlated Gaussian Process Transformer (CGPT), which models attention as the cross-covariance between two correlated GPs, enabling asymmetric kernels while preserving uncertainty quantification. A sparse version (SCGPT), based on the Deterministic Training Conditional approximation, is developed to scale to longer sequences, accompanied by a CGP regularization loss that trades off predictive uncertainty against task performance. Empirical results on image classification and linguistic acceptability show improved calibration, robustness to distribution shift, and favorable efficiency compared with state-of-the-art GP-based approaches.

Abstract

Transformers have increasingly become the de facto method to model sequential data with state-of-the-art performance. Because of their widespread use, being able to estimate and calibrate their modeling uncertainty is important for understanding and designing robust transformer models. To achieve this, previous works have used Gaussian processes (GPs) to perform uncertainty calibration for the attention units of transformers and attained notable success. However, such approaches have to confine the transformers to the space of symmetric attention to satisfy the symmetry requirement of their GP's kernel specification, which reduces the representation capacity of the model. To mitigate this restriction, we propose the Correlated Gaussian Process Transformer (CGPT), a new class of transformers whose self-attention units are modeled as the cross-covariance between two correlated GPs (CGPs). This allows asymmetries in attention and can enhance the representation capacity of GP-based transformers. We also derive a sparse approximation for CGP to make it scale better. Our empirical studies show that both CGP-based and sparse CGP-based transformers achieve better performance than state-of-the-art GP-based transformers on a variety of benchmark tasks. The code for our experiments is available at https://github.com/MinhLong210/CGP-Transformers.
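
The symmetry restriction mentioned in the abstract can be illustrated with a small sketch. It is not the CGPT construction itself and does not model any GP. It only contrasts a score matrix built from a single shared projection (symmetric by construction) with one built from separate query and key projections (generally asymmetric). The exponential dot-product kernel, the shared-projection setup, and all names (`exp_kernel_scores`, `w_q`, `w_k`) are assumptions made for illustration.

```python
import math
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def exp_kernel_scores(queries, keys):
    """Unnormalized scores exp(q_i . k_j / sqrt(d)) for every token pair."""
    d = len(queries[0])
    return [
        [math.exp(sum(q * k for q, k in zip(qi, kj)) / math.sqrt(d)) for kj in keys]
        for qi in queries
    ]


def is_symmetric(m, tol=1e-9):
    """True when m[i][j] equals m[j][i] up to tol for all i, j."""
    n = len(m)
    return all(abs(m[i][j] - m[j][i]) <= tol for i in range(n) for j in range(n))


def random_matrix(rng, rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n_tokens, d_model, d_head = 4, 6, 3

x = random_matrix(rng, n_tokens, d_model)    # toy token embeddings
w_q = random_matrix(rng, d_model, d_head)    # hypothetical query projection
w_k = random_matrix(rng, d_model, d_head)    # hypothetical key projection

q = matmul(x, w_q)
k = matmul(x, w_k)

# Shared projection: the same vectors play both roles, so the matrix is symmetric.
shared_scores = exp_kernel_scores(q, q)
# Separate projections: token i attending to j need not match j attending to i.
separate_scores = exp_kernel_scores(q, k)

print("shared projection symmetric:", is_symmetric(shared_scores))
print("separate projections symmetric:", is_symmetric(separate_scores))
```

A valid GP kernel matrix over a single set of inputs must be symmetric, which is why a single-GP view of attention fits only the first case. A cross-covariance between two different processes carries no such requirement.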

Paper Structure

This paper contains 33 sections, 64 equations, 4 figures, 5 tables, 1 algorithm.

Figures (4)

  • Figure I: Diagram of the training workflow of CGPT. Each attention block forwards the CGP's prediction to the next block and caches the prediction uncertainty into a CGP regularizing term (see Algorithm 1). Once the attention output is propagated to the last classification block, the original transformer loss is computed and augmented with the CGP regularizing term. Gradient propagation from this augmented loss will help optimize the CGP parameters to reduce prediction uncertainty while maximizing predictive performance.
  • Figure II: Runtime per training epoch (right) and GPU memory allocated during training (left) of SCGPT and SGPA on CIFAR10. SCGPT is more efficient than SGPA in terms of GPU memory usage while having a comparable runtime per epoch to SGPA.
  • Figure III: The cosine similarity between the token representations vs. the layer index of CGPT and SGPA on CIFAR10. CGPT is much less vulnerable to oversmoothing compared to SGPA.
  • Figure IV: The cosine similarity between the token representations after the attention calculation vs. the layer index of CGPT and SGPA on CIFAR100. CGPT is much less vulnerable to oversmoothing compared to SGPA.
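
The oversmoothing measure in Figures III and IV, cosine similarity between token representations, can be sketched as follows. The captions do not say how the similarities are aggregated, so averaging over all distinct token pairs is an assumption here, and the function names and toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(tokens):
    """Average cosine similarity over all distinct token pairs in one layer."""
    n = len(tokens)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(cosine(tokens[i], tokens[j]) for i, j in pairs) / len(pairs)


# Toy layer outputs: one with well-separated tokens, one nearly collapsed.
diverse_layer = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
collapsed_layer = [[1.0, 0.1], [1.0, 0.0], [1.0, -0.1]]

print("diverse:", mean_pairwise_cosine(diverse_layer))
print("collapsed:", mean_pairwise_cosine(collapsed_layer))
```

A value approaching 1 as the layer index grows indicates that token representations are becoming indistinguishable, which is the behavior the figures describe as oversmoothing.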

Theorems & Definitions (5)

  • Definition 1: Canonical Gaussian process (GP)
  • Remark 1: CGP-based Attention can be Asymmetric.
  • Remark 2: One CGP per Attention Dimension
  • Remark 3
  • Remark 4