Likelihood-guided Regularization in Attention Based Models

Mohamed Salem, Inyoung Kim

TL;DR

This work tackles the challenge of regularizing Vision Transformers (ViTs) to improve generalization and enable adaptive architecture search in data-limited regimes. It introduces likelihood-guided variational Ising-based regularization (Ising-ViT), combining Bayesian sparsification with an Ising prior to learn structured pruning patterns and uncertainty over weights and pruning decisions. The approach yields uncertainty-aware attention and calibrated probability estimates while maintaining competitive accuracy across MNIST, Fashion-MNIST, CIFAR-10, and CIFAR-100. Empirical results demonstrate that the Ising regularizer supports principled, data-driven sparsification, interpretable posterior masks, and robust uncertainty quantification without requiring post hoc pruning or heuristic stochasticity settings.

Abstract

The transformer architecture has demonstrated strong performance in classification tasks involving structured and high-dimensional data. However, its success often hinges on large-scale training data and careful regularization to prevent overfitting. In this paper, we introduce a novel likelihood-guided variational Ising-based regularization framework for Vision Transformers (ViTs), which simultaneously enhances model generalization and dynamically prunes redundant parameters. The proposed variational Ising-based regularization approach leverages Bayesian sparsification techniques to impose structured sparsity on model weights, allowing for adaptive architecture search during training. Unlike traditional dropout-based methods, which enforce fixed sparsity patterns, the variational Ising-based regularization method learns task-adaptive regularization, improving both efficiency and interpretability. We evaluate our approach on benchmark vision datasets, including MNIST, Fashion-MNIST, CIFAR-10, and CIFAR-100, demonstrating improved generalization under sparse, complex data and allowing for principled uncertainty quantification on both weights and selection parameters. Additionally, we show that the Ising regularizer leads to better-calibrated probability estimates and structured feature selection through uncertainty-aware attention mechanisms. Our results highlight the effectiveness of structured Bayesian sparsification in enhancing transformer-based architectures, offering a principled alternative to standard regularization techniques.

Paper Structure

This paper contains 13 sections, 2 theorems, 18 equations, 58 figures, 8 tables, 1 algorithm.

Key Result

Proposition 4.1

For any input $\mathbf{X}$, we have the following properties of our likelihood-guided regularization approach:

Figures (58)

  • Figure 1: The figure shows the components in the likelihood-guided regularization method. The selection variables $\xi$ determine which distribution in the spike-slab prior a weight, $w$, is sampled from.
  • Figure 2: The figure displays the backpropagation of masks alongside the backpropagation of weights in an alternating process illustrated on a simplified vision transformer.
  • Figure 3: Visualization of accuracy, recall, precision, F1-score, and false positive rate obtained using the MNIST dataset.
  • Figure 4: Visualization of accuracy, recall, precision, F1-score, and false positive rate obtained using the Fashion-MNIST dataset.
  • Figure 5: Visualization of accuracy, recall, precision, F1-score, and false positive rate obtained using the CIFAR-10 dataset.
  • ...and 53 more figures
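
The role of the selection variables $\xi$ described in the Figure 1 caption can be illustrated with a minimal sketch. It assumes a zero-mean Gaussian slab and a much narrower zero-mean Gaussian spike; the function name `sample_spike_slab` and the parameters `slab_std` and `spike_std` are hypothetical and not taken from the paper, whose exact prior and Ising coupling over $\xi$ may differ.

```python
import numpy as np

def sample_spike_slab(xi, slab_std=1.0, spike_std=1e-3, rng=None):
    """Draw one weight per selection variable.

    xi[i] == 1 -> weight i is drawn from the wide "slab" component.
    xi[i] == 0 -> weight i is drawn from the narrow "spike" component.
    Both components are zero-mean Gaussians here (an illustrative assumption).
    """
    rng = np.random.default_rng() if rng is None else rng
    xi = np.asarray(xi)
    # Pick the standard deviation element-wise from the selection variables.
    std = np.where(xi == 1, slab_std, spike_std)
    return rng.normal(loc=0.0, scale=std)

rng = np.random.default_rng(0)
xi = rng.integers(0, 2, size=8)   # illustrative binary selection variables
w = sample_spike_slab(xi, rng=rng)
print(xi)
print(w)
```

Weights with $\xi = 0$ stay close to zero and are effectively pruned, while weights with $\xi = 1$ remain free to take larger values. In this sketch the $\xi$ values are drawn independently for simplicity; an Ising prior would instead couple neighbouring selection variables.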

Theorems & Definitions (2)

  • Proposition 4.1
  • Proposition 4.2