Likelihood-guided Regularization in Attention Based Models
Mohamed Salem, Inyoung Kim
TL;DR
This work tackles the challenge of regularizing Vision Transformers (ViTs) to improve generalization and enable adaptive architecture search in data-limited regimes. It introduces likelihood-guided variational Ising-based regularization (Ising-ViT), combining Bayesian sparsification with an Ising prior to learn structured pruning patterns and uncertainty over weights and pruning decisions. The approach yields uncertainty-aware attention and calibrated probability estimates while maintaining competitive accuracy across MNIST, Fashion-MNIST, CIFAR-10, and CIFAR-100. Empirical results demonstrate that the Ising regularizer supports principled, data-driven sparsification, interpretable posterior masks, and robust uncertainty quantification without requiring post hoc pruning or heuristic stochasticity settings.
Abstract
The transformer architecture has demonstrated strong performance in classification tasks involving structured and high-dimensional data. However, its success often hinges on large-scale training data and careful regularization to prevent overfitting. In this paper, we introduce a novel likelihood-guided variational Ising-based regularization framework for Vision Transformers (ViTs), which simultaneously enhances model generalization and dynamically prunes redundant parameters. The proposed variational Ising-based regularization approach leverages Bayesian sparsification techniques to impose structured sparsity on model weights, allowing for adaptive architecture search during training. Unlike traditional dropout-based methods, which enforce fixed sparsity patterns, the variational Ising-based regularization method learns task-adaptive regularization, improving both efficiency and interpretability. We evaluate our approach on benchmark vision datasets, including MNIST, Fashion-MNIST, CIFAR-10, and CIFAR-100, demonstrating improved generalization under sparse, complex data and allowing for principled uncertainty quantification on both weights and selection parameters. Additionally, we show that the Ising regularizer leads to better-calibrated probability estimates and structured feature selection through uncertainty-aware attention mechanisms. Our results highlight the effectiveness of structured Bayesian sparsification in enhancing transformer-based architectures, offering a principled alternative to standard regularization techniques.
