Revisiting Transformers with Insights from Image Filtering and Boosting
Laziz U. Abdullaev, Maksim Tkachenko, Tan M. Nguyen
TL;DR
The paper develops a unifying image-processing framework that interprets Transformer self-attention as an image-filtering operation, connecting attention to bilateral filtering and nonlocal means, and shows how positional encoding and residual connections can be understood within this lens. It introduces Bilateral Self-Attention and a boosting residual scheme, and provides theoretical results on long-range stability and robustness, along with empirical gains across language and vision tasks. The framework explains why certain positional encodings help long-range modeling and argues for a more robust information flow through boosting-based residuals. Empirically, Bilateral Attention with boosting improves performance and adversarial robustness on WikiText-103, LRA benchmarks, ImageNet-1k, and ADE20k, while offering interpretable attention patterns. The work suggests a principled, interpretable direction for Transformer design, bridging classical image-processing techniques with modern sequence and vision models.
Abstract
The self-attention mechanism, a cornerstone of Transformer-based state-of-the-art deep learning architectures, is largely heuristic-driven and fundamentally challenging to interpret. Establishing a robust theoretical foundation to explain its remarkable success and limitations has therefore become an increasingly prominent focus in recent research. Some notable directions have explored understanding self-attention through the lens of image denoising and nonparametric regression. While promising, existing frameworks still lack a deeper mechanistic interpretation of the various architectural components that enhance self-attention, both in its original formulation and in subsequent variants. In this work, we aim to advance this understanding by developing a unifying image-processing framework capable of explaining not only the self-attention computation itself but also the role of components such as positional encoding and residual connections, including numerous later variants. Building on this framework, we also pinpoint potential distinctions between self-attention and image filtering, and make an effort to close this gap. We introduce two independent architectural modifications within Transformers. While our primary objective is interpretability, we empirically observe that image-processing-inspired modifications can also lead to notably improved accuracy and robustness against data contamination and adversaries across language and vision tasks, as well as better long-sequence understanding.
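The filtering view referred to above can be illustrated with a minimal NumPy sketch. It is an illustration under our own assumptions, not the paper's formulation: the function names (`softmax_attention`, `nonlocal_means`, `bilateral_filter`), the unscaled dot-product scores, and the kernel widths are all hypothetical choices. The sketch shows two things. First, when all keys have equal norm, softmax attention coincides with a Gaussian-kernel weighted average of the values, which is the form of a nonlocal-means filter. Second, the classical bilateral filter multiplies that feature kernel by a kernel on positions. The `bilateral_filter` function below is that classical filter applied to a token sequence; it should not be read as the Bilateral Self-Attention introduced in the paper, whose definition is not given in this section.

```python
import numpy as np


def softmax(scores):
    """Row-wise softmax with max-subtraction for numerical stability."""
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


def squared_distances(a, b):
    """Pairwise squared Euclidean distances between rows of a and rows of b."""
    diff = a[:, None, :] - b[None, :, :]
    return (diff ** 2).sum(axis=-1)


def softmax_attention(q, k, v):
    """Unscaled dot-product attention: softmax(q k^T) v."""
    return softmax(q @ k.T) @ v


def nonlocal_means(q, k, v, sigma=1.0):
    """Gaussian-kernel weighted average of v (nonlocal-means-style filter)."""
    return softmax(-squared_distances(q, k) / (2.0 * sigma ** 2)) @ v


def bilateral_filter(q, k, v, sigma_feat=1.0, sigma_pos=2.0):
    """Classical bilateral filter on a sequence: feature kernel times position kernel.

    Assumes q and k index the same sequence, so token i sits at position i.
    """
    q_pos = np.arange(q.shape[0], dtype=float)
    k_pos = np.arange(k.shape[0], dtype=float)
    pos_dist = (q_pos[:, None] - k_pos[None, :]) ** 2
    # Multiplying two Gaussian kernels is the same as adding their exponents.
    log_weights = (-squared_distances(q, k) / (2.0 * sigma_feat ** 2)
                   - pos_dist / (2.0 * sigma_pos ** 2))
    return softmax(log_weights) @ v


rng = np.random.default_rng(0)
q = rng.normal(size=(8, 4))
k = rng.normal(size=(8, 4))
k /= np.linalg.norm(k, axis=-1, keepdims=True)  # equal-norm keys (assumption)
v = rng.normal(size=(8, 3))

# -||q_i - k_j||^2 / 2 = q_i . k_j - ||q_i||^2 / 2 - ||k_j||^2 / 2.
# The ||q_i||^2 term is constant within a row and ||k_j||^2 is constant by
# construction, so the row-wise softmax gives the same weights in both cases.
print(np.allclose(softmax_attention(q, k, v), nonlocal_means(q, k, v)))
```

The position kernel is what separates the two filters in this sketch: as `sigma_pos` grows, `bilateral_filter` approaches `nonlocal_means`, and as it shrinks, each token attends almost only to itself.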
