Revisiting Transformers with Insights from Image Filtering and Boosting

Laziz U. Abdullaev, Maksim Tkachenko, Tan M. Nguyen

TL;DR

The paper develops a unifying image-processing framework that interprets Transformer self-attention as an image-filtering operation, connecting attention to bilateral filtering and nonlocal means, and shows how positional encoding and residual connections can be understood within this lens. It introduces Bilateral Self-Attention and a boosting residual scheme, and provides theoretical results on long-range stability and robustness, along with empirical gains across language and vision tasks. The framework explains why certain positional encodings help long-range modeling and argues for a more robust information flow through boosting-based residuals. Empirically, Bilateral Attention with boosting improves performance and adversarial robustness on WikiText-103, LRA benchmarks, ImageNet-1k, and ADE20k, while offering interpretable attention patterns. The work suggests a principled, interpretable direction for Transformer design, bridging classical image-processing techniques with modern sequence and vision models.
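As a point of reference for the filtering analogy, the following is a minimal sketch of a classical 1D bilateral filter. It is not the paper's Bilateral Self-Attention, whose definition is not given in this summary. The function name `bilateral_filter_1d` and the parameters `sigma_space` and `sigma_range` are illustrative assumptions. Each output is a normalized weighted average of the inputs, with weights that combine positional closeness and value similarity, which is the same "normalized exponential weights" form that softmax attention takes.

```python
import math

def bilateral_filter_1d(signal, sigma_space=2.0, sigma_range=0.5):
    """Classical 1D bilateral filter (illustrative sketch, not the paper's method).

    Each output is a normalized weighted average of all inputs. The weight of
    sample j for output i is the product of a spatial Gaussian on (i - j) and
    a range Gaussian on (signal[i] - signal[j]).
    """
    out = []
    for i, xi in enumerate(signal):
        weights = [
            math.exp(-((i - j) ** 2) / (2 * sigma_space ** 2))
            * math.exp(-((xi - xj) ** 2) / (2 * sigma_range ** 2))
            for j, xj in enumerate(signal)
        ]
        total = sum(weights)
        out.append(sum(w * xj for w, xj in zip(weights, signal)) / total)
    return out

# A noisy step edge: the range kernel keeps the two plateaus from mixing,
# so the edge is preserved while the noise within each plateau is averaged.
noisy_step = [0.1, -0.1, 0.05, 0.0, 5.1, 4.9, 5.05, 5.0]
smoothed = bilateral_filter_1d(noisy_step)
```

Dropping the spatial Gaussian leaves weights that depend only on value similarity, which is the nonlocal-means flavor of the same averaging scheme.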

Abstract

The self-attention mechanism, a cornerstone of Transformer-based state-of-the-art deep learning architectures, is largely heuristic-driven and fundamentally challenging to interpret. Establishing a robust theoretical foundation to explain its remarkable success and limitations has therefore become an increasingly prominent focus in recent research. Some notable directions have explored understanding self-attention through the lens of image denoising and nonparametric regression. While promising, existing frameworks still lack a deeper mechanistic interpretation of various architectural components that enhance self-attention, both in its original formulation and subsequent variants. In this work, we aim to advance this understanding by developing a unifying image processing framework, capable of explaining not only the self-attention computation itself but also the role of components such as positional encoding and residual connections, including numerous later variants. Building upon our framework, we also pinpoint potential distinctions between the two concepts and make an effort to close this gap. We introduce two independent architectural modifications within transformers. While our primary objective is interpretability, we empirically observe that image-processing-inspired modifications can also lead to notably improved accuracy and robustness against data contamination and adversaries across language and vision tasks, as well as better long-sequence understanding.

Paper Structure

This paper contains 26 sections, 9 theorems, 62 equations, 6 figures, 5 tables.

Key Result

Theorem 3.1

Consider a 1-layer transformer with a single attention head with identity projections $\bm W_M = \bm I$ for all $M \in \{Q,K,V\}$. Then, output vectors of its self-attention mechanism, as given by the self-attention equation, are weighted least squares estimates of clean patches $\bm{u}_i$, given noisy patch
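The theorem statement is cut off in this summary, so the sketch below illustrates only the general fact behind it, under stated assumptions. With identity projections, each attention output is a softmax-weighted average of the input tokens, and a weighted average with weights $w_{ij}$ that sum to one is the minimizer of $\sum_j w_{ij} \|\bm x_j - \bm u\|^2$ over $\bm u$. The $1/\sqrt{d}$ scaling, the function names, and the toy tokens are assumptions for illustration; the exact objective and noise model used in the paper are not reproduced here.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_identity(tokens):
    """Single-head self-attention with W_Q = W_K = W_V = I.

    Assumes the usual 1/sqrt(d) scaling of the dot-product logits.
    Returns the output vectors and the attention weights per query.
    """
    d = len(tokens[0])
    outputs, weights = [], []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        w = softmax(scores)
        outputs.append([sum(wj * x[c] for wj, x in zip(w, tokens)) for c in range(d)])
        weights.append(w)
    return outputs, weights

def wls_objective(u, tokens, w):
    """Weighted least squares cost: sum_j w_j * ||x_j - u||^2."""
    return sum(
        wj * sum((xc - uc) ** 2 for xc, uc in zip(x, u))
        for wj, x in zip(w, tokens)
    )

# Hypothetical toy tokens in R^2.
tokens = [[1.0, 0.0], [0.8, 0.3], [-0.5, 1.2], [0.1, -0.9]]
outputs, weights = attention_identity(tokens)

# The attention output for the first token minimizes the weighted cost:
# any small shift away from it strictly increases the objective.
u0, w0 = outputs[0], weights[0]
best = wls_objective(u0, tokens, w0)
for delta in ([0.05, 0.0], [0.0, -0.05], [-0.03, 0.04]):
    shifted = [a + b for a, b in zip(u0, delta)]
    assert wls_objective(shifted, tokens, w0) > best
```

Because the weights sum to one, shifting the output by $\bm\delta$ raises the cost by exactly $\|\bm\delta\|^2$, which is why every perturbation in the loop increases the objective.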

Figures (6)

  • Figure 1: Left to right: token-to-token, token-to-position, position-to-token, and position-to-position correlation logit matrices (Ke et al., 2021).
  • Figure 1: LRA (Tay et al., 2020) learning curves of the baseline (Vaswani et al., 2017) and bilateral attention mechanisms for the Image (1K) and Text (4K) datasets. The figures show that vanilla self-attention either learns substantially more slowly than its bilateral counterpart or struggles to learn useful long-context relations.
  • Figure 2: Left: blue circles indicate the estimated tight Lipschitz constant of $\mathrm{softmax}$, as in the remark on the Lipschitz constant, while the red curve is the best-fit curve of the form $\frac{a}{\sqrt{N}}+b$. Right: empirical estimate of the expectation in the softmax-instability theorem, using a random source vector $\bm c \in \mathbb{R}^N$ with $\eta_i \sim \mathcal{N}(0, 1)$, for input dimensionality $10^2 \le N \le 10^4$ and $10^3$ random samples. The perturbation magnitude of the attention scores does not necessarily vanish for large $N$, supporting the claim of that theorem.
  • Figure 2: Object classification on ImageNet-1k under FGSM and PGD attacks with increasing perturbation budgets (severity level $\times$ 1/255). The figures show that models that employ boosting by GRC are consistently more robust than their counterparts without GRC. The vertical axis is in log scale.
  • Figure 3: Left to right: input, DeiT (Touvron et al., 2021), and DeiT-Bilateral (ours). Attention heatmaps are averaged over attention heads for both DeiT and DeiT-Bilateral. The heatmaps produced by DeiT-Bilateral capture extended meaningful regions of the input samples.
  • ...and 1 more figure

Theorems & Definitions (18)

  • Definition 2.1: Inequality up to a Constant Factor
  • Definition 2.2: Asymptotic Equivalence
  • Theorem 3.1: $1$-layer Transformer
  • Remark 3.1
  • Definition 3.1
  • Proposition 3.1: Residual Connections Boost SNR
  • Remark 3.2
  • Remark 3.3
  • Corollary 3.1: $n$-layer Transformer
  • Proposition 3.2
  • ...and 8 more