Interpreting and Improving Attention From the Perspective of Large Kernel Convolution

Chenghao Li; Chaoning Zhang; Boheng Zeng; Yi Lu; Pengbo Shi; Qingzi Chen; Jirui Liu; Lingyun Zhu; Yang Yang; Heng Tao Shen

TL;DR

The paper tackles data- and resource-constrained visual modeling by reframing attention as a single large-kernel convolution, thereby combining CNN locality with ViT-style global context. The proposed LKCA module replaces MHSA with a large kernel and introduces a shared-weight positional mechanism, enabling efficient, parameter-shared attention that preserves spatial inductive biases. Empirical results across CIFAR-10/100, SVHN, Tiny-ImageNet, and ADE20K demonstrate consistent improvements over standard ViT baselines and competitive performance with fewer parameters, particularly in small- to mid-sized models. This approach offers a practical, robust solution for real-world scenarios with limited data and compute, bridging the gap between CNNs and ViTs for both classification and segmentation tasks.

Abstract

Attention mechanisms have significantly advanced visual models by capturing global context effectively. However, their reliance on large-scale datasets and substantial computational resources poses challenges in data-scarce and resource-constrained scenarios. Moreover, traditional self-attention mechanisms lack inherent spatial inductive biases, making them suboptimal for modeling the local features that are critical to tasks involving smaller datasets. In this work, we introduce Large Kernel Convolutional Attention (LKCA), a novel formulation that reinterprets attention operations as a single large-kernel convolution. This design unifies the strengths of convolutional architectures (locality and translation invariance) with the global context modeling capabilities of self-attention. By embedding these properties into a computationally efficient framework, LKCA addresses key limitations of traditional attention mechanisms. The proposed LKCA achieves competitive performance across various visual tasks, particularly in data-constrained settings. Experimental results on CIFAR-10, CIFAR-100, SVHN, and Tiny-ImageNet demonstrate its ability to excel in image classification, outperforming conventional attention mechanisms and vision transformers in compact model settings. These findings highlight the effectiveness of LKCA in bridging local and global feature modeling, offering a practical and robust solution for real-world applications with limited data and resources.
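To make the "attention as a single large-kernel convolution" idea concrete, the following is a minimal 1D sketch rather than the paper's formulation. It assumes a zero-padded, stride-1 correlation with a kernel of length 2n-1 over n tokens, and shows that this is the same as multiplying by an n x n mixing matrix whose entry (i, j) depends only on the relative offset j - i. The function names `conv_as_mixing_matrix`, `large_kernel_correlation`, and `mix` are hypothetical and chosen only for this illustration.

```python
def conv_as_mixing_matrix(kernel, n):
    """Build the n x n matrix M with M[i][j] = kernel[(j - i) + (n - 1)].

    `kernel` has length 2n - 1 and is indexed by the relative offset
    j - i, which ranges over [-(n - 1), n - 1]. Every row reuses the
    same weights, shifted by one position.
    """
    assert len(kernel) == 2 * n - 1
    return [[kernel[(j - i) + (n - 1)] for j in range(n)] for i in range(n)]


def large_kernel_correlation(x, kernel):
    """Zero-padded, stride-1 correlation of x with a kernel of length 2n - 1."""
    n = len(x)
    out = []
    for i in range(n):
        total = 0.0
        for offset in range(-(n - 1), n):
            j = i + offset
            if 0 <= j < n:  # positions outside the sequence are zero padding
                total += kernel[offset + (n - 1)] * x[j]
        out.append(total)
    return out


def mix(matrix, x):
    """Apply a token-mixing matrix to x, as an attention map would be applied."""
    return [sum(w * v for w, v in zip(row, x)) for row in matrix]


x = [1.0, 2.0, 3.0, 4.0]
kernel = [0.1, 0.2, 0.3, 0.4, 0.3, 0.2, 0.1]  # length 2 * 4 - 1
M = conv_as_mixing_matrix(kernel, len(x))

conv_out = large_kernel_correlation(x, kernel)
mix_out = mix(M, x)
print(all(abs(a - b) < 1e-9 for a, b in zip(conv_out, mix_out)))
```

In this toy setting the mixing matrix is global (every output sees every input) but, unlike softmax attention, its weights are fixed and shared by relative position instead of being computed from the input.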

Paper Structure

This paper contains 19 sections, 2 equations, 2 figures, 7 tables, and 2 algorithms.

Introduction
Related Works
Attention Mechanisms in Visual Models
ConvNets With Large Kernel
ViTs with Large ERF
Approach
Preliminary
Review of Large Kernel Convolution
Comparison of LKC and MHSA
Large Kernel Convolutional Attention (LKCA)
Shared Weight Position Operation
From the Perspective of Convolution: Implementing LKCA
Overall Architecture
Experiments
Experimental Setup
...and 4 more sections

Figures (2)

Figure 1: Two views to interpret LKCA. The Large Kernel Convolutional Attention (LKCA) can be understood from the perspective of convolution (left) or attention (right). The two views are equivalent in effect.
Figure 2: Participation differences in kernel convolutions. Illustration of the difference between small kernel convolution and large kernel convolution by constructing a 5x5 feature map, a 3x3 convolution kernel smaller than the feature map, and a 7x7 convolution kernel larger than the feature map. The distinction between small kernel convolution and large kernel convolution lies in the fact that, in small kernel convolution, all parameters of the kernel are involved in each correlation operation, while only a subset of feature map parameters participates in the computation. In the case of large kernel convolution, during each correlation operation, all parameters of the feature map are involved, and only a subset of the convolution kernel parameters participates in the computation.
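The participation difference described in Figure 2 can be checked with a small counting sketch. It assumes an odd, square kernel centered on a given cell of a square feature map with zero padding and stride 1; the helper name `overlap_at` is hypothetical and not taken from the paper. It counts how many kernel weights land on real (non-padding) feature cells for one correlation step.

```python
def overlap_at(feature_size, kernel_size, center):
    """Count feature cells covered when the kernel is centered on `center`.

    With stride 1, each covered cell is paired with exactly one kernel
    weight, so the same count is also the number of kernel weights used.
    """
    radius = kernel_size // 2
    row, col = center
    rows = sum(1 for r in range(row - radius, row + radius + 1)
               if 0 <= r < feature_size)
    cols = sum(1 for c in range(col - radius, col + radius + 1)
               if 0 <= c < feature_size)
    return rows * cols


FEATURE = 5  # 5x5 feature map, as in the figure
for kernel in (3, 7):
    n = overlap_at(FEATURE, kernel, center=(2, 2))
    print(f"{kernel}x{kernel} kernel at the center: "
          f"{n}/{kernel * kernel} kernel weights used, "
          f"{n}/{FEATURE * FEATURE} feature cells used")
```

At the center cell, the 3x3 kernel uses all 9 of its weights but touches only 9 of the 25 feature cells, while the 7x7 kernel touches all 25 feature cells using only 25 of its 49 weights. Under this particular centering and padding assumption, a 7x7 kernel placed at a corner cell covers 16 cells, and a kernel of size 2 * 5 - 1 = 9 would be needed to cover all 25 cells from every position; whether the paper uses that convention is not stated in the caption.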
