Partial Convolution Meets Visual Attention
Haiduo Huang, Fuwei Yang, Dong Li, Ji Liu, Lu Tian, Jinzhang Peng, Pengju Ren, Emad Barsoum
TL;DR
This work introduces Partial Visual Attention (PAT), a framework that fuses visual attention with partial convolution to overcome the accuracy limitations of existing partial convolutions while preserving their speed. It presents three efficient attention blocks, PAT_ch (channel attention), PAT_sp (spatial attention), and PAT_sf (self-attention), and a four-stage PATNet architecture built from these blocks, including a last-stage self-attention enhancement with relative position encoding. Empirically, PATNet outperforms FasterNet on ImageNet-1K across model sizes, with higher throughput and lower latency, and it improves COCO detection and segmentation when used as a backbone. The results indicate that partial attention can replace full attention with a favorable speed-accuracy trade-off, suggesting potential for accelerating other vision models and possibly NLP models and LLMs in the future.
Abstract
Designing an efficient and effective neural network has remained a prominent topic in computer vision research. Depthwise convolution (DWConv) is widely used in efficient CNNs and ViTs, but it requires frequent memory access during inference, which leads to low throughput. FasterNet introduces partial convolution (PConv) as an alternative to DWConv but compromises accuracy because of underutilized channels. To remedy this shortcoming and to account for the redundancy between feature map channels, we introduce a novel Partial visual ATtention mechanism (PAT) that efficiently combines PConv with visual attention. Our exploration indicates that the partial attention mechanism can completely replace the full attention mechanism while reducing model parameters and FLOPs. PAT yields three types of blocks: the Partial Channel-Attention block (PAT_ch), the Partial Spatial-Attention block (PAT_sp), and the Partial Self-Attention block (PAT_sf). First, PAT_ch integrates an enhanced Gaussian channel attention mechanism to infuse global distribution information into the untouched channels of PConv. Second, we introduce spatial-wise attention to the MLP layer to further improve model accuracy. Finally, we replace PAT_ch in the last stage with the self-attention mechanism to extend the global receptive field. Building upon PAT, we propose a novel hybrid network family, named PATNet, which achieves superior top-1 accuracy and inference speed compared to FasterNet on ImageNet-1K classification and excels in both detection and segmentation on the COCO dataset. In particular, our PATNet-T2 achieves 1.3% higher accuracy than FasterNet-T2, while exhibiting 25% higher GPU throughput and 24% lower CPU latency.
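To make the cost argument concrete, the following is a minimal sketch of multiply-accumulate counts for a dense convolution, a depthwise convolution, and a partial convolution that convolves only a leading slice of the channels. The function names, the stride-1 "same"-padding cost model, and the split ratio of 1/4 are illustrative assumptions, not values taken from the abstract. The sketch counts arithmetic only; it does not model the memory access that the abstract identifies as the throughput bottleneck of DWConv.

```python
# Hypothetical cost model; all names and the 1/4 split ratio are assumptions.

def conv_flops(h, w, k, c_in, c_out):
    """Multiply-accumulates of a dense k x k conv (stride 1, 'same' padding)."""
    return h * w * k * k * c_in * c_out


def dwconv_flops(h, w, k, c):
    """Depthwise conv: one k x k filter per channel, no cross-channel mixing."""
    return h * w * k * k * c


def partial_split(channels, ratio=0.25):
    """Split a channel list into the convolved part and the untouched part."""
    c_p = int(len(channels) * ratio)
    return channels[:c_p], channels[c_p:]


def pconv_flops(h, w, k, c, ratio=0.25):
    """Partial conv: dense conv on the first ratio * c channels only.

    The remaining channels pass through untouched; these are the channels
    that a partial attention block could operate on instead.
    """
    c_p = int(c * ratio)
    return conv_flops(h, w, k, c_p, c_p)


if __name__ == "__main__":
    h, w, k, c = 56, 56, 3, 64
    print("dense conv  :", conv_flops(h, w, k, c, c))
    print("depthwise   :", dwconv_flops(h, w, k, c))
    print("partial conv:", pconv_flops(h, w, k, c))
    touched, untouched = partial_split(list(range(c)))
    print("convolved channels:", len(touched), "untouched:", len(untouched))
```

Under these assumptions, a 1/4 split puts the partial convolution at 1/16 of the dense convolution's arithmetic, while three quarters of the channels are left untouched.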
