ReGLA: Efficient Receptive-Field Modeling with Gated Linear Attention Network

Junzhou Li, Manqi Zhao, Yilin Gao, Zhiheng Yu, Yin Li, Dongsheng Jiang, Li Xiao

TL;DR

ReGLA addresses the challenge of achieving high accuracy on high-resolution images under strict latency constraints by coupling efficient local feature extraction with a softmax-free, linear attention mechanism. The method introduces ELRF to enlarge receptive fields with depthwise convolutions, RGMA to provide memory-efficient global modeling, and a multi-teacher distillation framework to boost downstream performance. Empirical results on ImageNet, COCO, and ADE20K demonstrate state-of-the-art accuracy-latency trade-offs and strong transferability, especially after distillation. The work highlights the practicality of co-designing local operators with attention for edge deployments and shows that diverse teacher signals can substantially enhance dense-prediction and detection tasks.
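
To make the efficiency argument for depthwise convolutions concrete, the sketch below compares the weight count of a standard convolution with that of a depthwise convolution of the same kernel size. The channel count and kernel size are illustrative assumptions, not values reported for ELRF, and the helper names are hypothetical.

```python
def conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a standard k x k convolution (bias omitted)."""
    return c_in * c_out * k * k


def depthwise_conv_params(channels: int, k: int) -> int:
    """Weights in a depthwise k x k convolution: one k x k filter per channel."""
    return channels * k * k


# Hypothetical layer sizes, chosen only for illustration.
channels, kernel = 64, 7
standard = conv_params(channels, channels, kernel)    # 64 * 64 * 49 = 200704
depthwise = depthwise_conv_params(channels, kernel)   # 64 * 49 = 3136
ratio = standard // depthwise                         # 64, i.e. the channel count
```

Because the saving equals the channel count, a depthwise layer can afford a larger kernel, and hence a larger receptive field, at a small fraction of the weight budget of a standard convolution.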

Abstract

Balancing accuracy and latency on high-resolution images is a critical challenge for lightweight models, particularly for Transformer-based architectures, which often suffer from excessive latency. To address this issue, we introduce ReGLA, a series of lightweight hybrid networks that integrate efficient convolutions for local feature extraction with ReLU-based gated linear attention for global modeling. The design incorporates three key innovations: the Efficient Large Receptive Field (ELRF) module, which improves convolutional efficiency while preserving a large receptive field; the ReLU Gated Modulated Attention (RGMA) module, which maintains linear complexity while enhancing local feature representation; and a multi-teacher distillation strategy that boosts performance on downstream tasks. Extensive experiments validate the superiority of ReGLA. In particular, ReGLA-M achieves 80.85% Top-1 accuracy on ImageNet-1K at 224 px, with only 4.98 ms latency at 512 px. Furthermore, ReGLA outperforms similarly scaled iFormer models on downstream tasks, with gains of 3.1% AP on COCO object detection and 3.6% mIoU on ADE20K semantic segmentation, establishing it as a state-of-the-art solution for high-resolution visual applications.
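
The linear-complexity claim for ReLU-based attention can be illustrated with a generic sketch. It assumes the common formulation in which ReLU replaces softmax as the feature map, so the key-value product can be computed once and reused for every query. The gating and modulation parts of RGMA are not described in this summary and are omitted; the function names and toy inputs are hypothetical, and plain Python lists are used only to keep the example dependency-free.

```python
from typing import List

Matrix = List[List[float]]


def relu(x: Matrix) -> Matrix:
    return [[max(0.0, v) for v in row] for row in x]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def quadratic_relu_attention(q: Matrix, k: Matrix, v: Matrix, eps: float = 1e-6) -> Matrix:
    """Reference form: materializes the full N x N score matrix."""
    scores = matmul(relu(q), transpose(relu(k)))           # N x N
    out = matmul(scores, v)                                # N x d_v
    return [[o / (sum(s) + eps) for o in o_row] for o_row, s in zip(out, scores)]


def linear_relu_attention(q: Matrix, k: Matrix, v: Matrix, eps: float = 1e-6) -> Matrix:
    """Linear form: summarizes keys and values once, never builds N x N scores."""
    q_pos, k_pos = relu(q), relu(k)
    kv = matmul(transpose(k_pos), v)                       # d x d_v summary
    k_sum = [sum(col) for col in zip(*k_pos)]              # length-d normalizer
    out = matmul(q_pos, kv)                                # N x d_v
    norms = [sum(a * b for a, b in zip(row, k_sum)) + eps for row in q_pos]
    return [[o / n for o in o_row] for o_row, n in zip(out, norms)]


# Toy inputs: 3 tokens, feature dimension 2 (illustrative values only).
q = [[1.0, -0.5], [0.2, 0.3], [-1.0, 2.0]]
k = [[0.5, 1.0], [-0.3, 0.8], [1.5, -2.0]]
v = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

reference = quadratic_relu_attention(q, k, v)
linear = linear_relu_attention(q, k, v)
max_diff = max(abs(a - b) for ra, rb in zip(reference, linear) for a, b in zip(ra, rb))
```

Both functions compute the same output up to floating-point rounding. The linear form costs on the order of N·d·d_v operations instead of N²·d, which is why the savings grow with image resolution, where the token count N is large.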

Paper Structure (21 sections, 6 equations, 4 figures, 8 tables)

Figures (4)

  • Figure 1: Comparison of Top-1 accuracy (%) against parameter count (M) across various lightweight vision models. ReGLA consistently achieves competitive accuracy at similar parameter counts.
  • Figure 2: The architecture of ReGLA. Stages 1 and 2 use ELRF to extract local information; Stages 3 and 4 use RGMA to model global information. DWConv denotes depthwise convolution. The CPE module is a depthwise convolution with a residual connection. $L_i$ and $C_i$ $(i=1,2,3,4)$ denote the number of blocks and the number of channels in stage $i$, respectively.
  • Figure 3: Top-1 accuracy progression during distillation, measured via KNN (n=20) classification, shows diminishing improvements after the 18th epoch and only marginal gains up to the 30th. Training for 30 epochs strikes an optimal balance between computational cost and performance.
  • Figure 4: Performance comparison across model sizes on multiple benchmarks. Left: Top-1 accuracy (%) measured using KNN (n=20) classification on ImageNet-1K and CIFAR-100 increases steadily with model size. Right: Mean Intersection over Union (mIoU) on ADE20K also rises with model size, demonstrating strong scalability and generalization across tasks.
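
The KNN (n=20) Top-1 accuracy reported in Figures 3 and 4 can be understood through a minimal sketch of nearest-neighbor evaluation on frozen features. The cosine similarity metric and unweighted majority vote are assumptions, since the captions do not specify them, and the function names and toy features are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def knn_predict(query: Sequence[float], train_feats: List[Sequence[float]],
                train_labels: List[int], n_neighbors: int = 20) -> int:
    """Label a query by majority vote among its most similar training features."""
    ranked = sorted(range(len(train_feats)),
                    key=lambda i: cosine(query, train_feats[i]), reverse=True)
    votes = Counter(train_labels[i] for i in ranked[:n_neighbors])
    return votes.most_common(1)[0][0]


def knn_top1_accuracy(test_feats: List[Sequence[float]], test_labels: List[int],
                      train_feats: List[Sequence[float]], train_labels: List[int],
                      n_neighbors: int = 20) -> float:
    """Top-1 accuracy (%) of KNN classification on precomputed features."""
    correct = sum(knn_predict(f, train_feats, train_labels, n_neighbors) == y
                  for f, y in zip(test_feats, test_labels))
    return 100.0 * correct / len(test_labels)


# Toy 2-D features standing in for backbone embeddings (illustrative only).
train_feats = [[1.0, 0.1], [0.9, 0.0], [1.0, 0.2], [0.1, 1.0], [0.0, 0.9], [0.2, 1.0]]
train_labels = [0, 0, 0, 1, 1, 1]
test_feats = [[0.8, 0.1], [0.1, 0.7]]
test_labels = [0, 1]

# n_neighbors=3 because the toy set is tiny; the figures use n=20.
accuracy = knn_top1_accuracy(test_feats, test_labels, train_feats, train_labels, n_neighbors=3)
```

Because no classifier head is trained, this kind of evaluation gives a cheap per-epoch signal of feature quality, which suits tracking progress during distillation.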