Enhancing Layer Attention Efficiency through Pruning Redundant Retrievals

Hanze Li, Xiande Huang

TL;DR

The paper tackles redundancy in layer attention: adjacent layers learn highly similar attention weights, which wastes computation and limits feature diversity. It introduces Efficient Layer Attention (ELA), which uses the KL divergence between adjacent layers to quantify redundancy and employs Enhanced Beta Quantile Mapping (EBQM) to stabilize pruning, yielding a dynamic layer-skipping mechanism. Empirical results on image classification and object detection show that ELA improves accuracy and reduces training time by about 30–35% compared with prior MRLA-based methods, while maintaining competitive parameter efficiency. The approach offers a practical, scalable route to more efficient layer interaction in deep networks, and ablations indicate that EBQM is more stable than, and outperforms, the alternative quantile-mapping and distribution choices tested.

Abstract

Growing evidence suggests that layer attention mechanisms, which enhance interaction among layers in deep neural networks, have significantly advanced network architectures. However, existing layer attention methods suffer from redundancy, as attention weights learned by adjacent layers often become highly similar. This redundancy causes multiple layers to extract nearly identical features, reducing the model's representational capacity and increasing training time. To address this issue, we propose a novel approach to quantify redundancy by leveraging the Kullback-Leibler (KL) divergence between adjacent layers. Additionally, we introduce an Enhanced Beta Quantile Mapping (EBQM) method that accurately identifies and skips redundant layers, thereby maintaining model stability. Our proposed Efficient Layer Attention (ELA) architecture improves both training efficiency and overall performance, achieving a 30% reduction in training time while enhancing performance in tasks such as image classification and object detection.
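
To make the redundancy measure concrete, the following is a minimal sketch of computing the KL divergence between the attention weights of adjacent layers and flagging layers whose divergence is small. It is an illustration under stated assumptions, not the paper's implementation: the function names, the direction of the divergence (layer $t$ against layer $t-1$), the requirement that both weight vectors have the same length, and the fixed `threshold` are all hypothetical. In particular, the fixed threshold is a simple stand-in for illustration and does not reproduce EBQM.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length.

    Values are clipped to eps and renormalized so the logarithm is defined.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    p = [max(x, eps) for x in p]
    q = [max(x, eps) for x in q]
    sp, sq = sum(p), sum(q)
    return sum((a / sp) * math.log((a / sp) / (b / sq)) for a, b in zip(p, q))


def adjacent_kl(attn_weights):
    """KL divergence of each layer's attention weights against the previous layer's.

    attn_weights[t] is assumed (hypothetically) to be a normalized weight
    vector for layer t, with all layers sharing the same length.
    """
    return [
        kl_divergence(attn_weights[t], attn_weights[t - 1])
        for t in range(1, len(attn_weights))
    ]


def flag_redundant(kl_values, threshold):
    """Mark a layer as redundant when its KL to the previous layer is below threshold.

    A fixed threshold is an assumption made for this sketch only.
    """
    return [kl < threshold for kl in kl_values]


# Toy example: layer 1 nearly repeats layer 0, layer 2 attends elsewhere.
weights = [
    [0.70, 0.20, 0.10],
    [0.69, 0.21, 0.10],
    [0.10, 0.30, 0.60],
]
kls = adjacent_kl(weights)
print(flag_redundant(kls, threshold=0.01))  # [True, False]
```

In the toy example, the second layer's weights are almost identical to the first layer's, so its divergence is close to zero and it is flagged, while the third layer's weights differ substantially and it is kept.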

Paper Structure

This paper contains 19 sections, 17 equations, 5 figures, 6 tables, 2 algorithms.

Figures (5)

  • Figure 1: Visualization of attention outputs from six consecutive layers on ResNet-101. The value below each attention output ($o_t$) is the KL divergence between the attention weights of the $t$-th and $(t-1)$-th layers.
  • Figure 2: Comparison of per epoch training time (in seconds) for different layer interaction models with varying depths under identical training conditions.
  • Figure 3:
  • Figure 4: Visualization of attention scores at different stages of layer attention on the ResNet-56 backbone for CIFAR-100.
  • Figure 5: Visualization of attention scores at different stages of layer attention on the ResNet-50 backbone for ImageNet-1K.