Frequency-Dynamic Attention Modulation for Dense Prediction

Linwei Chen, Lin Gu, Ying Fu

TL;DR

Vision Transformers suffer from frequency vanishing due to attention acting as a low-pass filter, which degrades fine-grained details in dense prediction. The authors propose Frequency-Dynamic Attention Modulation (FDAM), a circuit-inspired, plug-in framework consisting of Attention Inversion (AttInv) and Frequency Dynamic Scaling (FreqScale) to create a learnable, dynamic frequency response for ViTs. AttInv inverts low-pass attention per location to yield complementary high-pass components, enabling $2^L$ possible filter combinations across $L$ layers, while FreqScale re-weights spectral bands via an MLP-generated dynamic kernel to fine-tune the response. Across segmentation, detection, instance and panoptic tasks, and remote sensing detection, FDAM delivers consistent gains with minimal overhead, mitigates representation collapse, and provides new spectral insights into self-attention behavior, establishing a practical approach to frequency-aware transformers.
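
The spectral re-weighting idea behind FreqScale can be illustrated with a minimal NumPy sketch. It is a simplified stand-in, not the authors' implementation: the weight map here is a fixed, hand-built radial ramp (the hypothetical helper `radial_high_boost`), whereas the summary above describes a dynamic kernel generated by an MLP. The function names, the `gain` parameter, and the `(C, H, W)` layout are assumptions made for illustration.

```python
import numpy as np

def freq_scale(feat, weight):
    """Re-weight the 2D spectrum of a real feature map.

    feat:   (C, H, W) real-valued features.
    weight: (C, H, W) real per-frequency scales in unshifted FFT layout
            (index [0, 0] is the DC component).
    """
    spec = np.fft.fft2(feat, axes=(-2, -1))
    # A real, frequency-symmetric weight keeps the inverse transform real;
    # .real only discards floating-point residue in the imaginary part.
    return np.fft.ifft2(spec * weight, axes=(-2, -1)).real

def radial_high_boost(c, h, w, gain):
    """Hypothetical static weight: 1 at DC, rising linearly to 1 + gain
    at the highest spatial frequency. Stands in for a learned kernel."""
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fy ** 2 + fx ** 2)
    weight = 1.0 + gain * radius / radius.max()
    return np.broadcast_to(weight, (c, h, w))

rng = np.random.default_rng(0)
feat = rng.normal(size=(4, 16, 16))

identity = freq_scale(feat, np.ones_like(feat))           # all-ones weight: no change
boosted = freq_scale(feat, radial_high_boost(4, 16, 16, gain=1.0))
```

With an all-ones weight the operation is the identity. With the radial ramp, the per-channel mean (the DC term) is unchanged while the total energy of the non-DC content grows, which is the kind of high-frequency enhancement the summary attributes to FreqScale.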

Abstract

Vision Transformers (ViTs) have significantly advanced computer vision, demonstrating strong performance across various tasks. However, the attention mechanism in ViTs makes each layer function as a low-pass filter, and the stacked-layer architecture in existing transformers suffers from frequency vanishing. This leads to the loss of critical details and textures. We propose a novel, circuit-theory-inspired strategy called Frequency-Dynamic Attention Modulation (FDAM), which can be easily plugged into ViTs. FDAM directly modulates the overall frequency response of ViTs and consists of two techniques: Attention Inversion (AttInv) and Frequency Dynamic Scaling (FreqScale). Since circuit theory uses low-pass filters as fundamental elements, we introduce AttInv, a method that generates complementary high-pass filtering by inverting the low-pass filter in the attention matrix, and dynamically combining the two. We further design FreqScale to weight different frequency components for fine-grained adjustments to the target response function. Through feature similarity analysis and effective rank evaluation, we demonstrate that our approach avoids representation collapse, leading to consistent performance improvements across various models, including SegFormer, DeiT, and MaskDINO. These improvements are evident in tasks such as semantic segmentation, object detection, and instance segmentation. Additionally, we apply our method to remote sensing detection, achieving state-of-the-art results in single-scale settings. The code is available at https://github.com/Linwei-Chen/FDAM.
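
The inversion step described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the released code: "inverting" the attention matrix is taken to mean subtracting it from the identity, and the per-token mixing weights `w_low` and `w_high` are passed in by hand, whereas the abstract says the two filters are combined dynamically. The function name `att_inv` and its signature are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def att_inv(q, k, v, w_low, w_high):
    """Mix low-pass attention with its inverted (high-pass) counterpart.

    q, k, v:       (N, d) token matrices.
    w_low, w_high: (N, 1) per-token mixing weights (assumed inputs here).
    """
    n, d = q.shape
    attn = softmax(q @ k.T / np.sqrt(d))      # (N, N), every row sums to 1
    inv = np.eye(n) - attn                    # every row sums to 0
    mixed = w_low * attn + w_high * inv       # weights scale each query row
    return mixed @ v, attn, inv

rng = np.random.default_rng(0)
n, d = 8, 4
q = rng.normal(size=(n, d))
k = rng.normal(size=(n, d))
const = np.ones((n, d))                       # DC-only signal: all tokens identical

ones, zeros = np.ones((n, 1)), np.zeros((n, 1))
low_out, attn, inv = att_inv(q, k, const, w_low=ones, w_high=zeros)
high_out, _, _ = att_inv(q, k, const, w_low=zeros, w_high=ones)
```

Because each row of `attn` sums to one, the low-pass branch passes a constant signal through unchanged, and because each row of `inv` sums to zero, the inverted branch removes it entirely. That complementary behaviour on the DC component is what makes the pair usable as low-pass and high-pass building blocks.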

Paper Structure

This paper contains 9 sections, 9 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Frequency analysis. We build a model by stacking 12 pure attention layers. (a) Attention frequency response analysis reveals that our modulated attention maintains a higher mean magnitude across all frequency bands compared to standard attention, while also exhibiting greater diversity in high-frequency regions. (b) Feature spectrum analysis shows that our modulated attention maintains a stable high-frequency ratio and consistently preserves high-frequency information across layers, unlike standard attention, which rapidly loses it and suffers representation collapse.
  • Figure 2: Analysis of frequency response fitting. Frequency increases from the center (low) to the border (high) of each plot. The attention mechanism has strong low-pass characteristics, which makes it difficult to fit high-pass, band-pass, band-stop, and random filters. In contrast, our method, AttInv, fits these diverse frequency responses far better, indicating greater flexibility in handling a wide range of frequency characteristics.
  • Figure 3: Illustration of Frequency Dynamic Attention Modulation (FDAM), comprising AttInv for attention modulation and FreqScale for feature modulation. The original attention mechanism is predominantly influenced by low-frequency components due to its inherent low-pass filtering characteristics. i) AttInv inverts the low-pass filter, represented by the attention weights, to derive a high-pass filter. By dynamically combining these filters using a predicted weight, we achieve a balanced representation that retains both low- and high-frequency information. ii) FreqScale adaptively reweights different frequency bands, enhancing suppressed high-frequency components (e.g., edges, textures) while preserving structural low-frequency information. This integrated approach alleviates attention collapse and patch uniformity issues in Vision Transformers, facilitating full-spectrum feature representation for improved discriminability.
  • Figure 4: Illustration of frequency scaling weight generation. The input feature map of dimensions $C \times H \times W$ is first and then fed into an MLP with a Tanh activation function to generate dynamic coefficients of dimensions $g \times n$. These dynamic coefficients are multiplied with $n$ learnable static scaling weights $\in \mathbb{R}^{\frac{C}{g} \times b \times b}$ to produce the final scaling weights, which are upsampled to match the size of the feature map in the Fourier domain ($C \times H \times W$). This mechanism enables precise adjustment of various frequency components within the feature map dynamically.
  • Figure 5: (a) Effective rank analysis for feature rank collapse. Higher effective rank [2023lowrankbias] indicates a greater ability to capture complex patterns and nuanced details from the input data. FDAM maintains a consistently higher effective rank across all layers compared to the DeiT model using standard attention, demonstrating enhanced expressiveness of the attention mechanisms. (b) Feature similarity analysis. The cosine similarity increases with depth in the baseline DeiT model, indicating a loss of diversity in patch representations [2022antioversmoothing, 2023mitigating]. The proposed FDAM method largely reduces this similarity, promoting more diverse features.
  • ...and 4 more figures
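
The two collapse diagnostics named in the Figure 5 caption can be sketched as follows. This is an illustrative sketch, not the paper's evaluation code: effective rank is assumed here to be the entropy-based definition (the exponential of the Shannon entropy of the normalized singular values), which may differ from the exact variant the authors use, and the function names are hypothetical.

```python
import numpy as np

def effective_rank(x, eps=1e-12):
    """Entropy-based effective rank of an (N, d) token matrix (assumed definition)."""
    s = np.linalg.svd(x, compute_uv=False)
    p = s / (s.sum() + eps)                   # normalize singular values to a distribution
    p = p[p > eps]                            # drop numerically zero entries before the log
    return float(np.exp(-(p * np.log(p)).sum()))

def mean_cosine_similarity(x, eps=1e-12):
    """Average cosine similarity over all ordered pairs of distinct tokens."""
    xn = x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)
    sim = xn @ xn.T
    n = x.shape[0]
    off_diagonal = sim.sum() - np.trace(sim)  # exclude each token's similarity to itself
    return float(off_diagonal / (n * (n - 1)))

rng = np.random.default_rng(0)
diverse = rng.normal(size=(16, 8))                       # tokens differ from one another
collapsed = np.outer(np.ones(16), rng.normal(size=8))    # every token is the same vector

er_diverse, er_collapsed = effective_rank(diverse), effective_rank(collapsed)
cs_diverse, cs_collapsed = mean_cosine_similarity(diverse), mean_cosine_similarity(collapsed)
```

On the fully collapsed matrix the effective rank drops to about 1 and the mean cosine similarity rises to about 1, while the random matrix scores a higher rank and a lower similarity. This is the direction of change the caption describes between the baseline and FDAM features.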