Frequency Dynamic Convolution for Dense Image Prediction

Linwei Chen, Lin Gu, Liang Li, Chenggang Yan, Ying Fu

TL;DR

FDConv addresses the limited frequency diversity and high parameter cost of existing dynamic convolutions by learning a fixed parameter budget in the Fourier domain. Its Fourier Disjoint Weight (FDW) splits this budget into groups with disjoint Fourier indices to build many frequency-diverse weights, while Kernel Spatial Modulation (KSM) and Frequency Band Modulation (FBM) adjust the frequency response of each filter at the spatial level and per frequency band based on local content. On object detection, segmentation, and classification benchmarks, FDConv achieves state-of-the-art or competitive results with a modest parameter increase (+3.6M on ResNet-50), and it integrates into both ConvNets (ConvNeXt) and vision transformers (Swin-Transformer), making it a practical and parameter-efficient route to frequency-aware feature learning for dense image prediction.

Abstract

While Dynamic Convolution (DY-Conv) has shown promising performance by enabling adaptive weight selection through multiple parallel weights combined with an attention mechanism, the frequency responses of these weights tend to be highly similar, resulting in high parameter costs but limited adaptability. In this work, we introduce Frequency Dynamic Convolution (FDConv), a novel approach that mitigates these limitations by learning a fixed parameter budget in the Fourier domain. FDConv divides this budget into frequency-based groups with disjoint Fourier indices, enabling the construction of frequency-diverse weights without increasing the parameter cost. To further enhance adaptability, we propose Kernel Spatial Modulation (KSM) and Frequency Band Modulation (FBM). KSM dynamically adjusts the frequency response of each filter at the spatial level, while FBM decomposes weights into distinct frequency bands in the frequency domain and modulates them dynamically based on local content. Extensive experiments on object detection, segmentation, and classification validate the effectiveness of FDConv. We demonstrate that when applied to ResNet-50, FDConv achieves superior performance with a modest increase of +3.6M parameters, outperforming previous methods that require substantial increases in parameter budgets (e.g., CondConv +90M, KW +76.5M). Moreover, FDConv seamlessly integrates into a variety of architectures, including ConvNeXt and Swin-Transformer, offering a flexible and efficient solution for modern vision tasks. The code is publicly available at https://github.com/Linwei-Chen/FDConv.
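The idea of splitting one fixed Fourier-domain budget into disjoint groups can be illustrated with a small sketch. This is not the authors' implementation: it uses only the Python standard library with a naive DFT, and the names `dft2`, `radius`, `disjoint_groups`, and `group_weight`, the radial-band grouping rule, and the use of a random conjugate-symmetric spectrum in place of learned parameters are all assumptions made for illustration.

```python
import cmath
import math
import random

def dft2(grid, sign):
    """Naive 2-D DFT of a square nested list; sign=-1 is forward, sign=+1 is inverse (scaled by 1/N^2)."""
    n = len(grid)
    scale = n * n if sign > 0 else 1
    return [[sum(grid[u][v] * cmath.exp(sign * 2j * math.pi * (u * x + v * y) / n)
                 for u in range(n) for v in range(n)) / scale
             for y in range(n)] for x in range(n)]

def radius(u, v, n):
    """Distance of Fourier index (u, v) from the zero frequency, with wrap-around."""
    return math.hypot(min(u, n - u), min(v, n - v))

def disjoint_groups(n, num_groups):
    """Assumed grouping rule: split all n*n Fourier indices into radial bands, low to high frequency."""
    r_max = radius(n // 2, n // 2, n)
    groups = [[] for _ in range(num_groups)]
    for u in range(n):
        for v in range(n):
            band = min(int(num_groups * radius(u, v, n) / r_max), num_groups - 1)
            groups[band].append((u, v))
    return groups

def group_weight(spectrum, group):
    """Zero every Fourier coefficient outside `group`, then apply the inverse DFT."""
    n = len(spectrum)
    keep = set(group)
    masked = [[spectrum[u][v] if (u, v) in keep else 0j for v in range(n)] for u in range(n)]
    return dft2(masked, +1)

random.seed(0)
N, NUM_GROUPS = 8, 2
# Assumption: the DFT of a random real array stands in for the learned Fourier parameters,
# which makes the spectrum conjugate-symmetric so that each group's weight is real-valued.
spectrum = dft2([[random.gauss(0, 1) for _ in range(N)] for _ in range(N)], -1)
groups = disjoint_groups(N, NUM_GROUPS)
weights = [group_weight(spectrum, g) for g in groups]
budget = sum(len(g) for g in groups)  # always N * N, regardless of NUM_GROUPS
```

Raising `NUM_GROUPS` yields more weights while `budget` stays at `N * N`, which is the sense in which the parameter cost does not grow. The imaginary parts of each weight are rounding noise here; a practical version would use an FFT library rather than the naive transform.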

Paper Structure

This paper contains 10 sections, 10 equations, 6 figures, and 5 tables.

Figures (6)

  • Figure 1: Weight frequency responses and t-SNE analyses. We set the number of weights to 4 to align with ODConv [2022odconv]. (a) The frequency responses of the four parallel weights in ODConv are highly similar, indicating limited diversity. (b) In contrast, FDConv shows distinct frequency responses for each weight, spanning different parts of the frequency spectrum. (c) The t-SNE plot for ODConv reveals that the filters in the four weights are closely clustered, suggesting a lack of diversity. (d) The t-SNE plot for FDConv shows that the filters in the four weights have different distributions, indicating greater diversity.
  • Figure 2: Illustration of the proposed Frequency Dynamic Convolution, which consists of the Fourier Disjoint Weight (FDW), Kernel Spatial Modulation (KSM), and Frequency Band Modulation (FBM) modules. FC indicates fully connected layer.
  • Figure 3: Illustration of Fourier Disjoint Weight (FDW). The left figure illustrates the division of parameters into disjoint groups, ranging from low frequencies (center) to high frequencies (border). In this example, $n = 2$ groups are shown. The right figure demonstrates how to obtain the convolution weights from the learnable parameter group 0. It first transforms the learnable parameters with specific Fourier indices (with all other Fourier indices set to zero) using the inverse Discrete Fourier Transform (iDFT). The resulting spatial weights are then obtained by cropping the iDFT result into $k \times k$ patches and reshaping them into a weight tensor of size $k \times k \times C_{\text{in}} \times C_{\text{out}}$.
  • Figure 4: Illustration of Kernel Spatial Modulation (KSM). The KSM consists of two branches: the global channel branch and the local channel branch. The local channel branch employs a very lightweight 1-D convolution to obtain local channel information and predicts a dense modulation matrix of size $k \times k \times C_{\text{in}} \times C_{\text{out}}$. The global branch uses a fully connected layer to obtain the global channel information and predicts three dimension-wise modulation values along the input channel, output channel, and kernel spatial dimensions. The two branches are fused to obtain the final weight modulation matrix.
  • Figure 5: Weight similarity and frequency analyses. (a) Existing dynamic convolution methods, such as ODConv [2022odconv], exhibit high cosine similarity ($> 0.88$) among their 4 learned weights. (c) The frequency analysis of 4 representative ODConv layers, from stage 1 to stage 4 of the model, shows strong homogeneity among the 4 weights. (b) In contrast, the 4 weights of our proposed FDConv show zero similarity, allowing each kernel to learn distinct and complementary features with a diversified frequency response, as shown in (d).
  • ...and 1 more figure
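The zero cosine similarity reported for FDConv in Figure 5(b) is what one would expect from disjoint Fourier supports. The short derivation below is a sketch under stated assumptions, not a reproduction of the paper's equations: the symbols $N$, $S_i$, and $\hat{W}_i$ are introduced here, each weight $W_i$ is taken to be the raw $N \times N$ inverse DFT of its group before any modulation, and both weights are assumed to be nonzero.

```latex
% Assumptions: W_i is the N x N inverse DFT of a spectrum \hat{W}_i supported on the
% index set S_i, with S_i \cap S_j = \emptyset for i \neq j.
\begin{align}
W_i[x, y] &= \frac{1}{N^2} \sum_{(u, v) \in S_i} \hat{W}_i[u, v]\, e^{2\pi \mathrm{i}(ux + vy)/N}, \\
\langle W_i, W_j \rangle
  &= \sum_{x, y} W_i[x, y]\, \overline{W_j[x, y]}
   = \frac{1}{N^2} \sum_{u, v} \hat{W}_i[u, v]\, \overline{\hat{W}_j[u, v]}
   = 0 \quad (i \neq j), \\
\cos(W_i, W_j) &= \frac{\langle W_i, W_j \rangle}{\lVert W_i \rVert\, \lVert W_j \rVert} = 0 .
\end{align}
```

The middle equality is Parseval's identity for the DFT; the sum vanishes because no index $(u, v)$ carries a nonzero coefficient in both spectra. Cropping the inverse DFT result into $k \times k$ patches and reshaping it, as described for Figure 3, only rearranges entries, so the inner product between two weights is unchanged by that step.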