Reducing Deep Network Complexity via Sparse Hierarchical Fourier Interaction Networks

Andrew Kiruluta, Samantha Williams

TL;DR

The paper tackles the high computational cost of long‑range interactions in deep networks by proposing Sparse Hierarchical Fourier Interaction Networks (SHFIN), a frequency‑domain operator that couples locality, sparsity, and low‑rank mixing. SHFIN processes input via hierarchical patchwise FFTs, applies a differentiable top‑K spectral mask learned with Gumbel‑Softmax, and employs a gated low‑rank bilinear mixer to model cross‑frequency interactions, achieving sub‑quadratic complexity and parameter efficiency. Across ImageNet‑1k, CIFAR, and WMT14 En→De, SHFIN delivers competitive or superior accuracy while reducing parameters, FLOPs, and latency relative to CNN, Transformer, and Fourier baselines. This work suggests a hardware‑friendly, interpretable alternative to traditional operators, with clear avenues for adaptive sparsity, hardware accelerators, and extensions to higher‑dimensional data.

Abstract

This paper presents Sparse Hierarchical Fourier Interaction Networks (SHFIN), an architectural building block that unifies three complementary principles of frequency-domain modeling: a hierarchical patch-wise Fourier transform that affords simultaneous access to local detail and global context; a learnable, differentiable top-K masking mechanism that retains only the most informative spectral coefficients, thereby exploiting the natural compressibility of visual and linguistic signals; and a gated low-rank bilinear mixer that models cross-frequency interactions.
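
The differentiable top-K masking idea can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `gumbel_topk_mask`, the temperature `tau`, and the choice to return both a hard K-hot mask and a softmax relaxation are assumptions made for illustration. In an autodiff framework the soft vector would typically carry gradients while the hard mask is used in the forward pass (a straight-through estimator).

```python
import numpy as np

def gumbel_topk_mask(logits, k, tau=1.0, rng=None):
    """Sample a K-hot frequency mask from learnable logits (illustrative).

    Returns (hard, soft):
      hard - exact K-hot vector in {0, 1}, one entry per frequency bin
      soft - temperature-tau softmax over the Gumbel-perturbed logits,
             a relaxation that an autodiff framework could differentiate
    """
    rng = np.random.default_rng() if rng is None else rng
    logits = np.asarray(logits, dtype=float)

    # Gumbel(0, 1) noise via the inverse-CDF trick; u stays inside (0, 1).
    u = rng.uniform(low=1e-9, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))
    scores = (logits + gumbel) / tau

    # Hard mask: keep the k highest perturbed scores.
    top = np.argpartition(scores, -k)[-k:]
    hard = np.zeros(logits.shape)
    hard[top] = 1.0

    # Soft relaxation: numerically stable softmax of the same scores.
    z = np.exp(scores - scores.max())
    soft = z / z.sum()
    return hard, soft

# Hypothetical usage: 8 frequency bins, keep 3 of them.
logits = np.array([2.0, 0.5, -1.0, 0.0, 1.5, -0.5, 0.3, 0.1])
hard, soft = gumbel_topk_mask(logits, k=3, rng=np.random.default_rng(0))
```

Lower values of `tau` make the soft vector more peaked, which trades gradient smoothness for a closer match to the hard selection.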

Paper Structure

This paper contains 20 sections, 11 equations, 1 figure, 4 tables.

Figures (1)

  • Figure 1: Depiction of the Sparse Hierarchical Fourier Interaction Network (SHFIN) block. Beginning with an input feature map $X\in\mathbb{R}^{L\times C}$, we first partition $X$ into $P$ non‑overlapping patches $X^{(p)}\in\mathbb{R}^{s\times s\times C}$. Each patch is transformed into the frequency domain via a Fast Fourier Transform $F^{(p)}[f,c] \;=\;\sum_{n=0}^{s-1}X^{(p)}[n,c]\;e^{-2\pi i\,f\,n/s},$ yielding spectral coefficients $F^{(p)}[f,c]$. We then apply a learnable $K$‑sparse binary mask $g\in\{0,1\}^s$, sampled via Gumbel–Softmax, to prune redundant frequencies: $\widetilde{F}^{(p)}[f,c] \;=\; g_f\,F^{(p)}[f,c], \quad \sum_{f}g_f = K.$ The retained tensor $Z^{(p)}\in\mathbb{C}^{K\times C}$ is projected into query, key, and value spaces by $Q = Z^{(p)}W_q,\quad K = Z^{(p)}W_k,\quad V = Z^{(p)}W_v,$ and mixed via a gated bilinear operation $M = \mathrm{softmax}\bigl(Q\,K^\top/\sqrt{r}\bigr)\,V.$ After mixing, we zero‑pad $M$ back to the full spectrum $\widehat{F}^{(p)}\in\mathbb{C}^{s\times C}$ and reconstruct spatial features with an inverse FFT: $\widehat{X}^{(p)}[n,c] \;=\;\frac{1}{s} \sum_{f=0}^{s-1}\widehat{F}^{(p)}[f,c]\;e^{2\pi i\,f\,n/s}.$ Finally, a residual connection and layer normalization fuse the transformed patch back into the original representation: $Y^{(p)} = \mathrm{LayerNorm}\bigl(X^{(p)} + \widehat{X}^{(p)}\bigr).$ This end‐to‐end spectral pipeline replaces both convolutional filters and quadratic self‐attention with a compact, spectrum‑sparse operator.
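
The pipeline in the Figure 1 caption can be traced step by step in a short NumPy sketch. It is a reading of the caption's formulas under stated assumptions, not the authors' code: patches are treated as 1-D segments of length $s$ (matching the 1-D FFT sum in the caption), the mask is passed in as a fixed index set `keep_idx` rather than sampled, the attention scores use the real part of $Q K^{H}$ because softmax is not defined for complex inputs, the real part of the inverse FFT is taken before the residual, and LayerNorm has no learnable gain or bias. The names `shfin_block`, `layer_norm`, and `keep_idx` are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize over the channel axis (no learnable gain/bias here)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def shfin_block(X, s, keep_idx, Wq, Wk, Wv):
    """One forward pass following the Figure 1 caption (illustrative).

    X        : (L, C) real input, L divisible by patch length s
    keep_idx : integer array of the K retained frequency bins
    Wq, Wk   : (C, r) projections;  Wv : (C, C) projection
    """
    L, C = X.shape
    assert L % s == 0, "L must be a multiple of the patch length s"
    patches = X.reshape(L // s, s, C)               # (P, s, C)

    # Patchwise FFT along the within-patch axis.
    F = np.fft.fft(patches, axis=1)                 # (P, s, C), complex

    # K-sparse mask: keep only the selected frequency bins.
    Z = F[:, keep_idx, :]                           # (P, K, C)

    # Query / key / value projections and softmax mixing.
    Q, Kmat, V = Z @ Wq, Z @ Wk, Z @ Wv
    r = Wq.shape[1]
    # Assumption: real part of Q K^H so that softmax is well defined.
    scores = np.real(Q @ np.conj(Kmat).transpose(0, 2, 1)) / np.sqrt(r)
    scores = scores - scores.max(axis=-1, keepdims=True)
    A = np.exp(scores)
    A = A / A.sum(axis=-1, keepdims=True)           # (P, K, K)
    M = A @ V                                       # (P, K, C), complex

    # Zero-pad back to the full spectrum and invert (ifft applies the 1/s).
    F_hat = np.zeros_like(F)
    F_hat[:, keep_idx, :] = M
    X_hat = np.real(np.fft.ifft(F_hat, axis=1))     # assumption: drop imag part

    # Residual connection plus layer normalization.
    return layer_norm(patches + X_hat).reshape(L, C)

# Hypothetical sizes: L=16, C=4, s=8, K=3 retained bins, rank r=2.
rng = np.random.default_rng(0)
X = rng.standard_normal((16, 4))
Y = shfin_block(X, s=8, keep_idx=np.array([0, 1, 7]),
                Wq=rng.standard_normal((4, 2)),
                Wk=rng.standard_normal((4, 2)),
                Wv=rng.standard_normal((4, 4)))
```

In this sketch the mixing step costs on the order of $K^2$ per patch rather than $s^2$, which is where the sparsity pays off when $K \ll s$.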