Spiking Neural Networks Need High Frequency Information

Yuetong Fang, Deming Zhou, Ziqing Wang, Hongwei Ren, ZeCui Zeng, Lusong Li, Shibo Zhou, Renjing Xu

TL;DR

The paper investigates why Spiking Neural Networks (SNNs) underperform Artificial Neural Networks (ANNs) and identifies a fundamental frequency bias: spiking neurons naturally suppress high-frequency information. It gives a theoretical proof that spiking neurons act as low-pass filters and introduces Max-Former, which restores high-frequency content by using Max-Pool in patch embedding and Depth-Wise Convolution for early-stage token mixing, keeping Spiking Self-Attention (SSA) only in the final stage. Empirically, Max-Former achieves 82.39% top-1 accuracy on ImageNet with 63.99M parameters, outperforming Spikformer by +7.58% while consuming ~30% less energy. It also delivers strong results on CIFAR-10/100 and neuromorphic datasets, and Max-ResNet-18 reaches state-of-the-art performance on CIFAR benchmarks. The work suggests that preserving high-frequency information is crucial for SNNs and offers simple, scalable architectural adjustments to enhance spike-based computation across vision tasks.

Abstract

Spiking Neural Networks promise brain-inspired and energy-efficient computation by transmitting information through binary (0/1) spikes. Yet, their performance still lags behind that of artificial neural networks, often assumed to result from information loss caused by sparse and binary activations. In this work, we challenge this long-standing assumption and reveal a previously overlooked frequency bias: spiking neurons inherently suppress high-frequency components and preferentially propagate low-frequency information. This frequency-domain imbalance, we argue, is the root cause of degraded feature representation in SNNs. Empirically, on Spiking Transformers, adopting Avg-Pooling (low-pass) for token mixing lowers performance to 76.73% on CIFAR-100, whereas replacing it with Max-Pool (high-pass) pushes the top-1 accuracy to 79.12%. Accordingly, we introduce Max-Former, which restores high-frequency signals through two frequency-enhancing operators: (1) extra Max-Pool in patch embedding, and (2) Depth-Wise Convolution in place of self-attention. Notably, Max-Former attains 82.39% top-1 accuracy on ImageNet using only 63.99M parameters, surpassing Spikformer (74.81%, 66.34M) by +7.58%. Extending our insight beyond transformers, our Max-ResNet-18 achieves state-of-the-art performance on convolution-based benchmarks: 97.17% on CIFAR-10 and 83.06% on CIFAR-100. We hope this simple yet effective solution inspires future research to explore the distinctive nature of spiking neural networks. Code is available at https://github.com/bic-L/MaxFormer.
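
The Avg-Pool versus Max-Pool contrast in the abstract can be illustrated on a toy signal. This is a minimal sketch, not the paper's experiment: it assumes a random binary spike train (firing probability 0.2), 1-D pooling with a window of 3 and stride 1, and an arbitrary cutoff at half the Nyquist frequency. The helper names `pool1d` and `high_freq_ratio` are hypothetical. It measures what share of each pooled signal's non-DC spectral energy lies at or above the cutoff.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def pool1d(x, k, mode):
    """Stride-1 pooling over a 1-D signal (hypothetical helper)."""
    windows = sliding_window_view(x, k)  # shape: (len(x) - k + 1, k)
    return windows.mean(axis=1) if mode == "avg" else windows.max(axis=1)


def high_freq_ratio(y, cutoff=0.25):
    """Share of non-DC spectral energy at or above `cutoff` (cycles/sample)."""
    y = y - y.mean()                     # remove the DC component
    power = np.abs(np.fft.rfft(y)) ** 2
    freqs = np.fft.rfftfreq(len(y))      # 0 .. 0.5 cycles/sample
    return power[freqs >= cutoff].sum() / power.sum()


rng = np.random.default_rng(0)
spikes = (rng.random(4096) < 0.2).astype(float)  # assumed toy spike train

hf_avg = high_freq_ratio(pool1d(spikes, 3, "avg"))
hf_max = high_freq_ratio(pool1d(spikes, 3, "max"))
print(f"avg-pool high-frequency share: {hf_avg:.3f}")
print(f"max-pool high-frequency share: {hf_max:.3f}")
```

For this toy input the average-pooled signal keeps a smaller share of its energy in the upper band than the max-pooled one, which is consistent with describing Avg-Pool as low-pass. The sketch says nothing about classification accuracy.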

Paper Structure

This paper contains 27 sections, 20 equations, 8 figures, 9 tables.

Figures (8)

  • Figure 1: Spiking Transformer architectures: (a) Avg-Pool vs. (b) Max-Pool for token mixing, with (c) detailed implementation of the Spiking MLP (S-MLP) block. In mainstream (non-spiking) Vision Transformer research, Avg-Pool, which captures global low-frequency patterns, is a more common token mixing strategy than Max-Pool (high-pass) [yu2022metaformer, yu2023metaformer]. Surprisingly, in Spiking Transformers, replacing Avg-Pool with Max-Pool yields a +2.39% improvement on CIFAR-100.
  • Figure 2: Comparison between ReLU and spiking neuron (S-Neuron): (a) Input images; (b) Fourier spectrum analysis of output features processed as input $\rightarrow$ activation $\rightarrow$ weighting, with high-frequency regions marked (red dashed boxes: regions above $0.55\times$ the maximum amplitude) and (c) the corresponding relative log amplitude; (d) GradCAM comparison under an identical architectural setting following [wang2023masked], with the converted Spiking Transformer using 256 timesteps. Spiking neurons cause the rapid dissipation of high-frequency components, which in turn degrades feature representations.
  • Figure 3: Time-frequency analysis of ReLU and spiking neurons. (a) Time-domain signals of the input $x(t) = \frac{1}{3}(\sin(2\pi \cdot 100t) + \sin(2\pi \cdot 200t) + \sin(2\pi \cdot 300t))$ (blue), the ReLU-processed signal $r(t)$ (red), and the spiking output $s(t)$ of a LIF neuron with $\beta = 0.25$ (green). (b) Fourier analysis of $x(t)$, $r(t)$, and $s(t)$. (c) Fourier analysis of linearly transformed (CONV/MLP) activations, where ReLU expands the frequency bandwidth of the input signal, while the spiking neuron shows high-frequency attenuation.
  • Figure 4: (a) Overview of Max-Former: we restore high-frequency signals by using lightweight DWCs instead of self-attention in the early stages. Following the hierarchical design of [liuSwinTransformerHierarchical2021], Max-Former adopts a 3-stage architecture. $D_i$: feature dimensions of stage-$i$. (b) In Max-Former's patch embedding stage, we propose three configurations (Embed-orig, Embed-Max, and Embed-Max+) to enhance high-frequency components.
  • Figure 5: Shortcut connection in SNNs. (Left) Vanilla Shortcut that combines spike and membrane potential. (Middle) Pre-Spike Shortcut that adds spike signals before neuron charging. (Right) Membrane Shortcut that directly connects membrane potentials, ensuring identical potential mapping while strictly preserving the spike-driven computing paradigm throughout the network.
  • ...and 3 more figures
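
The signals named in the Figure 3 caption can be generated with a few lines of NumPy. This is a rough sketch under stated assumptions, not the authors' code: the caption gives only the input $x(t)$ and $\beta = 0.25$, so the sampling rate `FS`, the firing threshold `V_TH`, the charging rule `v = beta * v + x`, and the hard reset are all assumptions, and the helper names are hypothetical. It compares how much non-DC spectral energy lies above 350 Hz for $x(t)$, $r(t)$, and $s(t)$, and evaluates the magnitude response of the assumed leaky charging step at the three input frequencies.

```python
import numpy as np

FS = 2000    # assumed sampling rate in Hz (not given in the caption)
BETA = 0.25  # leak factor from the Figure 3 caption
V_TH = 0.5   # assumed firing threshold

t = np.arange(FS) / FS  # one second of samples, so FFT bins fall on whole Hz
x = (np.sin(2 * np.pi * 100 * t)
     + np.sin(2 * np.pi * 200 * t)
     + np.sin(2 * np.pi * 300 * t)) / 3
r = np.maximum(x, 0.0)  # ReLU-processed signal


def lif_spikes(current, beta, v_th):
    """Simulate a LIF neuron with hard reset (assumed dynamics)."""
    v, out = 0.0, np.zeros_like(current)
    for i, c in enumerate(current):
        v = beta * v + c      # leaky charging (assumed form)
        if v >= v_th:
            out[i] = 1.0      # emit a spike
            v = 0.0           # hard reset (assumed)
    return out


def share_above(sig, f_cut):
    """Share of non-DC spectral energy strictly above f_cut (Hz)."""
    sig = sig - sig.mean()
    power = np.abs(np.fft.rfft(sig)) ** 2
    freqs = np.fft.rfftfreq(len(sig), d=1 / FS)
    return power[freqs > f_cut].sum() / power.sum()


def leak_gain(f, beta):
    """Magnitude response of v[n] = beta * v[n-1] + x[n] at frequency f (Hz)."""
    return 1.0 / np.abs(1.0 - beta * np.exp(-2j * np.pi * f / FS))


s = lif_spikes(x, BETA, V_TH)
hf_x, hf_r, hf_s = (share_above(sig, 350) for sig in (x, r, s))
gains = [leak_gain(f, BETA) for f in (100, 200, 300)]
print(f"energy share above 350 Hz: x={hf_x:.2e}, r={hf_r:.3f}, s={hf_s:.3f}")
print("leak gain at 100/200/300 Hz:", [round(float(g), 3) for g in gains])
```

In this sketch the multi-sine input has essentially no energy above 300 Hz, ReLU adds harmonics above that band, and the leak gain falls as frequency rises, which is the first-order low-pass behaviour of the assumed charging step. The spike train's spectrum depends on the assumed threshold and reset, so `hf_s` is printed without a claim about its value.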