Collapse or Preserve: Data-Dependent Temporal Aggregation for Spiking Neural Network Acceleration

Jiahao Qin

Abstract

Spike sparsity is widely believed to enable efficient spiking neural network (SNN) inference on GPU hardware. We demonstrate this is an illusion: five distinct sparse computation strategies on Apple M3 Max all fail to outperform dense convolution, because SIMD architectures cannot exploit the fine-grained, unstructured sparsity of i.i.d. binary spikes. Instead, we propose Temporal Aggregated Convolution (TAC), which exploits convolution linearity to pre-aggregate $K$ spike frames before a single convolution call, reducing $T$ calls to $T/K$. On rate-coded data, TAC achieves $13.8\times$ speedup with +1.6% accuracy on MNIST and +5.4% on Fashion-MNIST, a simultaneous improvement in both speed and accuracy. However, on event-based data where the temporal dimension carries genuine motion information, TAC's temporal collapse is harmful. We therefore introduce TAC-TP (Temporal Preservation), which shares each group's convolution output across $K$ independent LIF steps, preserving full temporal resolution for downstream layers. On DVS128-Gesture, TAC-TP achieves 95.1% accuracy (vs. 96.3% baseline) with 50% fewer convolution calls, while standard TAC drops to 91.3%. Our key finding is that the optimal temporal aggregation strategy is data-dependent: collapse the temporal dimension for rate-coded data (noise reduction) but preserve it for event data (information retention). Speedup is hardware-agnostic: TAC achieves $11.0\times$ on NVIDIA V100, confirming the mechanism transfers across GPU architectures. All operators in the mlx-snn library are open source.
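
The linearity argument can be written out for one group of $K$ frames. The weights $w_j$, the kernel symbol $\theta$, and the indexing of frames within group $k$ are notation assumed here for illustration; $A_k$ and $Y_k$ follow the figure captions.

$$
Y_k \;=\; \theta * A_k \;=\; \theta * \sum_{j=0}^{K-1} w_j\, S_{kK+j} \;=\; \sum_{j=0}^{K-1} w_j \,\bigl(\theta * S_{kK+j}\bigr)
$$

The right-hand side needs $K$ convolution calls and the left-hand side needs one, which is how $T$ calls become $T/K$. The identity holds for the convolution alone; in a per-frame baseline the LIF nonlinearity sits between successive frames, so the aggregated layer is presumably an approximation of the baseline rather than an exact equivalent.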

Paper Structure

This paper contains 45 sections, 3 theorems, 8 equations, 5 figures, 10 tables, 2 algorithms.

Key Result

Proposition 1

Let $S \in \{0,1\}^n$ be a spike vector with i.i.d. entries $S_i \sim \text{Bernoulli}(\rho)$, processed by a SIMD unit of width $W$. The fraction of skippable lanes (all elements zero) is $f_\text{skip} = (1-\rho)^W$. For $\rho=0.1, W=32$: $f_\text{skip} = 0.035$. The theoretical maximum speedup $1
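
The skip fraction in Proposition 1 is easy to check numerically. The sketch below evaluates $(1-\rho)^W$ in closed form and compares it with a Monte Carlo estimate over simulated SIMD groups. The function names, sample count, and seed are illustrative choices, not part of the paper.

```python
import random


def skip_fraction(rho: float, width: int) -> float:
    """Probability that all `width` lanes in a SIMD group are zero
    when each lane is an i.i.d. Bernoulli(rho) spike."""
    return (1.0 - rho) ** width


def empirical_skip_fraction(rho: float, width: int,
                            n_groups: int = 50_000, seed: int = 0) -> float:
    """Monte Carlo estimate: draw n_groups groups of `width` spikes and
    count the groups that contain no spike at all."""
    rng = random.Random(seed)
    skipped = 0
    for _ in range(n_groups):
        # A lane is zero with probability 1 - rho.
        if all(rng.random() >= rho for _ in range(width)):
            skipped += 1
    return skipped / n_groups


if __name__ == "__main__":
    rho, width = 0.1, 32
    print("closed form:", skip_fraction(rho, width))
    print("simulated:  ", empirical_skip_fraction(rho, width))
```

For $\rho = 0.1$ and $W = 32$ both values come out at a few percent, so only a small share of SIMD groups could be skipped even with perfect zero detection.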

Figures (5)

  • Figure 1: TAC (Temporal Collapse) operator. $K$ input spike frames are pre-aggregated with exponential decay weights into a single frame $A_k$, which undergoes one Conv2d call (instead of $K$). The LIF neuron fires once per group, producing $T/K$ output timesteps. This temporal collapse acts as ensemble averaging on rate-coded data, improving accuracy while providing up to $13.8\times$ speedup.
  • Figure 2: TAC-TP (Temporal Preservation) operator. The same aggregation and single Conv2d call as TAC, but the output $Y_k$ is broadcast to $K$ independent LIF steps, each producing a separate output spike. This preserves the full temporal dimension $T$, which is critical for event-based data where temporal structure encodes motion information.
  • Figure 3: TAC efficiency on rate-coded data (M3 Max). (a) Training speedup scales near-linearly with group size $K$, reaching $13.8\times$ at $K\!=\!16$. (b) TAC simultaneously improves accuracy by $+1.6\%$ (MNIST) and $+5.4\%$ (FMNIST) through implicit ensemble averaging of Poisson samples.
  • Figure 4: DVS-Gesture efficiency comparison. (a) TAC-TP reduces convolution calls by $2$--$8\times$ while TAC $K\!=\!2$ uses the same 40 calls but collapses temporal resolution. (b) TAC-TP $K\!=\!2$ retains $95.1\%$ accuracy ($-1.2\%$), significantly outperforming TAC's $91.3\%$. (c) Wall-clock training time per epoch shows meaningful reduction for all operators.
  • Figure 5: Cross-platform speedup on MNIST. TAC's near-linear speedup with $K$ is consistent across Apple M3 Max (MLX) and NVIDIA V100 (PyTorch), confirming the mechanism is hardware-agnostic. TAC-TP provides modest speedup ($1.5\times$) as LIF dynamics still run per-timestep.
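
The two operators in Figures 1 and 2 can be sketched in NumPy as follows. This is a minimal single-channel illustration under stated assumptions, not the mlx-snn implementation: the aggregation weights `decay**(K-1-j)`, the LIF model (leak 0.9, threshold 1.0, hard reset), the feeding of the unnormalized $Y_k$ to every LIF step in preserve mode, and all function names are hypothetical choices made for the example.

```python
import numpy as np


def conv2d_valid(x, kernel):
    """Single-channel 'valid' cross-correlation, standing in for one Conv2d call."""
    kh, kw = kernel.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * x[i:i + out_h, j:j + out_w]
    return out


def aggregate(frames, decay):
    """Collapse K frames into one; weights decay**(K-1-j) are an assumed form."""
    k = frames.shape[0]
    weights = decay ** np.arange(k - 1, -1, -1)
    return np.tensordot(weights, frames, axes=1)


def lif_step(v, current, leak=0.9, threshold=1.0):
    """One leaky integrate-and-fire update with hard reset (assumed neuron model)."""
    v = leak * v + current
    spikes = (v >= threshold).astype(np.float64)
    return v * (1.0 - spikes), spikes


def tac(spikes, kernel, k, decay=0.9, preserve=False):
    """TAC-style layer: one convolution per group of k frames.

    preserve=False: one LIF step per group, giving T/k output frames.
    preserve=True:  k LIF steps share the group's conv output, giving T frames.
    """
    t = spikes.shape[0]
    assert t % k == 0, "this sketch assumes T is divisible by K"
    v, out = None, []
    for start in range(0, t, k):
        y = conv2d_valid(aggregate(spikes[start:start + k], decay), kernel)
        if v is None:
            v = np.zeros_like(y)
        for _ in range(k if preserve else 1):
            v, s = lif_step(v, y)
            out.append(s)
    return np.stack(out)


rng = np.random.default_rng(0)
spikes = (rng.random((8, 6, 6)) < 0.1).astype(np.float64)  # T = 8 binary frames
kernel = rng.standard_normal((3, 3))
print(tac(spikes, kernel, k=4).shape)                 # collapse mode
print(tac(spikes, kernel, k=4, preserve=True).shape)  # preserve mode
```

With $T = 8$ and $K = 4$, both modes make two convolution calls; collapse mode returns an array of shape `(2, 4, 4)` and preserve mode one of shape `(8, 4, 4)`, matching the $T/K$ versus $T$ output timesteps described in the captions.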

Theorems & Definitions (6)

  • Proposition 1: SIMD Sparsity Bound
  • Definition 1: TAC Operator
  • Theorem 1: TAC Approximation Error Bound
  • Proof sketch
  • Proposition 2: TAC-TP Temporal Resolution Guarantee
  • Proof