Temporal-Aware Spiking Transformer Hashing Based on 3D-DWT

Zihao Mei, Jianhao Li, Bolin Zhang, Chong Wang, Lijun Guo, Guoqi Li, Jiangbo Qian

TL;DR

The paper tackles energy-efficient retrieval for dynamic vision sensor (DVS) data by introducing Spikinghash, a temporal-aware hashing framework for spiking neural networks. It combines Spiking WaveMixer (SWM), based on multilevel 3D-DWT, to capture local spatiotemporal features with Spiking Self-Attention (SSA) for global context, and generates binary hash codes directly from spike activities via a dedicated spiking hash layer. A dynamic soft similarity loss, built from membrane potentials, complements hard class labels to better reflect inter-class relations and mitigate information loss in SNNs. Extensive experiments across DVS and static image datasets show state-of-the-art retrieval performance with significantly fewer parameters and lower energy demands than ANN baselines, demonstrating practical benefits for neuromorphic, low-power retrieval tasks.

Abstract

With the rapid growth of dynamic vision sensor (DVS) data, constructing a low-energy, efficient data retrieval system has become an urgent task. Hash learning is one of the most important retrieval technologies, as it keeps the distance between hash codes consistent with the distance between DVS data. Because spiking neural networks (SNNs) encode information through spikes, they show great potential for improving energy efficiency. Based on the binary characteristics of SNNs, we first propose a novel supervised hashing method named Spikinghash with a hierarchical lightweight structure. Spiking WaveMixer (SWM) is deployed in shallow layers, utilizing a multilevel 3D discrete wavelet transform (3D-DWT) to decouple spatiotemporal features into various low-frequency and high-frequency components, and then employing efficient spectral feature fusion. SWM can effectively capture temporal dependencies and local spatial features. Spiking Self-Attention (SSA) is deployed in deeper layers to further extract global spatiotemporal information. We also design a hash layer that exploits the binary characteristics of SNNs and integrates information over multiple time steps to generate the final hash codes. Furthermore, we propose a new dynamic soft similarity loss for SNNs, which uses membrane potentials to construct a learnable similarity matrix as soft labels, capturing the similarity differences between classes and compensating for information loss in SNNs, thereby improving retrieval performance. Experiments on multiple datasets demonstrate that Spikinghash achieves state-of-the-art results with low energy consumption and fewer parameters.
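The abstract describes a hash layer that integrates spike activity over multiple time steps into binary codes, which are then compared by distance. The sketch below illustrates that general idea only; it is not the paper's hash layer. The function names `spikes_to_hash` and `hamming_distances`, the `(T, N, K)` array layout, and the 0.5 firing-rate threshold are all assumptions made for illustration.

```python
import numpy as np

def spikes_to_hash(spikes):
    """Turn binary spike trains into binary hash codes.

    spikes: array of shape (T, N, K) with values in {0, 1}
            (T time steps, N samples, K hash bits); layout is an assumption.
    Returns an (N, K) uint8 array of codes in {0, 1}.
    """
    num_steps = spikes.shape[0]
    # Integrate over time: fraction of time steps in which each bit fired.
    firing_rate = spikes.sum(axis=0) / num_steps
    # Hypothetical rule: a bit is 1 if it fired in at least half of the steps.
    return (firing_rate >= 0.5).astype(np.uint8)

def hamming_distances(query, database):
    """Hamming distance between one code (K,) and each row of database (N, K)."""
    return np.count_nonzero(database != query, axis=1)

# Usage: rank database items by Hamming distance to a query code.
rng = np.random.default_rng(0)
db_codes = spikes_to_hash(rng.integers(0, 2, size=(4, 10, 16)))
ranking = np.argsort(hamming_distances(db_codes[0], db_codes), kind="stable")
```

Retrieval then reduces to sorting database items by Hamming distance to the query code, which is cheap because the codes are binary.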

Paper Structure (52 sections, 19 equations, 5 figures, 13 tables)

Figures (5)

  • Figure 1: In HMDB51-DVS data, the actions "sit" and "stand" contain symmetric semantic information at different time steps. Distinguishing these actions requires recognizing the temporal order and dependencies across multiple time steps.
  • Figure 2: The overall architecture of Spikinghash. This hierarchical SNN-Transformer architecture includes a downsample layer before each stage. The first two stages deploy Spiking Waveformer Blocks, each comprising a Spiking WaveMixer (SWM) and an MLP. The last two stages deploy Spiking Transformer Blocks, each comprising Spiking Self-Attention (SSA) and an MLP. The residual connections in both block types are omitted in the figure. A hash layer or a classification head is attached depending on the downstream task.
  • Figure 3: The iterative multilevel decomposition process of 3D-DWT.
  • Figure 4: Comparison of the top-1 accuracies on CIFAR100 between Spikinghash and several SNNs [zhou2023enhancing, yao2024spike, zhou2022spikformer] and ANNs [dosovitskiy2020image] (Transformer with four blocks). Power represents the theoretical energy consumption during an evaluation. The bubble size corresponds to energy consumption.
  • Figure 5: Visualization of the similarity matrices $\boldsymbol{S_{\text{soft}}}$ based on membrane potentials and $\boldsymbol{S_{\text{hash}}}$ based on Hamming distances.
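
The iterative multilevel 3D-DWT mentioned in Figure 3 can be illustrated with a short sketch. It assumes a Haar wavelet, a `(T, H, W)` input with even dimensions at every level, and recursion on the all-low-frequency band, which is the usual multilevel DWT scheme. The source does not state which wavelet or layout the paper uses, so the function names and these choices are illustrative only.

```python
import numpy as np

def haar_split(x, axis):
    """One orthonormal Haar analysis step along `axis`: returns (low, high)."""
    n = x.shape[axis]
    if n % 2 != 0:
        raise ValueError("dimension must be even for this simple sketch")
    even = np.take(x, np.arange(0, n, 2), axis=axis)
    odd = np.take(x, np.arange(1, n, 2), axis=axis)
    return (even + odd) / np.sqrt(2.0), (even - odd) / np.sqrt(2.0)

def dwt3d_haar(x):
    """Single-level 3D Haar DWT of a (T, H, W) array.

    Returns a dict of 8 subbands keyed by three letters, one per axis
    in (T, H, W) order, e.g. "LLL" (all low-pass) or "HLH".
    """
    bands = {"": x}
    for axis in range(3):
        split = {}
        for name, arr in bands.items():
            low, high = haar_split(arr, axis)
            split[name + "L"] = low
            split[name + "H"] = high
        bands = split
    return bands

def multilevel_dwt3d(x, levels):
    """Repeatedly decompose the low-frequency band "LLL".

    Returns (approximation, details), where details[i] holds the 7
    high-frequency subbands produced at level i + 1.
    """
    details = []
    current = x
    for _ in range(levels):
        bands = dwt3d_haar(current)
        current = bands.pop("LLL")
        details.append(bands)
    return current, details

# Usage: a two-level decomposition of a 4-step, 8x8 spatiotemporal volume.
volume = np.arange(4 * 8 * 8, dtype=float).reshape(4, 8, 8)
approx, details = multilevel_dwt3d(volume, levels=2)
```

Each level halves every dimension of the low-frequency band, so the deeper levels summarize progressively coarser temporal and spatial structure while the detail subbands keep the high-frequency components.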