SDiT: Spiking Diffusion Model with Transformer

Shu Yang; Hanzhi Ma; Chengting Yu; Aili Wang; Er-Ping Li

TL;DR

This paper explores a novel diffusion model architecture within spiking neural networks, using a transformer in place of the U-Net structure commonly used in mainstream diffusion models. The resulting model can generate higher-quality images with relatively lower computational cost and shorter sampling time.

Abstract

Spiking neural networks (SNNs) have low power consumption and bio-interpretable characteristics, and are considered to have tremendous potential for energy-efficient computing. However, the exploration of SNNs on image generation tasks remains very limited, and a unified and effective structure for SNN-based generative models has yet to be proposed. In this paper, we explore a novel diffusion model architecture within spiking neural networks. We use a transformer to replace the U-Net structure commonly used in mainstream diffusion models. The proposed model can generate higher-quality images with relatively lower computational cost and shorter sampling time, and it aims to provide an empirical baseline for research on SNN-based generative models. Experiments on the MNIST, Fashion-MNIST, and CIFAR-10 datasets demonstrate that our work is highly competitive with existing SNN generative models.

Paper Structure (16 sections, 13 equations, 4 figures, 3 tables)

Introduction
Preliminary
Proposed Method
Overview of SDiT Architecture
Embedding
Spiking Transformer Block
Reconstruction Module
Final Layer
Experiment
Experiment Settings
Evaluation Metrics
Implementation Details
Comparisons
Ablation Study
Discussion
...and 1 more section

Figures (4)

Figure 1: Diagram of SDiT architecture, illustrating the flow from input time and patch embeddings through multiple spiking transformer blocks with skip connections, culminating in a final processing stage with linear and convolutional layers for predicted noise generation.
Figure 2: Spiking Transformer Block. This block comprises skip connections from the previous block, Time-Mixing, LIF neurons, Channel-Mixing, and internal residual connections. Its core structure is analogous to that of a transformer block, with the key distinction being the substitution of self-attention by RWKV. The Reconstruction Module disassembles the previously concatenated Reconstruction Tokens, supplementing the original vector with transformed information that includes the intrinsic dynamics of LIF neurons. This process compensates for the information loss in the original vector after it has been processed by spiking neurons.
Figure 3: Samples on different datasets. The top three rows show MNIST, the middle three rows show Fashion-MNIST, and the bottom three rows show CIFAR-10.
Figure 4: Samples on MNIST without Reconstruction Module.
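
The captions above refer to LIF (leaky integrate-and-fire) neurons without stating their update rule. As a rough illustration only, the Python sketch below simulates one commonly used discrete-time LIF formulation with a hard reset. The function name `lif_forward`, the parameters `tau`, `v_threshold`, and `v_reset`, and the update rule itself are assumptions chosen for illustration; the formulation used in the paper may differ.

```python
def lif_forward(currents, tau=2.0, v_threshold=1.0, v_reset=0.0):
    """Simulate a single LIF neuron over a sequence of input currents.

    Hypothetical helper for illustration; not taken from the paper.
    Returns a list of binary spikes, one per time step.
    """
    v = v_reset  # membrane potential starts at the reset value
    spikes = []
    for x in currents:
        # Leaky integration: the potential decays toward v_reset
        # while accumulating the current input, scaled by 1/tau.
        v = v + (x - (v - v_reset)) / tau
        if v >= v_threshold:
            spikes.append(1)  # emit a spike
            v = v_reset       # hard reset after firing
        else:
            spikes.append(0)
    return spikes


# A constant input of 1.5 needs two steps to cross the threshold.
print(lif_forward([1.5, 1.5, 1.5, 1.5]))  # prints [0, 1, 0, 1]
```

Because the output is a binary spike train, some information in the real-valued input is lost at each neuron, which is the kind of loss the Reconstruction Module in Figure 2 is described as compensating for.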
