A Hybrid ANN-SNN Architecture for Low-Power and Low-Latency Visual Perception

Asude Aydin; Mathias Gehrig; Daniel Gehrig; Davide Scaramuzza

TL;DR

This work introduces a hybrid ANN-SNN architecture in which an auxiliary ANN, running at a low rate, initializes the SNN states, enabling high-rate, low-latency predictions at reduced power on event-based visual perception tasks. By replacing the SNN's long state transients with ANN-derived initial states and combining low-rate ANN predictions with high-rate SNN updates, the method consumes 88% less power with only a 4% drop in performance compared to fully ANN counterparts run at the same inference rate, and achieves a 74% lower error than pure SNNs. The approach is validated on event-based 2D and 3D human pose estimation using the DHP19 and Event-Human3.6M datasets, demonstrating a favorable trade-off between power and accuracy and practical potential for edge deployments on neuromorphic and traditional hardware alike.

Abstract

Spiking Neural Networks (SNNs) are a class of bio-inspired neural networks that promise to bring low-power and low-latency inference to edge devices through asynchronous and sparse processing. However, being temporal models, SNNs depend heavily on expressive states to generate predictions on par with classical artificial neural networks (ANNs). These states converge only after long transient periods and quickly decay without input data, leading to higher latency, higher power consumption, and lower accuracy. This work addresses this issue by initializing the state with an auxiliary ANN running at a low rate. The SNN then uses the state to generate predictions with high temporal resolution until the next initialization phase. Our hybrid ANN-SNN model thus combines the best of both worlds: it does not suffer from long state transients and state decay thanks to the ANN, and it can generate predictions with high temporal resolution, low latency, and low power thanks to the SNN. We show for the task of event-based 2D and 3D human pose estimation that our method consumes 88% less power with only a 4% decrease in performance compared to its fully ANN counterparts when run at the same inference rate. Moreover, when compared to SNNs, our method achieves a 74% lower error. This research thus provides a new understanding of how ANNs and SNNs can be used to maximize their respective benefits.
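The interplay between state transients and periodic initialization can be illustrated with a toy example. The following is a minimal sketch, not the paper's implementation: it assumes a single leaky integrate-and-fire neuron with a hypothetical leak factor and threshold, and it replaces the auxiliary ANN with a placeholder callable (`ann_state`) that simply returns a state value. The names `LIFNeuron` and `run_hybrid` are illustrative only.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class LIFNeuron:
    """Toy leaky integrate-and-fire neuron; all constants are assumptions."""
    decay: float = 0.9       # assumed leak factor applied at every time step
    threshold: float = 1.0   # assumed firing threshold
    v: float = 0.0           # membrane potential, i.e. the neuron's state

    def step(self, current: float) -> int:
        # Leak the previous state, integrate the input, then spike and reset.
        self.v = self.decay * self.v + current
        if self.v >= self.threshold:
            self.v -= self.threshold
            return 1
        return 0


def run_hybrid(
    inputs: Sequence[float],
    init_every: Optional[int],
    ann_state: Callable[[int], float],
) -> List[int]:
    """Run the neuron over `inputs`, overwriting its state every `init_every` steps.

    `ann_state` is a stand-in for the low-rate auxiliary network (hypothetical).
    Passing `init_every=None` gives a pure-SNN run that starts from a zero state.
    """
    neuron = LIFNeuron()
    spikes = []
    for t, x in enumerate(inputs):
        if init_every is not None and t % init_every == 0:
            neuron.v = ann_state(t)  # low-rate state initialization
        spikes.append(neuron.step(x))  # high-rate spiking update
    return spikes


if __name__ == "__main__":
    weak_input = [0.3] * 6
    print("cold start :", run_hybrid(weak_input, None, lambda t: 0.0))
    print("initialized:", run_hybrid(weak_input, 3, lambda t: 0.8))
```

In this toy setting the cold-start neuron needs several steps of input before its state is large enough to produce a first spike, whereas the initialized neuron responds at the first step after each initialization. This is only an analogy for the transient behaviour described in the abstract; the real model operates on multi-layer networks and event data.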

Paper Structure (34 sections, 8 equations, 10 figures, 6 tables)

Introduction
Related work
Hybrid ANN-SNN Architectures
Human Pose Estimation
Frame-based
Event-based
Methodology
Preliminaries
Spiking Neural Networks
Discretization and Training
Network Details
Experiments
Setup
Datasets
Metrics
...and 19 more sections

Figures (10)

Figure 1: Spiking Neural Networks (SNNs, top) are prone to long transient periods and state decay in the absence of input data, leading to lower accuracy and higher latency and power consumption. In this work (bottom), we solve this with an auxiliary artificial neural network (ANN) that initializes the SNN states at low rates. Our resulting hybrid architecture is simultaneously accurate and maintains the low-power and low-latency aspect of SNNs.
Figure 2: Overview of our method. Our method processes inputs as dense and spike-based representations. The ANN uses the dense representation to perform state and output initialization at low rates. The SNN then uses spikes to generate high-rate outputs until the next dense input.
Figure 3: Hybrid ANN-SNN architecture. The ANN (upper row of blocks) is fed with past events at time step $t_0$, where an initial output is predicted, and states of spiking neurons are initialized (orange blocks). Events of duration $\Delta T$ are fed sequentially to the SNN (lower row of blocks) for high-rate updates of the prediction.
Figure 4: Overview of the ablation experiments. Schematic of (A) pure SNN without state initialization and output initialization, (B) hybrid model without output initialization, (C) hybrid model without state initialization, and (D) our proposed hybrid model with state and output initialization. Plots of accuracy over time of our approach against (E) only output initialization and only state initialization ablated and (F) both ablated. All plots show 2D MPJPE scores on the entire test set for camera view #2, with initializations performed at $t=0$.
Figure 5: Effect of state initialization on spike firing rates across time steps. The SNN consumes 46 mW of power before state initialization, which decreases to 30 mW after state initialization.
...and 5 more figures