Resistive memory-based zero-shot liquid state machine for multimodal event data learning

Ning Lin, Shaocong Wang, Yi Li, Bo Wang, Shuhui Shi, Yangu He, Woyu Zhang, Yifei Yu, Yue Zhang, Xinyuan Zhang, Kwunhang Wong, Songqi Wang, Xiaoming Chen, Hao Jiang, Xumeng Zhang, Peng Lin, Xiaoxin Xu, Xiaojuan Qi, Zhongrui Wang, Dashan Shang, Qi Liu, Ming Liu

TL;DR

This study presents a neuromorphic computing platform capable of learning cross-modal, event-driven signals for efficient real-time knowledge generalization and achieves zero-shot transfer learning for multimodal data.

Abstract

The human brain is a complex spiking neural network (SNN), capable of learning multimodal signals in a zero-shot manner by generalizing existing knowledge. Remarkably, it maintains minimal power consumption through event-based signal propagation. However, replicating the human brain in neuromorphic hardware presents both hardware and software challenges. Hardware limitations, such as the slowdown of Moore's law and the von Neumann bottleneck, hinder the efficiency of digital computers. Additionally, SNNs are complex to train in software. To this end, we propose a hardware-software co-design on a 40 nm 256 Kb in-memory computing macro that physically integrates a fixed and random liquid state machine (LSM) SNN encoder with trainable artificial neural network (ANN) projections. We showcase the zero-shot LSM-based learning of multimodal events on the N-MNIST and N-TIDIGITS datasets, including visual and audio data association, as well as neural and visual data alignment for brain-machine interfaces. Our co-design achieves classification accuracy comparable to fully optimized software models, while reducing training costs by 152.83- and 393.07-fold compared to SOTA contrastive language-image pre-training (CLIP) and Prototypical networks, and improving energy efficiency by 23.34- and 160-fold compared to cutting-edge digital hardware, respectively. These proof-of-principle prototypes demonstrate zero-shot multimodal event learning capability for emerging efficient and compact neuromorphic hardware.
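
The fixed, random LSM encoder described in the abstract can be illustrated with a minimal software sketch. This is not the authors' implementation: the leaky integrate-and-fire update, the Gaussian weights (standing in for the random resistive memory conductances), and the names `lsm_spike_counts`, `threshold`, and `decay` are all assumptions made for illustration. It only shows the general idea of a reservoir whose weights are drawn once and never trained, and whose output is a per-neuron spike count.

```python
import numpy as np

def lsm_spike_counts(events, w_in, w_rec, threshold=1.0, decay=0.9):
    """Run a fixed random spiking reservoir over binary event frames.

    events: (T, n_in) array of 0/1 events per time step.
    Returns the per-neuron spike counts, shape (n_hidden,).
    The neuron model and its parameters are illustrative assumptions.
    """
    n_hidden = w_rec.shape[0]
    v = np.zeros(n_hidden)        # membrane potentials
    spikes = np.zeros(n_hidden)   # spikes emitted at the previous step
    counts = np.zeros(n_hidden)   # accumulated spike-number embedding
    for x_t in events:
        # Leaky integration of input events and recurrent spikes.
        v = decay * v + x_t @ w_in + spikes @ w_rec
        spikes = (v >= threshold).astype(float)
        v = np.where(spikes > 0, 0.0, v)  # reset neurons that fired
        counts += spikes
    return counts

rng = np.random.default_rng(0)
n_in, n_hidden, T = 64, 32, 50    # hypothetical sizes, not from the paper
# Fixed random weights: drawn once and never updated by training.
w_in = rng.normal(0.0, 0.5, size=(n_in, n_hidden))
w_rec = rng.normal(0.0, 0.1, size=(n_hidden, n_hidden))
# Synthetic sparse event stream standing in for sensor data.
events = (rng.random((T, n_in)) < 0.1).astype(float)
embedding = lsm_spike_counts(events, w_in, w_rec)
```

Because the reservoir weights stay fixed, only the layers that consume `embedding` would need training, which is the source of the training-cost reduction the abstract reports.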

Paper Structure (21 sections, 7 equations, 5 figures)

Figures (5)

  • Figure 1: Hardware-software co-design using the hybrid analogue-digital system for a combined LSM-ANN model. a, The liquid state machine (LSM) is a fixed, random and recurrent spiking neural network (SNN), which encodes multimodal event signals (e.g. images and sounds). The LSM is implemented using the analogue random resistive memory. b, The projection layers are trainable artificial neural network (ANN) layers that map accumulated spiking features from different modalities to real-valued feature vectors. The projection layers are implemented digitally and optimized by minimizing contrastive loss for zero-shot learning. c, Optical photo of the 40 nm 256 Kb resistive memory-based in-memory computing macro. d, A cross-sectional transmission electron micrograph shows the resistive memory crossbar array, fabricated by the backend-of-line process to integrate with complementary metal-oxide-semiconductor (CMOS). e, The cross-sectional transmission electron micrograph reveals a $\mathrm{TaN}$/$\mathrm{TaO}_{x}$/$\mathrm{Ta}$/$\mathrm{TiN}$ resistive memory cell, operating as a stochastic resistor after dielectric breakdown. f, The schematic of the hybrid analogue-digital system. g, Conductance map of a 456$\times$201 resistive memory subarray shared by two LSM encoders (456$\times$201 and 264$\times$201) for different modalities (see Supplementary Fig. 1). h, Corresponding 30,000-cycle array conductance reading variance. i, The histogram of g. j, The cycle-to-cycle conductance of 40 randomly sampled resistive memory cells over 30,000 read cycles.
  • Figure 2: Event-based image classification with the N-MNIST dataset. a, Schematics of event-based image capture. Pixel changes between consecutive frames are encoded as events and input into the LSM encoder for spike number embedding, followed by a fully connected ANN classification layer. b, LSM hidden neuron spikes corresponding to the digit "7". c, Associated membrane potentials for selected neurons. d, Distribution of spike number embeddings for the test set, showing clear clustering. e, Confusion matrix with dominant diagonal elements. f, Accuracy comparison between experimental and simulated models (LSM-ANN), as well as fully trainable counterparts with SNN/ANN encoders/classifiers (RNN-ANN, SRNN-ANN, and SRNN-SNN), indicating a small performance gap. g, Breakdown of training complexity. The LSM exhibits a substantially lower training complexity than SRNN-SNN (by 817.95-fold) and SRNN-ANN (by 802.18-fold). h, Breakdown of inference energy of the model across different hardware platforms. The hybrid analogue-digital system showcases a 29.97-fold reduction in energy consumption compared to state-of-the-art digital hardware, emphasizing the high efficiency of resistive memory (RM).
  • Figure 3: Event-based audio classification with the N-TIDIGITS dataset. a, Schematic representation of the event-based audio data capture process using a dynamic audio sensor, comprising a filter bank and an event encoder. b, Example of time-binned event data for the spoken digit "1", serving as input to the LSM. c, Associated LSM neuron spikes corresponding to the input. d, Membrane potentials of selected neurons in response to the input. e, 3D distribution of spike number embeddings of the test set, visualized using t-SNE, demonstrating unsupervised clustering. f, Confusion matrix characterized by prominent diagonal elements, indicating high classification accuracy. g, Accuracy comparison between experimental and simulated models (LSM-ANN), as well as fully trainable counterparts with SNN/ANN encoders/classifiers (RNN-ANN, SRNN-ANN, and SRNN-SNN), exhibiting similar performance. h, Breakdown of training complexity. The LSM's training complexity is substantially lower than that of SRNN-SNN (by 1102.92-fold) and SRNN-ANN (by 1061.60-fold). i, Breakdown of inference energy across various hardware platforms. The estimated inference energy for the digit "7" is approximately 22.07-fold smaller than that of a fully digital implementation, confirming the enhanced energy efficiency of resistive memory (RM).
  • Figure 4: Zero-shot transfer learning of multimodal event data. a, Zero-shot transfer learning of event visual and audio data with the LSM-ANN using contrastive loss. The model is trained on event images "1" to "7" from the N-MNIST dataset and audios "one" to "seven" from the N-TIDIGITS dataset. The queries of unseen classes encompass images "8" and "9" from N-MNIST as well as audios "eight" and "nine" from N-TIDIGITS. b, Distribution of projected query samples from both seen and unseen classes using t-SNE, where the same classes of different modalities are clearly aligned while different classes are distinguishable (see Supplementary Fig. 22 for 3D t-SNE). c, Comparison of zero-shot classification accuracy. The LSM-ANN model attains accuracy comparable to state-of-the-art SRNN-based CLIP and Prototypical networks. d, Confusion matrix of query samples of seen ("1" to "7") and unseen classes ("8" and "9"). e, Training cost breakdown of different zero-shot transfer models. The LSM-ANN model features a 152.83-fold reduction in training complexity compared to SRNN-based CLIP and Prototypical networks (see Supplementary Fig. 23 for the CLIP model on the shared resistive memory). f, Inference energy of different hardware platforms. The hybrid analogue-digital system shows a 23.34-fold improvement over the state-of-the-art fully digital implementation (see Supplementary Table 17 for details).
  • Figure 5: Zero-shot transfer learning of brain-machine interface. a, Zero-shot transfer learning of neural and visual events with the simulated co-design using contrastive learning. The model is trained on captured neural recordings and corresponding event images of 22 randomly selected letters of the alphabet. The remaining 4 unseen classes are queries, namely "S", "T", "U", and "V". b, Distribution of feature embeddings from the projection layer visualized using t-SNE. As in the previous example, embeddings from the same class but different modalities are well aligned, while those from different classes do not overlap. c, Confusion matrix for event image retrieval based on neural recording queries in both seen and unseen classes, with dominant diagonal elements. d, Comparison of the top-1 and top-5 classification accuracy. The LSM-ANN model is close to the fully trainable SRNN-based CLIP network and better than the Prototypical network. e, Corresponding training cost breakdown. The LSM-ANN model demonstrates a 393.07-fold reduction in training complexity compared to SRNN-based CLIP and Prototypical networks. f, Comparison of forward inference energy consumption across different hardware platforms. The hybrid analogue-digital system exhibits more than a 160-fold improvement over state-of-the-art fully digital implementations.
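
The captions for Figures 1b, 4, and 5 describe projection layers trained with a contrastive loss so that embeddings of the same class from different modalities align, after which unseen classes are matched across modalities. The exact loss is not given in this overview, so the sketch below assumes a CLIP-style symmetric cross-entropy over cosine similarities. The names `contrastive_loss`, `zero_shot_match`, and `temperature` are hypothetical, and the random feature vectors are stand-ins for projected spike-count features, not data from the paper.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def log_softmax(z):
    """Row-wise log-softmax with the usual max subtraction for stability."""
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def contrastive_loss(feat_a, feat_b, temperature=0.1):
    """Symmetric cross-entropy over pairwise cosine similarities.

    Row i of feat_a and row i of feat_b are treated as a matched pair.
    This CLIP-style form is an assumption, not the paper's stated loss.
    """
    a, b = l2_normalize(feat_a), l2_normalize(feat_b)
    logits = a @ b.T / temperature
    idx = np.arange(len(a))
    loss_ab = -log_softmax(logits)[idx, idx].mean()    # modality A -> B
    loss_ba = -log_softmax(logits.T)[idx, idx].mean()  # modality B -> A
    return 0.5 * (loss_ab + loss_ba)

def zero_shot_match(queries, candidates):
    """Index of the most similar candidate (by cosine) for each query."""
    sims = l2_normalize(queries) @ l2_normalize(candidates).T
    return sims.argmax(axis=1)

rng = np.random.default_rng(0)
image_feat = rng.normal(size=(4, 8))                 # stand-in image features
audio_aligned = image_feat + 0.01 * rng.normal(size=(4, 8))
audio_shuffled = audio_aligned[[1, 2, 3, 0]]         # deliberately mismatched pairs
loss_aligned = contrastive_loss(image_feat, audio_aligned)
loss_shuffled = contrastive_loss(image_feat, audio_shuffled)
matches = zero_shot_match(audio_aligned, image_feat)
```

In this toy setup the aligned pairs give a lower loss than the shuffled pairs, and each audio query retrieves its own image row. Minimizing such a loss over the trainable projections is one way the cross-modal alignment shown in the t-SNE panels could be obtained.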