M2RU: Memristive Minion Recurrent Unit for Continual Learning at the Edge

Abdullah M. Zyarah, Dhireesha Kudithipudi

TL;DR

This work tackles continual learning for temporal data on edge devices, where energy constraints and data-movement bottlenecks hinder on-device training. It introduces M2RU, a mixed-signal accelerator that maps the Minion Recurrent Unit (MiRU) onto memristor crossbars and adds weighted-bit streaming, reservoir-based experience replay, and Direct Feedback Alignment (DFA) for on-chip training. Hardware demonstration shows ≈15 GOPS at 48.6 mW (≈312 GOPS/W) with less than 5% accuracy loss relative to software, plus a training-endurance lifetime of up to 12.2 years when gradient sparsification is used. Overall, M2RU delivers high-throughput, energy-efficient real-time temporal adaptation at the edge, improving energy efficiency by 29× over a digital CMOS design and enabling durable edge intelligence under domain shifts.

Abstract

Continual learning on edge platforms remains challenging because recurrent networks depend on energy-intensive training procedures and frequent data movement that are impractical for embedded deployments. This work introduces M2RU, a mixed-signal architecture that implements the minion recurrent unit for efficient temporal processing with on-chip continual learning. The architecture integrates weighted-bit streaming, which enables multi-bit digital inputs to be processed in crossbars without high-resolution conversion, and an experience replay mechanism that stabilizes learning under domain shifts. M2RU achieves 15 GOPS at 48.62 mW, corresponding to 312 GOPS/W, and maintains accuracy within 5% of software baselines on sequential MNIST and CIFAR-10 tasks. Compared with a CMOS digital design, the accelerator provides a 29× improvement in energy efficiency. Device-aware analysis shows an expected operational lifetime of 12.2 years under continual learning workloads. These results establish M2RU as a scalable and energy-efficient platform for real-time adaptation in edge-level temporal intelligence.
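
To make the idea of weighted-bit streaming concrete, the following is a minimal software sketch, not the authors' circuit. It assumes signed inputs in sign-magnitude form, streams one bit plane per step as ternary pulses (-1, 0, +1), and applies the bit weights digitally as powers of two. How the hardware applies the weights is not described in this summary, and the names `matvec` and `weighted_bit_stream_matvec` are hypothetical.

```python
def matvec(weights, x):
    """Reference matrix-vector product; weights is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def weighted_bit_stream_matvec(weights, x, n_bits):
    """Compute weights @ x by streaming one bit plane of x per step.

    Assumption for this sketch: x holds signed integers with
    |x_i| < 2**n_bits, encoded as sign and magnitude.
    """
    signs = [1 if xi >= 0 else -1 for xi in x]
    mags = [abs(xi) for xi in x]
    if any(m >= 2 ** n_bits for m in mags):
        raise ValueError("input magnitude does not fit in n_bits")

    acc = [0.0] * len(weights)
    for b in range(n_bits):
        # Ternary pulse per input line for bit plane b: -1, 0, or +1.
        pulses = [s * ((m >> b) & 1) for s, m in zip(signs, mags)]
        # One low-resolution "crossbar read" using only the pulse vector.
        partial = matvec(weights, pulses)
        # Weight the partial result by the bit significance and accumulate.
        acc = [a + (2 ** b) * p for a, p in zip(acc, partial)]
    return acc


if __name__ == "__main__":
    W = [[1, -2, 3], [0, 4, -1]]
    x = [5, -3, 7]
    print(weighted_bit_stream_matvec(W, x, n_bits=3))
    print(matvec(W, x))
```

Because the product is linear in the input, summing the per-bit partial results with weights 2^b reproduces the full-precision product, so each step only needs binary-level input pulses rather than a high-resolution conversion of the input.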

Paper Structure

This paper contains 21 sections, 20 equations, 5 figures, 1 table, and 1 algorithm.

Figures (5)

  • Figure 1: High-level block diagram of the proposed memristor-based MiRU accelerator, which consists of an RNN that processes time-series information and a data preparation unit that randomly captures examples from non-stationary input streams and stores them in the replay buffer after quantization.
  • Figure 2: Left: Mixed-signal architecture of the memristive M2RU accelerator designed for processing temporally structured data. Right: High-level overview of the on-chip training framework that enables continuous network adaptation within dynamic environments.
  • Figure 3: Left: Level-shifting circuit used to generate low-amplitude input pulses for weighted-bit streaming, supporting both positive and negative voltages for signed digital inputs. Right: $K$-WTA circuit used to approximate the softmax operation and sparsify gradients during on-chip learning.
  • Figure 4: Average test accuracy (after each task) of the proposed M2RU accelerator along with its software counterpart trained with DFA and the Adam optimizer, evaluated on sequential tasks from the permuted MNIST (a: $n_h$ = 100, b: $n_h$ = 256) and split CIFAR-10 (c: $n_h$ = 100, d: $n_h$ = 256) datasets.
  • Figure 5: (a) Average percentage error of matrix-vector multiplication during replay under uniform and stochastic quantization, (b) estimated lifespan of M2RU before and after applying gradient sparsification, (c) impact of network scaling and bit precision on network latency with and without tiling (dotted lines), and (d) breakdown of power consumption across the core units of the M2RU accelerator.
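
The random capture of examples into the replay buffer described for Figure 1 can be illustrated with classic reservoir sampling (Algorithm R). This is a minimal sketch under that assumption. The summary does not specify the exact sampling rule or the quantization step used on chip, and the class name `ReservoirBuffer` is hypothetical.

```python
import random


class ReservoirBuffer:
    """Fixed-size replay buffer filled by reservoir sampling (Algorithm R).

    After n examples have been offered, each one is held in the buffer
    with probability capacity / n, without knowing n in advance.
    """

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.items = []
        self.seen = 0                      # examples offered so far
        self._rng = random.Random(seed)

    def add(self, example):
        self.seen += 1
        if len(self.items) < self.capacity:
            # Buffer not yet full: always keep the example.
            self.items.append(example)
        else:
            # Keep the new example with probability capacity / seen,
            # overwriting a uniformly chosen slot.
            j = self._rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = example

    def sample(self, k):
        """Draw up to k stored examples for replay, without replacement."""
        return self._rng.sample(self.items, min(k, len(self.items)))


if __name__ == "__main__":
    buf = ReservoirBuffer(capacity=4, seed=0)
    for t in range(100):                   # stand-in for a data stream
        buf.add(t)
    print(buf.items, buf.sample(2))
```

The buffer size stays constant regardless of stream length, which is the property that makes this kind of sampling attractive for a memory-constrained replay store.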