FSL-HDnn: A 40 nm Few-shot On-Device Learning Accelerator with Integrated Feature Extraction and Hyperdimensional Computing

Weihong Xu, Chang Eun Song, Haichao Yang, Leo Liu, Meng-Fan Chang, Carlos H. Diaz, Tajana Rosing, Mingu Kang

TL;DR

FSL-HDnn addresses the bottlenecks of on-device learning (ODL) by pairing a parameter-efficient feature extractor (FE) with a gradient-free, single-pass few-shot learning (FSL) classifier based on hyperdimensional computing (HDC). The architecture uses weight clustering to reduce FE computation and a memory-efficient cyclic random projection to encode hypervectors (HVs), which lowers the energy and latency of end-to-end training. Early-exit inference and batched single-pass training further improve responsiveness and hardware utilization. The 40 nm CMOS chip achieves a training energy of 6 mJ/image and up to 28 images/s on a 10-way 5-shot task. The results show competitive FSL accuracy at a significantly lower training cost than gradient-based ODL accelerators. Overall, the work demonstrates a viable path to real-time, privacy-preserving on-device adaptation in resource-constrained environments.

Abstract

This paper introduces FSL-HDnn, an energy-efficient accelerator that implements the end-to-end pipeline of feature extraction and on-device few-shot learning (FSL). The accelerator addresses fundamental challenges of on-device learning (ODL) for resource-constrained edge applications through two synergistic modules: a parameter-efficient feature extractor that employs weight clustering, and an FSL classifier based on hyperdimensional computing (HDC). The feature extractor exploits weight clustering to reduce computational complexity, while the HDC-based FSL classifier eliminates gradient-based backpropagation, enabling single-pass training with substantially reduced latency. FSL-HDnn further lowers ODL and inference latency through two optimization strategies: an early-exit mechanism with branch feature extraction, and batched single-pass training that improves hardware utilization. Measurement results demonstrate that our chip, fabricated in a 40 nm CMOS process, delivers a training energy of 6 mJ/image and an end-to-end training throughput of 28 images/s on a 10-way 5-shot FSL task. The end-to-end training latency is also reduced by 2x to 20.9x compared to state-of-the-art ODL chips.
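
To make the idea of gradient-free, single-pass training concrete, here is a minimal Python sketch of an HDC-style few-shot classifier. It is an illustration under stated assumptions, not the chip's implementation: the function names (`make_projection`, `encode`, `train_single_pass`, `classify`) are hypothetical, the encoder is a plain dense random ±1 projection rather than the paper's memory-efficient cyclic random projection, and cosine similarity is assumed as the matching metric.

```python
import random


def make_projection(dim_in, dim_hv, seed=0):
    """Build a random +/-1 projection matrix (assumption: dense, not cyclic)."""
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(dim_in)] for _ in range(dim_hv)]


def encode(features, projection):
    """Project a feature vector into a hypervector (HV)."""
    return [sum(w * x for w, x in zip(row, features)) for row in projection]


def train_single_pass(samples, projection):
    """Accumulate one class HV per label; every sample is visited exactly once.

    No gradients and no iterations: training is just encode-and-add.
    """
    class_hvs = {}
    for features, label in samples:
        hv = encode(features, projection)
        acc = class_hvs.setdefault(label, [0.0] * len(hv))
        for i, value in enumerate(hv):
            acc[i] += value
    return class_hvs


def cosine(a, b):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(features, class_hvs, projection):
    """Return the label whose class HV is most similar to the query HV."""
    query = encode(features, projection)
    return max(class_hvs, key=lambda label: cosine(query, class_hvs[label]))


if __name__ == "__main__":
    proj = make_projection(dim_in=4, dim_hv=256, seed=1)
    support = [
        ([1.0, 0.0, 0.0, 0.0], "a"),
        ([0.9, 0.1, 0.0, 0.0], "a"),
        ([0.0, 0.0, 1.0, 0.0], "b"),
        ([0.0, 0.1, 0.9, 0.0], "b"),
    ]
    model = train_single_pass(support, proj)
    print(classify([1.0, 0.05, 0.0, 0.0], model, proj))
```

In this sketch the cost of learning a new class is one encoding and one vector addition per support sample, which is the property that lets an HDC classifier avoid backpropagation. In the described pipeline the inputs would be the feature extractor's outputs; the toy 4-dimensional vectors here only stand in for them.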

Paper Structure

This paper contains 32 sections, 6 equations, 19 figures, 1 table.

Figures (19)

  • Figure 1: Challenges for existing on-device learning accelerators.
  • Figure 2: Comparison of different on-device learning (ODL) algorithms: (a) full and (b) partial fine-tuning (FT), (c) learning pipeline in the proposed gradient-free FSL-HDnn architecture.
  • Figure 3: (a) FSL accuracy vs. training iterations for partial and full fine-tuning (FT) models and (b) accuracy vs. complexity (normalized to the smallest one) of the kNN, partial FT, full FT, and FSL-HDnn algorithms.
  • Figure 4: Weight clustering: (a) average weight clustering and index for each weight, (b) partial sum reuse based on common weight.
  • Figure 5: Feature extraction (FE) output error, model compression, and operation reduction of FSL-HDnn as compared to INT8-quantized ResNet-18 when varying $Ch_{\text{sub}}$ from 8 to 256.
  • ...and 14 more figures
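
The weight clustering and partial-sum reuse named in the Figure 4 caption can be sketched in a few lines of Python. This is a hedged illustration only: the helper names (`cluster_weights`, `clustered_dot`) are hypothetical, a simple 1-D k-means is assumed as the clustering method, and the paper's actual grouping of weights (for example by $Ch_{\text{sub}}$) is not reproduced. The point it shows is that once each weight is replaced by an index into a small table of cluster averages, inputs that share an index can be added first and multiplied once.

```python
def cluster_weights(weights, num_clusters, iters=10):
    """Assumed 1-D k-means: returns (centroids, index of each weight)."""
    lo, hi = min(weights), max(weights)
    step = (hi - lo) / max(num_clusters - 1, 1)
    centroids = [lo + step * k for k in range(num_clusters)]
    indices = [0] * len(weights)
    for _ in range(iters):
        # Assign every weight to its nearest centroid.
        indices = [
            min(range(num_clusters), key=lambda k: abs(w - centroids[k]))
            for w in weights
        ]
        # Move each centroid to the average of its members (keep it if empty).
        for k in range(num_clusters):
            members = [w for w, idx in zip(weights, indices) if idx == k]
            if members:
                centroids[k] = sum(members) / len(members)
    return centroids, indices


def clustered_dot(inputs, indices, centroids):
    """Dot product with partial-sum reuse.

    Inputs that share a weight index are accumulated first, so only
    len(centroids) multiplications are needed instead of len(inputs).
    """
    partial = [0.0] * len(centroids)
    for x, idx in zip(inputs, indices):
        partial[idx] += x
    return sum(c * p for c, p in zip(centroids, partial))


if __name__ == "__main__":
    weights = [0.5, 0.5, -1.0, -1.0, 0.5]
    inputs = [1.0, 2.0, 3.0, 4.0, 5.0]
    centroids, indices = cluster_weights(weights, num_clusters=2)
    print(centroids, indices)
    print(clustered_dot(inputs, indices, centroids))
```

With the toy weights above, which already take only two distinct values, the clustered result equals the exact dot product while using two multiplications instead of five. With real weights the centroids only approximate the originals, which is the output-error versus operation-reduction trade-off that the Figure 5 caption refers to.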