FSL-HDnn: A 40 nm Few-shot On-Device Learning Accelerator with Integrated Feature Extraction and Hyperdimensional Computing
Weihong Xu, Chang Eun Song, Haichao Yang, Leo Liu, Meng-Fan Chang, Carlos H. Diaz, Tajana Rosing, Mingu Kang
TL;DR
FSL-HDnn addresses the bottlenecks of on-device learning (ODL) by integrating a parameter-efficient feature extractor (FE) with a gradient-free, single-pass few-shot learning (FSL) classifier based on hyperdimensional computing (HDC). The architecture uses weight clustering to reduce FE computation and a memory-efficient cyclic random projection for hypervector (HV) encoding, enabling end-to-end training at lower energy and latency. Early-exit inference and batched single-pass training further improve responsiveness and hardware utilization. Fabricated in 40 nm CMOS, the chip achieves 6 mJ/image training energy and up to 28 images/s on a 10-way 5-shot task. The results show competitive FSL accuracy at significantly lower training cost than gradient-based ODL accelerators, indicating a viable path to real-time, privacy-preserving on-device adaptation in resource-constrained environments.
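To illustrate why a cyclic random projection can be memory-efficient, the following minimal Python sketch encodes a feature vector into a bipolar hypervector. It is one plausible reading, not the chip's implementation. It assumes that row i of the projection matrix is a single random base vector cyclically shifted by i positions, so only the base vector is stored rather than the full matrix. The function names `make_base` and `cyclic_encode`, the bipolar (+1/-1) base, and the sign quantization are all assumptions made for illustration.

```python
import random


def make_base(dim, seed=0):
    """Draw one random bipolar base vector of length `dim` (assumed format)."""
    rng = random.Random(seed)
    return [rng.choice((-1, 1)) for _ in range(dim)]


def cyclic_encode(features, base):
    """Project `features` onto len(base) dimensions and binarize by sign.

    Entry (i, j) of the implicit projection matrix is base[(i + j) % dim],
    i.e. each row is the base vector cyclically shifted by i. Only `dim`
    values are stored instead of dim * len(features).
    """
    dim = len(base)
    hv = []
    for i in range(dim):
        acc = sum(base[(i + j) % dim] * x for j, x in enumerate(features))
        hv.append(1 if acc >= 0 else -1)
    return hv


# Usage: encode a 4-element feature vector into a 64-dimensional hypervector.
base = make_base(64, seed=1)
hv = cyclic_encode([0.5, -1.0, 2.0, 0.25], base)
```

In this sketch the storage cost is 64 values for the base vector, compared with 64 × 4 values for an explicit random projection matrix; the saving grows with the feature dimension.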
Abstract
This paper introduces FSL-HDnn, an energy-efficient accelerator that implements the end-to-end pipeline of feature extraction and on-device few-shot learning (FSL). The accelerator addresses fundamental challenges of on-device learning (ODL) for resource-constrained edge applications through two synergistic modules: a parameter-efficient feature extractor that employs weight clustering, and an FSL classifier based on hyperdimensional computing (HDC). Weight clustering reduces the computational complexity of feature extraction, while the HDC-based FSL classifier eliminates gradient-based backpropagation, enabling single-pass training with substantially reduced latency. FSL-HDnn further lowers ODL and inference latency through two optimization strategies: an early-exit mechanism with branch feature extraction, and batched single-pass training that improves hardware utilization. Measurement results show that our chip, fabricated in a 40 nm CMOS process, achieves a training energy of 6 mJ/image and an end-to-end training throughput of 28 images/s on a 10-way 5-shot FSL task. End-to-end training latency is reduced by 2× to 20.9× compared with state-of-the-art ODL chips.
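The idea of gradient-free, single-pass training can be sketched in a few lines of Python. This is a generic HDC-style illustration under stated assumptions, not the accelerator's actual datapath. It assumes that each class prototype is the element-wise sum (bundling) of the hypervectors of its support samples, and that a query is assigned to the most similar prototype. The cosine similarity metric and the names `train_single_pass`, `cosine`, and `classify` are hypothetical choices.

```python
import math


def train_single_pass(support_hvs, labels, num_classes):
    """Build one prototype per class by summing its support hypervectors.

    Each support sample is visited exactly once; no gradients are computed.
    """
    dim = len(support_hvs[0])
    prototypes = [[0] * dim for _ in range(num_classes)]
    for hv, label in zip(support_hvs, labels):
        proto = prototypes[label]
        for k, v in enumerate(hv):
            proto[k] += v
    return prototypes


def cosine(a, b):
    """Cosine similarity (assumed metric); returns 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def classify(query_hv, prototypes):
    """Return the index of the prototype most similar to the query."""
    return max(range(len(prototypes)),
               key=lambda c: cosine(query_hv, prototypes[c]))


# Usage: a toy 2-way 2-shot task with 4-dimensional bipolar hypervectors.
support = [[1, 1, -1, -1], [1, 1, 1, -1], [-1, -1, 1, 1], [-1, 1, 1, 1]]
labels = [0, 0, 1, 1]
prototypes = train_single_pass(support, labels, num_classes=2)
predicted = classify([1, 1, -1, -1], prototypes)
```

Because training in this sketch is only an accumulation over the support set, adding a new class or a new shot requires no update to any other parameter, which is what makes a single pass sufficient.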
