Efficient Precision-Scalable Hardware for Microscaling (MX) Processing in Robotics Learning

Stef Cuyckens; Xiaoling Yi; Nitish Satya Murthy; Chao Fang; Marian Verhelst

TL;DR

This work tackles on-device continual learning for robotics by leveraging Microscaling (MX) data formats to reduce memory and energy while preserving gradient accuracy. It introduces a precision-scalable MX MAC unit and a square-based MX PE array that supports all MX variants and enables efficient backpropagation with reduced weight storage. Hardware evaluations against the state-of-the-art Dacapo show a 51% memory footprint reduction and 4× higher effective training throughput at iso-peak throughput, with comparable energy efficiency. The proposed GeMM core and square-block processing facilitate robust edge learning for robotics without cloud reliance.

Abstract

Autonomous robots require efficient on-device learning to adapt to new environments without cloud dependency. For this edge training, Microscaling (MX) data types offer a promising solution by combining integer and floating-point representations with shared exponents, reducing energy consumption while maintaining accuracy. However, the state-of-the-art continuous learning processor, Dacapo, is limited by its MXINT-only support and by inefficient vector-based grouping during backpropagation. To the best of our knowledge, this paper presents the first work that addresses these limitations, with two key innovations: (1) a precision-scalable arithmetic unit that supports all six MX data types by exploiting sub-word parallelism and unified integer and floating-point processing; and (2) support for square shared-exponent groups to enable efficient weight handling during backpropagation, removing storage redundancy and quantization overhead. We evaluate our design against Dacapo at iso-peak throughput on four robotics workloads in TSMC 16nm FinFET technology at 400MHz. Our design achieves a 51% lower memory footprint and 4x higher effective training throughput at comparable energy efficiency, enabling efficient continual learning for robotics at the edge.
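To make the role of square shared-exponent groups concrete, the following is a minimal sketch, not the paper's hardware or its exact quantizer. It assumes a simplified MXINT8-like scheme (one power-of-two scale per group, signed 8-bit elements), a toy 4x4 weight matrix, and toy group shapes of 1x4 (vector) and 2x2 (square); all function names are hypothetical. It checks whether quantizing a weight matrix and then transposing it gives the same result as quantizing the transposed matrix, which is the access pattern backpropagation needs.

```python
import math

def shared_exponent(values, elem_bits=8):
    """Power-of-two exponent shared by one group (simplified MXINT-like rule)."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0
    # frexp gives amax = m * 2**e with 0.5 <= m < 1, so floor(log2(amax)) == e - 1.
    _, e = math.frexp(amax)
    return (e - 1) - (elem_bits - 2)

def quantize_group(values, elem_bits=8):
    """Return (shared exponent, clamped integer elements) for one group."""
    exp = shared_exponent(values, elem_bits)
    scale = 2.0 ** exp
    qmax = 2 ** (elem_bits - 1) - 1
    ints = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return exp, ints

def fake_quantize(matrix, group_rows, group_cols, elem_bits=8):
    """Quantize and dequantize with one shared exponent per group_rows x group_cols tile."""
    rows, cols = len(matrix), len(matrix[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r0 in range(0, rows, group_rows):
        for c0 in range(0, cols, group_cols):
            idx = [(r, c)
                   for r in range(r0, min(r0 + group_rows, rows))
                   for c in range(c0, min(c0 + group_cols, cols))]
            exp, ints = quantize_group([matrix[r][c] for r, c in idx], elem_bits)
            for (r, c), q in zip(idx, ints):
                out[r][c] = q * 2.0 ** exp
    return out

def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

# Toy weights (assumed values): one large entry dominates the first row.
W = [[100.0, 0.3, 0.3, 0.3],
     [0.3, 0.3, 0.3, 0.3],
     [0.3, 0.3, 0.3, 0.3],
     [0.3, 0.3, 0.3, 0.3]]

# Vector groups (1x4): groups run along rows, so W and W^T are grouped differently.
vector_consistent = transpose(fake_quantize(W, 1, 4)) == fake_quantize(transpose(W), 1, 4)

# Square groups (2x2): each tile holds the same elements before and after transposition.
square_consistent = transpose(fake_quantize(W, 2, 2)) == fake_quantize(transpose(W), 2, 2)

print("vector groups transpose-consistent:", vector_consistent)
print("square groups transpose-consistent:", square_consistent)
```

In this toy setup the square-group check holds, because a square tile contains the same set of values whether the matrix is read normally or transposed, so a single stored copy serves both passes. The vector-group check fails, because the small entries in the first row share a scale with the large entry in one orientation but not in the other. This is one plausible reading of the storage redundancy and requantization overhead that the abstract attributes to vector-based grouping.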
