Efficient Precision-Scalable Hardware for Microscaling (MX) Processing in Robotics Learning

Stef Cuyckens, Xiaoling Yi, Nitish Satya Murthy, Chao Fang, Marian Verhelst

TL;DR

This work targets on-device continual learning for robotics, using Microscaling (MX) data formats to reduce memory and energy while preserving gradient accuracy. It introduces a precision-scalable MX MAC unit and a square-based MX PE array that together support all six MX data types and enable efficient backpropagation with reduced weight storage. Hardware evaluations against the state-of-the-art Dacapo at iso-peak throughput show a 51% lower memory footprint and 4× higher effective training throughput, with comparable energy efficiency. The proposed GeMM core and its square-block processing enable on-device robot learning without relying on the cloud.

Abstract

Autonomous robots require efficient on-device learning to adapt to new environments without cloud dependency. For this edge training, Microscaling (MX) data types offer a promising solution by combining integer and floating-point representations with shared exponents, reducing energy consumption while maintaining accuracy. However, the state-of-the-art continuous learning processor, namely Dacapo, is limited by its MXINT-only support and its inefficient vector-based grouping during backpropagation. In this paper, we present, to the best of our knowledge, the first work that addresses these limitations, with two key innovations: (1) a precision-scalable arithmetic unit that supports all six MX data types by exploiting sub-word parallelism and unified integer and floating-point processing; and (2) support for square shared-exponent groups to enable efficient weight handling during backpropagation, removing storage redundancy and quantization overhead. We evaluate our design against Dacapo at iso-peak throughput on four robotics workloads in TSMC 16 nm FinFET technology at 400 MHz, reaching a 51% lower memory footprint and 4× higher effective training throughput while achieving comparable energy efficiency, enabling efficient robotics continual learning at the edge.
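
To make the second innovation concrete, the following is a minimal sketch, not the paper's implementation, of why square shared-exponent groups help during backpropagation. It assumes an MXINT8-like format (signed 8-bit elements sharing one power-of-two scale per group) and a simplified scale-selection rule. The helper names `shared_scale` and `quantize_blocks` are hypothetical, and the 1×64 vector group is chosen only to match the 64-element size of an 8×8 square group. The backward pass needs the transposed weights: with square groups, the quantized weights can simply be transposed, whereas with vector groups the transposed matrix has different groups and would need to be requantized or stored a second time.

```python
import math
import random


def shared_scale(values, elem_bits=8):
    """Power-of-two scale such that the largest magnitude fits a signed integer.

    Simplified rule for illustration; not the exact MX specification rule.
    """
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 1.0
    qmax = 2 ** (elem_bits - 1) - 1
    return 2.0 ** math.ceil(math.log2(amax / qmax))


def quantize_blocks(mat, block_rows, block_cols, elem_bits=8):
    """Quantize-dequantize `mat` with one shared scale per block."""
    rows, cols = len(mat), len(mat[0])
    qmax = 2 ** (elem_bits - 1) - 1
    out = [[0.0] * cols for _ in range(rows)]
    for r0 in range(0, rows, block_rows):
        for c0 in range(0, cols, block_cols):
            cells = [(r, c)
                     for r in range(r0, min(r0 + block_rows, rows))
                     for c in range(c0, min(c0 + block_cols, cols))]
            scale = shared_scale([mat[r][c] for r, c in cells], elem_bits)
            for r, c in cells:
                q = max(-qmax, min(qmax, round(mat[r][c] / scale)))
                out[r][c] = q * scale
    return out


def transpose(mat):
    return [list(col) for col in zip(*mat)]


# Synthetic weights whose magnitude varies by row, so row groups and
# column groups end up with different shared scales.
random.seed(0)
n = 64
w = [[random.uniform(-1.0, 1.0) * 2 ** (r % 8) for _ in range(n)]
     for r in range(n)]

# Square 8x8 groups: quantizing W^T equals transposing quantized W.
square_commutes = (quantize_blocks(transpose(w), 8, 8)
                   == transpose(quantize_blocks(w, 8, 8)))

# 1x64 vector groups: the transposed matrix has different groups.
vector_commutes = (quantize_blocks(transpose(w), 1, 64)
                   == transpose(quantize_blocks(w, 1, 64)))

print(square_commutes, vector_commutes)  # prints: True False
```

The square case is exact because each 8×8 block of the transposed matrix holds the same 64 values as the mirrored block of the original, so both share one scale. That is the property that lets a single stored copy of block-quantized weights serve both the forward and backward passes.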

Paper Structure

This paper contains 15 sections, 8 figures, and 4 tables.

Figures (8)

  • Figure 1: Overview of goals, challenges, and our contributions: (1) precision-scalable MX MAC unit supporting all six MX formats combining INT and FP operations; (2) 64-element square MX groups for training at the edge and the design of a square-based MX PE Array for GeMM Core.
  • Figure 2: Validation loss curves of the concrete MX data types on four robotics training workloads, compared with the FP32 baseline, showing that FP32 can be replaced by low-bit MX types for robotics learning.
  • Figure 3: The precision-scalable MAC unit in the 3 scaling modes: (a) INT8, (b) FP8/FP6, and (c) FP4, respectively. The multiplication is indicated in yellow, the L1 adder in purple, the L2 adder in red, the FP accumulation addition in orange, and the accumulation register in green.
  • Figure 4: The L1 and L2 adders. The INT8 path is indicated in brown, FP8/FP6 in red, and FP4 in blue.
  • Figure 5: Training computation graphs with MX vector and square block quantizers. Computational and memory footprint gains can be achieved for square blocks, by enabling storage of block-quantized parameters for backpropagation.
  • ...and 3 more figures