MCUBERT: Memory-Efficient BERT Inference on Commodity Microcontrollers

Zebin Yang, Renze Chen, Taiqiang Wu, Ngai Wong, Yun Liang, Runsheng Wang, Ru Huang, Meng Li

TL;DR

For the first time, MCUBERT enables lightweight BERT models to run on commodity MCUs and to process more than 512 tokens with less than 256KB of memory.

Abstract

In this paper, we propose MCUBERT to enable language models like BERT on tiny microcontroller units (MCUs) through network and scheduling co-optimization. We observe that the embedding table is the major storage bottleneck for tiny BERT models. Hence, at the network level, we propose an MCU-aware two-stage neural architecture search algorithm based on clustered low-rank approximation for embedding compression. To reduce the inference memory requirements, we further propose a novel fine-grained MCU-friendly scheduling strategy. Through careful computation tiling and re-ordering as well as kernel design, we drastically increase the input sequence lengths supported on MCUs without any latency or accuracy penalty. MCUBERT reduces the parameter size of BERT-tiny and BERT-mini by 5.7$\times$ and 3.0$\times$ and the execution memory by 3.5$\times$ and 4.3$\times$, respectively. MCUBERT also achieves a 1.5$\times$ latency reduction. For the first time, MCUBERT enables lightweight BERT models to run on commodity MCUs and to process more than 512 tokens with less than 256KB of memory.
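To make the storage argument concrete, the sketch below counts parameters for a dense embedding table and for one common form of clustered low-rank approximation, in which the vocabulary is split into clusters and cluster i stores an n_i x r_i factor plus an r_i x d projection. This is an illustration only: the vocabulary size, hidden dimension, cluster sizes, and ranks are assumed values, not the configuration found by the paper's search, and the function names are hypothetical.

```python
def full_embedding_params(vocab_size: int, hidden_dim: int) -> int:
    """Parameters in a dense vocab_size x hidden_dim embedding table."""
    return vocab_size * hidden_dim


def clustered_low_rank_params(cluster_sizes, ranks, hidden_dim: int) -> int:
    """Parameters when cluster i stores an (n_i x r_i) factor and an
    (r_i x hidden_dim) projection back to the hidden dimension."""
    if len(cluster_sizes) != len(ranks):
        raise ValueError("need exactly one rank per cluster")
    return sum(n * r + r * hidden_dim for n, r in zip(cluster_sizes, ranks))


# Assumed values for a BERT-tiny-like model (not taken from the paper).
VOCAB_SIZE = 30522
HIDDEN_DIM = 128
# Hypothetical split: a small cluster with a high rank,
# a large cluster with a low rank.
CLUSTER_SIZES = [2000, 8000, 20522]
RANKS = [64, 32, 8]

dense = full_embedding_params(VOCAB_SIZE, HIDDEN_DIM)
compressed = clustered_low_rank_params(CLUSTER_SIZES, RANKS, HIDDEN_DIM)
print(f"dense: {dense}, clustered low-rank: {compressed}, "
      f"ratio: {dense / compressed:.2f}x")
```

The ratio printed here depends entirely on the assumed cluster sizes and ranks, so it should not be compared with the compression figures reported in the abstract.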

Paper Structure

This paper contains 33 sections, 4 equations, 11 figures, 5 tables, 3 algorithms.

Figures (11)

  • Figure 1: Enabling BERT on MCUs faces memory challenges: (a) the Flash storage limits the model size; (b) the SRAM limits the peak execution memory; (c) for long sequence lengths, the memory requirements of both multi-head attention (MHA) and the multi-layer perceptron (MLP) become bottlenecks.
  • Figure 2: MCUBERT overview. (Params stands for parameters, Acc stands for MNLI accuracy, and OOM stands for out of memory.)
  • Figure 3: Our proposed NAS search formulation for embedding compression.
  • Figure 4: MLP scheduling to reduce peak memory. The tensor shown in yellow is held in SRAM at the memory bottleneck.
  • Figure 5: MHA scheduling to reduce the tensor transformation latency and peak memory.
  • ...and 6 more figures
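
As a rough illustration of why tiling an MLP can lower peak memory without changing the result, the sketch below applies a generic row-wise (per-token) tiling to a two-layer MLP with GELU. It is not the schedule shown in Figure 4. The function names, the tile size, and the toy dimensions are all assumptions. Because the MLP processes each token independently, computing a few rows at a time gives the same output while only one tile of the wide intermediate tensor is live at once.

```python
import math
import random


def gelu(x: float) -> float:
    """Exact (erf-based) GELU activation."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def matmul(a, b):
    """Plain list-of-lists matrix product: (rows x k) times (k x cols)."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def mlp_full(x, w1, w2):
    """Untiled MLP: the whole seq_len x 4d intermediate is live at once."""
    hidden = [[gelu(v) for v in row] for row in matmul(x, w1)]
    return matmul(hidden, w2), len(hidden) * len(hidden[0])


def mlp_tiled(x, w1, w2, tile_rows: int):
    """Row-tiled MLP: only a tile_rows x 4d intermediate is live at once."""
    out, peak_hidden = [], 0
    for start in range(0, len(x), tile_rows):
        tile = x[start:start + tile_rows]
        hidden = [[gelu(v) for v in row] for row in matmul(tile, w1)]
        peak_hidden = max(peak_hidden, len(hidden) * len(hidden[0]))
        out.extend(matmul(hidden, w2))
    return out, peak_hidden


# Toy sizes chosen for illustration only.
random.seed(0)
SEQ_LEN, D = 8, 4
x = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(SEQ_LEN)]
w1 = [[random.uniform(-1, 1) for _ in range(4 * D)] for _ in range(D)]
w2 = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(4 * D)]

full_out, full_peak = mlp_full(x, w1, w2)
tiled_out, tiled_peak = mlp_tiled(x, w1, w2, tile_rows=2)
same = all(math.isclose(a, b) for ra, rb in zip(full_out, tiled_out)
           for a, b in zip(ra, rb))
print(f"outputs match: {same}, intermediate elements: "
      f"{full_peak} untiled vs {tiled_peak} tiled")
```

The sketch tracks only the size of the intermediate activation. A real MCU schedule would also need to account for the input, output, and weight buffers.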