DTMM: Deploying TinyML Models on Extremely Weak IoT Devices with Pruning

Lixiang Han, Zhen Xiao, Zhenjiang Li

TL;DR

DTMM tackles the challenge of deploying TinyML models on ultra-constrained MCUs by introducing filterlet-based pruning and a compact storage format (FWCS) that enables fine-grained pruning with low indexing overhead. It couples a specialized SIMD-accelerated convolution operator with a pruning strategy scheduler, delivering a practical end-to-end solution that remains compatible with commercial ML frameworks. Prototype results on Cortex-M55 show that DTMM can substantially reduce model size and inference latency relative to both structured and unstructured baselines while maintaining accuracy, and it fits within tight SRAM budgets. This work offers a scalable path toward autonomous, privacy-preserving on-device intelligence for widespread IoT deployments.

Abstract

DTMM is a library for efficiently deploying and executing machine learning models on weak IoT devices such as microcontroller units (MCUs). It is motivated by the emerging field of tiny machine learning (TinyML), which aims to extend machine learning to the many low-end IoT devices in use and thereby achieve ubiquitous intelligence. Because embedded devices have limited capability, models must be compressed before deployment by pruning enough of their weights. Pruning has been studied extensively on many computing platforms, but two of its key requirements become harder to meet on MCUs: models must be deeply compressed without significantly compromising accuracy, and they must still execute efficiently after pruning. Current solutions achieve only one of these objectives, not both. In this paper, we find that pruned models have great potential for efficient deployment and execution on MCUs. We therefore propose DTMM, which combines pruning unit selection, pre-execution pruning optimizations, runtime acceleration, and post-execution low-cost storage to close this gap. DTMM can be integrated into commercial ML frameworks for practical deployment, and a prototype system has been developed. Extensive experiments on various models show promising gains over state-of-the-art methods.
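
To make the idea of a "pruning unit" concrete, the sketch below applies three granularities to one toy layer: whole filters (structured), individual weights (unstructured), and filterlets. It is an illustration under stated assumptions, not DTMM's implementation. This summary does not define a filterlet, so the code assumes it is the group of weights at one spatial position across all channels of a single filter. The L1-magnitude selection rule, the `weights[f][p][c]` layout, and all function names are also assumptions made for the example.

```python
# Toy layer, assumed layout: weights[f][p][c]
#   f = filter index, p = spatial position, c = channel (innermost).

def l1(ws):
    """Sum of absolute values, used here as a simple importance score."""
    return sum(abs(w) for w in ws)

def zero_units(weights, drop, key):
    """Copy the layer, zeroing every weight whose unit key is in `drop`."""
    return [[[0 if key(f, p, c) in drop else w
              for c, w in enumerate(pos)]
             for p, pos in enumerate(filt)]
            for f, filt in enumerate(weights)]

def prune_structured(weights, n):
    """Zero the n whole filters with the smallest L1 norm."""
    scores = sorted((l1([w for pos in filt for w in pos]), f)
                    for f, filt in enumerate(weights))
    drop = {f for _, f in scores[:n]}
    return zero_units(weights, drop, lambda f, p, c: f)

def prune_unstructured(weights, n):
    """Zero the n individual weights with the smallest magnitude."""
    scores = sorted((abs(w), f, p, c)
                    for f, filt in enumerate(weights)
                    for p, pos in enumerate(filt)
                    for c, w in enumerate(pos))
    drop = {(f, p, c) for _, f, p, c in scores[:n]}
    return zero_units(weights, drop, lambda f, p, c: (f, p, c))

def prune_filterlets(weights, n):
    """Zero the n (filter, position) groups with the smallest L1 norm.

    Assumption: one group = one filterlet (all channels at one position).
    """
    scores = sorted((l1(pos), f, p)
                    for f, filt in enumerate(weights)
                    for p, pos in enumerate(filt))
    drop = {(f, p) for _, f, p in scores[:n]}
    return zero_units(weights, drop, lambda f, p, c: (f, p))

# Two filters, two spatial positions, two channels.
layer = [
    [[5, -1], [1, 4]],
    [[2, 1], [-3, 3]],
]

# Each call removes four of the eight weights, but in a different pattern.
structured = prune_structured(layer, 1)
unstructured = prune_unstructured(layer, 4)
filterlet = prune_filterlets(layer, 2)
print(structured, unstructured, filterlet, sep="\n")
```

All three calls reach the same sparsity here. They differ in which weights survive and in how much bookkeeping is needed to locate the survivors, which is the trade-off the abstract describes.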

Paper Structure

This paper contains 23 sections, 8 equations, 15 figures, 2 tables, 1 algorithm.

Figures (15)

  • Figure 1: Illustration of weight pruning to fit TinyML models on MCUs. Blue squares represent the weights to be pruned. (a) One convolution layer contains multiple filters. (b) Structured pruning removes all the weights from the selected filter(s). (c) Unstructured pruning can remove arbitrary weights. (d) DTMM removes all the weights from the selected filterlets.
  • Figure 2: Overview of the DTMM design.
  • Figure 3: Weights in filters are stored contiguously in physical storage following a channel-major order. We plot the physical storage in three lines.
  • Figure 4: Illustration of how unpruned weights from three filters are managed with CSR structure in unstructured pruning.
  • Figure 5: Illustration of how weights are pruned with filterlet and how remaining weights are stored using FWCS.
  • ...and 10 more figures
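
The captions for Figures 3 and 4 refer to channel-major storage and to CSR indexing of unpruned weights. The sketch below illustrates both on three toy filters. It uses the standard CSR (compressed sparse row) layout and is not taken from the paper. Reading "channel-major" as "the channel index varies fastest" is an assumption, as are the `filt[row][col][channel]` layout, the sample values, and the function names. FWCS is not sketched because this summary does not describe its format.

```python
def flatten_channel_major(filt):
    """Flatten filt[row][col][channel] so the channel index varies fastest.

    Assumption: this is what the Figure 3 caption means by channel-major order.
    """
    return [w for row in filt for pos in row for w in pos]

def to_csr(rows):
    """Encode equal-length rows as standard CSR: (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in rows:
        for col, w in enumerate(row):
            if w != 0:
                values.append(w)     # surviving weight
                col_idx.append(col)  # its offset inside the flattened filter
        row_ptr.append(len(values))  # where the next filter's entries start
    return values, col_idx, row_ptr

def from_csr(values, col_idx, row_ptr, n_cols):
    """Rebuild the dense rows from a CSR encoding."""
    rows = []
    for r in range(len(row_ptr) - 1):
        row = [0] * n_cols
        for k in range(row_ptr[r], row_ptr[r + 1]):
            row[col_idx[k]] = values[k]
        rows.append(row)
    return rows

# Three filters after unstructured pruning: 1x2 spatial positions, 3 channels.
filters = [
    [[[3, 0, 0], [-1, 0, 2]]],
    [[[0, 0, 5], [0, 0, 0]]],
    [[[0, 4, 0], [0, -2, 0]]],
]

rows = [flatten_channel_major(f) for f in filters]
values, col_idx, row_ptr = to_csr(rows)

# CSR stores one column index per surviving weight plus one pointer per row
# boundary, so the index arrays can outnumber the weights they describe.
index_entries = len(col_idx) + len(row_ptr)
print(values, col_idx, row_ptr, index_entries)
```

In this toy case six surviving weights need ten index entries. That per-weight indexing cost is the overhead the TL;DR says FWCS is designed to keep low.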