TinyM$^2$Net-V3: Memory-Aware Compressed Multimodal Deep Neural Networks for Sustainable Edge Deployment

Hasib-Al Rashid; Tinoosh Mohsenin

TL;DR

This work tackles the environmental and resource challenges of modern AI by proposing TinyM$^2$Net-V3, a memory-aware, multimodal deep learning framework for sustainable edge deployment. It combines memory-aware knowledge distillation with uniform 8-bit quantization to compress multimodal models so that they fit within on-chip memory, enabling low-latency, low-power inference on GAPuino and Raspberry Pi 4B. The approach is evaluated on two case studies: COVID-19 detection from cough, speech, and breathing audio, and pose classification from depth and thermal images. The resulting models have inference footprints of 6 KB and 58 KB and reach accuracies of 92.95% and 90.7%, respectively, with millisecond-scale latencies and favorable energy efficiency. The results demonstrate practical, energy-efficient multimodal edge inference suitable for sustainable AI deployment at the device level.

Abstract

The advancement of sophisticated artificial intelligence (AI) algorithms has led to a notable increase in energy usage and carbon dioxide emissions, intensifying concerns about climate change. This growing problem has brought the environmental sustainability of AI technologies to the forefront, especially as they expand across various sectors. In response, there is an urgent need for sustainable AI solutions. These solutions must focus on energy-efficient embedded systems that can handle diverse data types even in environments with limited resources, thereby ensuring both technological progress and environmental responsibility. Integrating complementary multimodal data into tiny machine learning models for edge devices is challenging because it increases complexity, latency, and power consumption. This work introduces TinyM$^2$Net-V3, a system that processes different modalities of complementary data, designs deep neural network (DNN) models, and applies model compression techniques, including knowledge distillation and low bit-width quantization, with memory-aware considerations so that models fit within the lower levels of the memory hierarchy, reducing latency and improving energy efficiency on resource-constrained devices. We evaluated TinyM$^2$Net-V3 in two multimodal case studies: COVID-19 detection using cough, speech, and breathing audio, and pose classification from depth and thermal images. With tiny inference models (6 KB and 58 KB), we achieved accuracies of 92.95% and 90.7%, respectively. Our tiny machine learning models, deployed on resource-limited hardware, demonstrated millisecond-scale latencies and very high power efficiency.
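To make the low bit-width quantization step concrete, the following is a minimal sketch of symmetric, per-tensor uniform 8-bit quantization in NumPy, together with a helper that estimates the weight footprint in KB. It is an illustration only, not the authors' implementation: the function names (`quantize_uniform_int8`, `dequantize`, `footprint_kb`), the symmetric per-tensor scheme, and the random example weights are all assumptions introduced here.

```python
import numpy as np


def quantize_uniform_int8(w):
    """Symmetric per-tensor uniform quantization of a float array to int8.

    Assumption: one scale per tensor and a zero-point of 0; the paper's
    exact quantization scheme is not specified in this summary.
    """
    max_abs = float(np.max(np.abs(w)))
    # Map the largest magnitude to 127; guard against an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q, scale):
    """Recover an approximate float32 tensor from int8 values and a scale."""
    return q.astype(np.float32) * scale


def footprint_kb(num_params, bits):
    """Weight storage in KB for num_params parameters at the given bit width."""
    return num_params * bits / 8 / 1024


# Hypothetical example weights, not taken from the paper.
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.1, size=(64, 32)).astype(np.float32)

q_weights, scale = quantize_uniform_int8(weights)
max_error = float(np.max(np.abs(weights - dequantize(q_weights, scale))))
```

Under this scheme the rounding error per weight is bounded by half a quantization step (`scale / 2`), and moving from 32-bit floats to 8-bit integers cuts weight storage by a factor of four, which is the kind of reduction that helps a model fit into a lower level of the memory hierarchy.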

Paper Structure (10 sections, 2 equations, 6 figures, 2 tables)

This paper contains 10 sections, 2 equations, 6 figures, 2 tables.

Introduction
Proposed TinyM$^2$Net-V3 System
Multimodal DNN Model Architecture Design
Memory-Aware Model Compression
Deployment on Resource-Constrained Hardware
TinyM$^2$Net-V3 Evaluation Results and Analysis
Evaluation Case-Study 1: COVID-19 Detection from Multimodal Audios
Evaluation Case-Study 2: Pose Classification from Multimodal Depth Images and Thermal Images
TinyM$^2$Net-V3 Hardware Implementation Results and Analysis
Conclusion

Figures (6)

Figure 1: High-level overview of the proposed TinyM$^2$Net-V3. TinyM$^2$Net-V3 can handle any number of data modalities, design ML models for specific tasks, compress the models using two state-of-the-art compression techniques (knowledge distillation and low bit-width quantization), and then deploy them on resource-constrained tiny hardware.
Figure 2: Flow diagram of the proposed TinyM$^2$Net-V3 system. The system takes pre-processed multimodal inputs and applies the steps shown in the diagram in sequence.
Figure 3: (a) Hardware architecture of the GAP8 microprocessor. (b) Memory hierarchy of GAP8, which has 100 KB of L1 memory (80 KB shared in the compute engine + 20 KB for the low-power MCU), 512 KB of L2 memory, and 8 MB of DRAM. (c) Hardware architecture of the Arm Cortex-A72 microprocessor used in the Raspberry Pi 4B. (d) Memory hierarchy of the Arm Cortex-A72 CPU, which has 80 KB of L1 memory (48 KB instruction cache + 32 KB data cache), 1 MB of L2 memory, 4 GB of DRAM, and 32 GB of external flash.
Figure 4: The model architecture of the proposed TinyM$^2$Net-V3 for (a) Case-study 1 and (b) Case-study 2. Here, Conv2D = 2 dimensional CNN, SeparableConv2D = 2 dimensional depthwise-separable CNN and FC = Fully Connected Layer.
Figure 5: (a) TinyM$^2$Net-V3 classification results for Case-Study 1 in both unimodal and multimodal settings. The multimodal setting improved accuracy by 7% compared to the unimodal (speech) setting. Model compression reduced the accuracy of the multimodal setting by 1.6%. (b) Experiments for memory-aware knowledge distillation in Case-Study 1. (c) TinyM$^2$Net-V3 classification results for Case-Study 2 in both unimodal and multimodal settings. The multimodal setting improved accuracy by 6% compared to the unimodal (thermal) setting. Model compression reduced the accuracy of the multimodal setting by 1.4%. (d) Experiments for memory-aware knowledge distillation in Case-Study 2.
...and 1 more figure
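The memory capacities listed in the Figure 3 caption can be used for a rough check of which memory level a compressed model could reside in. The sketch below is a simplified illustration under stated assumptions, not the paper's memory-aware procedure: it compares only the model size against the total capacity of each level and ignores activations, program code, and the instruction/data split of the L1 memory. The names `GAP8_LEVELS_KB`, `CORTEX_A72_LEVELS_KB`, and `smallest_fitting_level` are hypothetical.

```python
# Capacities in KB, taken from the Figure 3 caption.
GAP8_LEVELS_KB = [("L1", 100), ("L2", 512), ("DRAM", 8 * 1024)]
CORTEX_A72_LEVELS_KB = [("L1", 80), ("L2", 1024), ("DRAM", 4 * 1024 * 1024)]


def smallest_fitting_level(model_kb, levels):
    """Return the name of the first (fastest) level whose capacity holds the model.

    `levels` must be ordered from fastest/smallest to slowest/largest.
    Returns None when the model exceeds every listed level.
    Simplification: only raw model size is compared against total capacity.
    """
    for name, capacity_kb in levels:
        if model_kb <= capacity_kb:
            return name
    return None


# Model sizes reported in the abstract: 6 KB and 58 KB.
placements = {
    (device, size_kb): smallest_fitting_level(size_kb, levels)
    for device, levels in [("GAP8", GAP8_LEVELS_KB), ("Cortex-A72", CORTEX_A72_LEVELS_KB)]
    for size_kb in (6, 58)
}
```

Under this coarse capacity-only comparison, both reported model sizes fall below the total L1 capacity of either device. A real deployment would also need to budget for intermediate activations and runtime buffers.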
