Computer Vision Model Compression Techniques for Embedded Systems: A Survey

Alexandre Lopes; Fernando Pereira dos Santos; Diulhio de Oliveira; Mauricio Schiezaro; Helio Pedrini

TL;DR

This survey analyzes model compression techniques for computer vision targeted at embedded systems, categorizing approaches into Knowledge Distillation (KD), Network Pruning, Network Quantization, and Low-Rank Matrix Factorization. It compares methods across standard CV tasks and datasets, discusses device-specific performance, and provides practical guidelines and a repository of case studies to aid researchers. The findings emphasize that performance depends heavily on the target hardware and workload. KD and quantization are the most active areas, while compression of transformer-based CV models still presents open challenges. The work also highlights the importance of benchmarking on the actual deployment device and suggests combining techniques to achieve better trade-offs for real-world embedded applications.

Abstract

Deep neural networks have consistently represented the state of the art in most computer vision problems. In these scenarios, larger and more complex models have demonstrated superior performance to smaller architectures, especially when trained with plenty of representative data. With the recent adoption of Vision Transformer (ViT) based architectures and advanced Convolutional Neural Networks (CNNs), the total number of parameters of leading backbone architectures increased from 62M parameters in 2012 with AlexNet to 7B parameters in 2024 with AIM-7B. Consequently, deploying such deep architectures faces challenges in environments with processing and runtime constraints, particularly in embedded systems. This paper covers the main model compression techniques applied to computer vision tasks, enabling modern models to be used in embedded systems. We present the characteristics of compression subareas, compare different approaches, and discuss how to choose the best technique and what variations to expect when analyzing it on various embedded devices. We also share code to assist researchers and new practitioners in overcoming initial implementation challenges for each subarea and present trends for Model Compression. Case studies of model compression are available at https://github.com/venturusbr/cv-model-compression.

Paper Structure (27 sections, 5 equations, 7 figures, 5 tables, 5 algorithms)

Introduction
Methodology
Compression Techniques
Knowledge Distillation
Offline Distillation
Online Distillation
Self-Distillation
Response-based Knowledge
Feature-based Knowledge
Relation-based Knowledge
Network Pruning
Ranking and Pruning Parameters
Reconstruction-based Methods
Similarity Measurement
Network Quantization
...and 12 more sections

Figures (7)

Figure 1: Most influential proposed techniques since 2015 for Model Compression applied for Computer Vision. Low-Rank Factorization was omitted here due to its recent lack of usage in the field. A combination of the number of citations and novelty in the paper determined the most influential papers.
Figure 2: Model Compression Technique subdivision. We categorized Model Compression papers into four different areas. We also sort them by Computer Vision papers, including the number of papers found in each subcategory from Jan/2021 to Mar/2024.
Figure 3: Knowledge Distillation General Schematic. The teacher model usually receives the same input data as the student, and both features form the distillation loss, where the student will learn to mimic the teacher. The student can also have an application-dependent loss that varies depending on the application, such as cross-entropy for classification problems.
Figure 4: Network Pruning. The left side shows a binary model trained by conventional techniques and/or advanced strategies comprising five processing layers. The right side shows the same model after pruning. Different shades of blue indicate the degree of relevance of each element, determined by a rule: the darker the shade, the more important the element. Hence, connections to and from the low-relevance elements are removed after establishing a pruning approach. Irrelevant parameters from input and output layers remain after pruning. Then, the pruned structure can be fine-tuned. The pruned elements can be filters, channels, or structures.
Figure 5: Network Quantization. On the left, the original weights of a neural network are represented in a matrix format using 32-bit floating-point numbers. On the right, the network weights after the quantization process to an 8-bit integer.
...and 2 more figures
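
The 32-bit float to 8-bit integer mapping described in the Figure 5 caption can be illustrated with a minimal sketch. It assumes affine (asymmetric) per-tensor quantization, which is one common scheme and not necessarily the one used in the survey's case studies. The function names `quantize_int8` and `dequantize` are hypothetical and chosen only for this illustration.

```python
def quantize_int8(weights):
    """Affine per-tensor quantization of a list of floats to int8 (assumed scheme)."""
    qmin, qmax = -128, 127
    # Extend the range to include 0.0 so that zero is exactly representable.
    w_min = min(min(weights), 0.0)
    w_max = max(max(weights), 0.0)
    scale = (w_max - w_min) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0  # all weights are zero; any positive scale works
    # The zero point is the integer that represents the real value 0.0.
    zero_point = max(qmin, min(qmax, round(qmin - w_min / scale)))
    q = [max(qmin, min(qmax, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Map int8 values back to approximate floats."""
    return [scale * (v - zero_point) for v in q]


if __name__ == "__main__":
    weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
    q, scale, zero_point = quantize_int8(weights)
    print("int8 values:", q)
    print("reconstructed:", dequantize(q, scale, zero_point))
```

Each stored value shrinks from 4 bytes to 1 byte, at the cost of a rounding error of at most about one quantization step (`scale`) per weight in this sketch.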
