A Selective Quantization Tuner for ONNX Models

Nikolaos Louloudakis, Ajitha Rajan

TL;DR

SeQTO addresses the challenge of deploying quantized DNNs on heterogeneous hardware by enabling selective quantization of ONNX models and optimizing accuracy–size trade-offs. It analyzes layer activations with two metrics, QDQ error and XModel error, and combines their normalized values into a single error metric, $\text{error\_metric} = 0.5 \cdot \text{norm\_xmodel\_err} + 0.5 \cdot \text{norm\_qdq\_err}$. It then performs a Pareto-front search to identify the top candidates, with deployment supported on CPU via ONNX Runtime and on GPU via TVM. Evaluated on four ONNX models across CPU and GPU devices, including low-end devices, the resulting Pareto-optimal models achieve up to 54.14% lower accuracy loss while retaining up to 98.18% of the size reduction of fully quantized models. These contributions provide a practical, hardware-aware toolkit for quantization that can be integrated into CI/CD pipelines.
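The combined error metric can be illustrated with a minimal sketch. Only the equal 0.5 weighting of the two normalized errors comes from the summary above. The min-max normalization, the function names, and the idea of ranking layers by the combined score are assumptions made for illustration, not SeQTO's confirmed implementation.

```python
from typing import List, Sequence


def min_max_normalize(values: Sequence[float]) -> List[float]:
    """Scale values to [0, 1]. Min-max scaling is an assumption here;
    the source only states that the errors are normalized."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All layers have the same error, so none stands out.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def combined_error(
    xmodel_err: Sequence[float],
    qdq_err: Sequence[float],
    w_xmodel: float = 0.5,
    w_qdq: float = 0.5,
) -> List[float]:
    """Per-layer error_metric = 0.5 * norm_xmodel_err + 0.5 * norm_qdq_err."""
    norm_x = min_max_normalize(xmodel_err)
    norm_q = min_max_normalize(qdq_err)
    return [w_xmodel * x + w_qdq * q for x, q in zip(norm_x, norm_q)]


def rank_layers(errors: Sequence[float]) -> List[int]:
    """Hypothetical helper: layer indices sorted from highest to lowest error."""
    return sorted(range(len(errors)), key=lambda i: errors[i], reverse=True)


# Made-up per-layer errors for a three-layer model.
scores = combined_error([0.0, 2.0, 4.0], [1.0, 1.0, 3.0])
ranking = rank_layers(scores)
```

Under these assumptions, the layers at the front of `ranking` would be the most natural candidates to exclude from quantization.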

Abstract

Quantization reduces the precision of deep neural networks to lower model size and computational demands, but often at the expense of accuracy. Fully quantized models can suffer significant accuracy degradation, and resource-constrained hardware accelerators may not support all quantized operations. A common workaround is selective quantization, where only some layers are quantized while others remain at full precision. However, determining the optimal balance between accuracy and efficiency is a challenging task. To this end, we propose SeQTO, a framework that enables selective quantization, deployment, and execution of ONNX models on diverse CPU and GPU devices, combined with profiling and multi-objective optimization. SeQTO generates selectively quantized models, deploys them across hardware accelerators, evaluates performance on metrics such as accuracy and size, applies Pareto-front-based objective minimization to identify optimal candidates, and provides visualization of results. We evaluated SeQTO on four ONNX models under two quantization settings across CPU and GPU devices. Our results show that SeQTO effectively identifies high-quality selectively quantized models, achieving up to 54.14% lower accuracy loss while maintaining up to 98.18% of the size reduction of fully quantized models.
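The Pareto-front step over accuracy loss and model size can be sketched as follows. This is a generic two-objective minimization, not SeQTO's actual code. The `Candidate` type, the candidate names, and all numbers are hypothetical and chosen only to show how a dominated model is filtered out.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """Hypothetical record for one selectively quantized model."""
    name: str
    accuracy_loss: float  # lower is better
    size_mb: float        # lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """a dominates b if it is no worse on both objectives and better on one."""
    no_worse = a.accuracy_loss <= b.accuracy_loss and a.size_mb <= b.size_mb
    better = a.accuracy_loss < b.accuracy_loss or a.size_mb < b.size_mb
    return no_worse and better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep every candidate that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


# Made-up values for illustration only.
candidates = [
    Candidate("full_quant", 0.30, 25.0),
    Candidate("exclude_2_layers", 0.15, 26.0),
    Candidate("exclude_5_layers", 0.16, 30.0),  # worse than exclude_2 on both
    Candidate("full_precision", 0.00, 100.0),
]
front = pareto_front(candidates)
front_names = [c.name for c in front]
```

In this toy setup, `exclude_5_layers` is dropped because `exclude_2_layers` is both smaller and more accurate. The remaining models each represent a different accuracy–size trade-off.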

Paper Structure

This paper contains 11 sections, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Architecture of SeQTO with the following primary modules: (1) Model Orchestrator, (2) Selective Quantization Module, (3) Runner Module, (4) Metrics Benchmarking Module, and (5) Objectives Visualizer Module.
  • Figure 2: Normalized XModel Error and QDQ Error across layers for original and quantized MobileNetV2. X-axis indicates layer index; Y-axis indicates normalized error value.
  • Figure 3: Selective quantization of ResNet50 across hardware: X-axis shows excluded layers; Y-axis shows normalized accuracy loss (model dissimilarity) and model size.