A Selective Quantization Tuner for ONNX Models

Nikolaos Louloudakis; Ajitha Rajan

TL;DR

SeQTO addresses the challenge of deploying quantized DNNs on heterogeneous hardware by enabling selective quantization of ONNX models and optimizing accuracy–size trade-offs. It analyzes layer activations using two error measures, QDQ and XModel, and combines them into a single error metric, $\text{error\_metric} = 0.5 \cdot \text{norm\_xmodel\_err} + 0.5 \cdot \text{norm\_qdq\_err}$. It then performs a Pareto-front search to identify the top candidates, with deployment supported on CPU via ONNX Runtime and on GPU via TVM. The approach yields Pareto-optimal models that substantially reduce size while limiting accuracy loss. It is demonstrated on four ONNX models across CPU and GPU devices, including low-end devices, achieving up to 54.14% lower accuracy loss while retaining up to 98.18% of the size reduction of fully quantized models. These contributions provide a practical, hardware-aware toolkit for quantization that can be integrated into CI/CD pipelines.
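As a rough illustration of the error metric above, the sketch below combines two per-layer error lists with equal weights of 0.5. The summary does not say how the `norm_` terms are normalized, so min-max scaling is an assumption here, and the function names `min_max_normalize` and `error_metric` are hypothetical rather than part of SeQTO.

```python
from typing import List


def min_max_normalize(values: List[float]) -> List[float]:
    """Scale values to [0, 1]. ASSUMPTION: the source does not specify
    the normalization scheme; min-max is used only for illustration."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All layers have the same error, so none stands out.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def error_metric(
    xmodel_errs: List[float],
    qdq_errs: List[float],
    w_xmodel: float = 0.5,
    w_qdq: float = 0.5,
) -> List[float]:
    """Per-layer score: 0.5 * norm_xmodel_err + 0.5 * norm_qdq_err."""
    norm_x = min_max_normalize(xmodel_errs)
    norm_q = min_max_normalize(qdq_errs)
    return [w_xmodel * x + w_qdq * q for x, q in zip(norm_x, norm_q)]


# Hypothetical raw errors for three layers (illustrative values only).
scores = error_metric([0.0, 2.0, 4.0], [10.0, 30.0, 20.0])
```

Under these assumptions, a layer with a higher score would be a stronger candidate for remaining at full precision.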

Abstract

Quantization reduces the precision of deep neural networks to lower model size and computational demands, but often at the expense of accuracy. Fully quantized models can suffer significant accuracy degradation, and resource-constrained hardware accelerators may not support all quantized operations. A common workaround is selective quantization, where only some layers are quantized while others remain at full precision. However, determining the optimal balance between accuracy and efficiency is challenging. To this end, we propose SeQTO, a framework that enables selective quantization, deployment, and execution of ONNX models on diverse CPU and GPU devices, combined with profiling and multi-objective optimization. SeQTO generates selectively quantized models, deploys them across hardware accelerators, evaluates them on metrics such as accuracy and size, applies Pareto-front-based objective minimization to identify optimal candidates, and provides visualization of results. We evaluated SeQTO on four ONNX models under two quantization settings across CPU and GPU devices. Our results show that SeQTO effectively identifies high-quality selectively quantized models, achieving up to 54.14% lower accuracy loss while maintaining up to 98.18% of the size reduction compared to fully quantized models.
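The Pareto-front selection step described above can be sketched as follows. This is a minimal illustration that assumes two objectives, accuracy loss and model size, both minimized. The `Candidate` type, the function names, and the numbers are hypothetical and are not taken from SeQTO or its results.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str
    accuracy_loss: float  # lower is better
    size_mb: float        # lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on both objectives and strictly
    better on at least one."""
    no_worse = a.accuracy_loss <= b.accuracy_loss and a.size_mb <= b.size_mb
    strictly_better = a.accuracy_loss < b.accuracy_loss or a.size_mb < b.size_mb
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep every candidate that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


# Illustrative values only, not measurements from the paper.
models = [
    Candidate("full_precision", 0.0, 100.0),
    Candidate("fully_quantized", 8.0, 25.0),
    Candidate("selective_a", 3.0, 27.0),
    Candidate("selective_b", 5.0, 40.0),  # dominated by selective_a
]
front = pareto_front(models)
```

In this toy input, `selective_b` is dropped because `selective_a` is both more accurate and smaller; the remaining models each represent a different trade-off between the two objectives.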
