Quant-Trim in Practice: Improved Cross-Platform Low-Bit Deployment on Edge NPUs
Rayen Dhahri, Steffen Urban
TL;DR
Quant-Trim addresses cross-backend inconsistencies in low-bit edge deployment by training a hardware-agnostic checkpoint that remains robust under different compilers and precision regimes. It combines progressive fake quantization with reverse pruning to align training numerics with the deployed integer grid and suppress extreme weight tails that inflate scales, while preserving learnability. The approach is architecture-agnostic and exports via ONNX without vendor-specific graph edits, reducing the need for per-backend retraining while improving accuracy, calibration, and logit fidelity on edge NPUs. Practically, Quant-Trim enables reliable INT8/INT4 deployment with favorable latency, energy efficiency, and robustness across a range of devices and tasks, including NanoSAM2-based edge setups.
Abstract
Specialized edge accelerators rely on low-bit quantization, but vendor compilers differ in scaling, clipping, and kernel support, and often operate as black boxes. The same floating-point (FP) checkpoint can therefore yield inconsistent accuracy across backends, forcing practitioners to tweak flags or refactor models to vendor-friendly operator subsets. We introduce Quant-Trim, a training-phase method that produces a hardware-neutral checkpoint robust to backend and precision choices. It combines progressive fake quantization, which aligns training with the deployed integer grid, with reverse pruning, which tames outlier-driven scale inflation while preserving learnability. Quant-Trim is agnostic to quantization schemes (symmetric/asymmetric, per-tensor/per-channel, INT8/INT4) and requires no vendor-specific graph changes. Across models and tasks, it narrows the FP-to-low-bit gap, reduces dependence on compiler heuristics and calibration, and avoids per-backend retraining. We report accuracy and edge metrics (latency, throughput, energy per inference, and cost) under static/dynamic activation scaling and varying operator coverage.
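To make the two ingredients named in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes symmetric per-tensor INT8 quantization, a linear blend between FP and fake-quantized weights as a stand-in for "progressive" fake quantization, and a simple quantile clip as a stand-in for the tail suppression that reverse pruning provides. The names `fake_quant`, `progressive_mix`, `lam`, and `tau` are hypothetical and chosen only for illustration.

```python
import numpy as np

def fake_quant(w, bits=8, clip=None):
    """Symmetric per-tensor fake quantization: snap to the integer grid, then dequantize.

    `clip` (hypothetical parameter) overrides the max-abs range used to set the scale.
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = float(np.max(np.abs(w))) if clip is None else float(clip)
    scale = max(max_abs, 1e-12) / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale, scale

def progressive_mix(w, lam, bits=8, clip=None):
    """Assumed schedule: lam=0 gives pure FP weights, lam=1 gives fully fake-quantized weights."""
    wq, _ = fake_quant(w, bits, clip)
    return (1.0 - lam) * w + lam * wq

# Synthetic weight tensor: a narrow bulk plus one extreme outlier.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=4096)
w[0] = 1.0

bulk = slice(1, None)  # every weight except the outlier

# Max-abs scaling: the single outlier sets the scale for the whole tensor.
wq_raw, s_raw = fake_quant(w)

# Tail suppression (illustrative quantile clip, not the paper's reverse pruning rule).
tau = float(np.quantile(np.abs(w), 0.999))
wq_clip, s_clip = fake_quant(w, clip=tau)

err_raw = float(np.mean((w[bulk] - wq_raw[bulk]) ** 2))
err_clip = float(np.mean((w[bulk] - wq_clip[bulk]) ** 2))

print(f"scale (max-abs): {s_raw:.6f}   bulk MSE: {err_raw:.3e}")
print(f"scale (clipped): {s_clip:.6f}   bulk MSE: {err_clip:.3e}")
```

In this toy setting the outlier alone determines the max-abs scale, so the bulk of the weights is spread over only a few integer levels; limiting the range shrinks the scale and lowers the quantization error on the bulk, at the cost of saturating the outlier. During training, `lam` would be ramped from 0 toward 1 so that the model sees the integer grid gradually; the actual schedule and pruning rule used by Quant-Trim are not specified in this section.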
