Quant-Trim in Practice: Improved Cross-Platform Low-Bit Deployment on Edge NPUs

Rayen Dhahri, Steffen Urban

TL;DR

Quant-Trim addresses cross-backend inconsistencies in low-bit edge deployment by training a hardware-agnostic checkpoint that remains robust under different compilers and precision regimes. It combines progressive fake quantization with reverse pruning to align training numerics with the deployed integer grid and suppress extreme weight tails that inflate scales, while preserving learnability. The approach is architecture-agnostic and exports via ONNX without vendor-specific graph edits, reducing the need for per-backend retraining while improving accuracy, calibration, and logit fidelity on edge NPUs. Practically, Quant-Trim enables reliable INT8/INT4 deployment with favorable latency, energy efficiency, and robustness across a range of devices and tasks, including NanoSAM2-based edge setups.

Abstract

Specialized edge accelerators rely on low-bit quantization, but vendor compilers differ in scaling, clipping, and kernel support, and often behave as black boxes. The same floating-point (FP) checkpoint can therefore yield inconsistent accuracy across backends, forcing practitioners to tweak flags or refactor models to vendor-friendly operator subsets. We introduce Quant-Trim, a training-phase method that produces a hardware-neutral checkpoint robust to backend and precision choices. It combines progressive fake quantization, which aligns training with the deployed integer grid, and reverse pruning, which tames outlier-driven scale inflation while preserving learnability. Quant-Trim is agnostic to quantization schemes (symmetric/asymmetric, per-tensor/per-channel, INT8/INT4) and requires no vendor-specific graph changes. Across models and tasks, it narrows the FP-to-low-bit gap, reduces dependence on compiler heuristics/calibration, and avoids per-backend retraining. We report accuracy and edge metrics (latency, throughput, energy per inference, and cost) under static/dynamic activation scaling and varying operator coverage.
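
The scale-inflation argument can be illustrated with a small sketch. It is not the authors' implementation: the helper names `fake_quantize` and `clip_to_quantile`, the 0.99 quantile, and the synthetic weights are all assumptions chosen for illustration, and the quantizer shown is plain symmetric per-tensor INT8. A single extreme weight sets the quantization scale, so the bulk of the weights lands on very few integer levels; clamping at a quantile threshold shrinks the scale and the rounding error on the bulk.

```python
import random

def fake_quantize(values, num_bits=8):
    """Symmetric per-tensor fake quantization: snap to the integer grid, then dequantize."""
    qmax = 2 ** (num_bits - 1) - 1                      # 127 for INT8
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    dequantized = [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]
    return dequantized, scale

def clip_to_quantile(values, q=0.99):
    """Clamp magnitudes at the q-th quantile of |w| (assumed stand-in for reverse pruning)."""
    magnitudes = sorted(abs(v) for v in values)
    tau = magnitudes[min(len(magnitudes) - 1, int(q * len(magnitudes)))]
    return [max(-tau, min(tau, v)) for v in values], tau

def mean_abs_error(reference, approx):
    return sum(abs(r - a) for r, a in zip(reference, approx)) / len(reference)

random.seed(0)
bulk = [random.gauss(0.0, 0.02) for _ in range(1000)]   # synthetic, well-behaved weights
weights = bulk + [1.5]                                   # plus one extreme outlier

deq_raw, scale_raw = fake_quantize(weights)
clipped, tau = clip_to_quantile(weights, q=0.99)
deq_clipped, scale_clipped = fake_quantize(clipped)

# Error is measured against the original bulk weights, so it includes clipping error.
err_raw = mean_abs_error(bulk, deq_raw[:1000])
err_clipped = mean_abs_error(bulk, deq_clipped[:1000])

print("scale without clipping:", scale_raw)
print("threshold tau:", tau, "scale with clipping:", scale_clipped)
print("bulk error without / with clipping:", err_raw, err_clipped)
```

In this toy setting the outlier alone determines the unclipped scale (1.5 / 127), which is the effect the abstract calls outlier-driven scale inflation. How Quant-Trim chooses its thresholds per layer and per training step is not specified here, so the fixed quantile should be read as a placeholder.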

Paper Structure

This paper contains 48 sections, 13 equations, 11 figures, 10 tables, 1 algorithm.

Figures (11)

  • Figure 1: Quant-Trim training pipeline. Our method combines two key components: (1) Reverse Pruning clips extreme weights at robust quantile thresholds $\tau_{\ell,t}$ to prevent scale inflation while retaining representational power, and (2) Progressive Fake Quantization smoothly interpolates between FP32 and INT8 execution via a curriculum schedule $\lambda_t$ to avoid optimization collapse. The blend coefficient gradually increases from 0 (full FP32 warmup) through a quartic ramp to 1 (full fake quantization), while computing per-tensor/channel scales and zero-points. Gradients flow via straight-through estimator (STE). The final model exports to standard ONNX without custom operators, ensuring compatibility with NPU compilers.
  • Figure 2: Distributional effect of Quant-Trim. Left: reverse pruning compresses weight tails, reducing scale inflation. Right: activations exhibit a narrower dynamic range, making INT8 mapping more stable.
  • Figure 3: Power–throughput trade-off for DINOv2 and ResNet-50. Batch size 1, $224{\times}224$ input. $x$-axis: median FPS over 200 timed iterations after 20 warm-up iterations; $y$-axis: average peak power (5 runs; whiskers show the 5th–95th percentiles). Encoding: color = device; marker shape = precision; filled markers = platform’s default runtime (NPUs: vendor runtime; NVIDIA: CUDA), unfilled markers = TensorRT. Left: DINOv2; Right: ResNet-50. Device specs are given in \ref{tab:edge-npu-specs}.
  • Figure 4: Training dynamics on CIFAR-100 (DINOv2). A small accuracy dip during the ramp phase is followed by convergence to MAP-level performance.
  • Figure 5: Quant-Trim exhibits a brief dip when fake-quantization ramps in, then recovers to near-baseline accuracy and loss by the end of training for ResNet on CIFAR-100.
  • ...and 6 more figures
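
The curriculum described in the Figure 1 caption (FP32 warmup, quartic ramp, full fake quantization) can be sketched as follows. This is a minimal sketch under stated assumptions: the function names `blend_coefficient` and `blended_weight`, the step counts, and the exact ramp form `progress ** 4` are hypothetical; the caption only says the ramp is quartic and that $\lambda_t$ moves from 0 to 1.

```python
def blend_coefficient(step, warmup_steps, ramp_steps):
    """Assumed schedule for lambda_t: 0 during warmup, quartic ramp, then 1."""
    if step < warmup_steps:
        return 0.0
    progress = min(1.0, (step - warmup_steps) / ramp_steps)
    return progress ** 4          # assumed quartic form; stays near 0 early in the ramp

def blended_weight(w_fp, w_fq, lam):
    """Interpolate between the full-precision value and its fake-quantized value."""
    return (1.0 - lam) * w_fp + lam * w_fq

# Hypothetical schedule: 100 warmup steps followed by a 400-step ramp.
for step in (0, 100, 300, 500, 1000):
    lam = blend_coefficient(step, warmup_steps=100, ramp_steps=400)
    print(step, lam, blended_weight(0.30, 0.25, lam))
```

A quartic ramp keeps the blend close to full precision for most of the ramp and introduces quantization noise late, which is consistent with the brief accuracy dip during the ramp phase noted in Figures 4 and 5. In an autograd framework the rounding step would additionally need a straight-through estimator so that gradients pass through it unchanged; that part is omitted here.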