RAMP: Reinforcement Adaptive Mixed Precision Quantization for Efficient On Device LLM Inference

Arpit Singh Gautam; Saurabh Jha

Abstract

Post-training quantization is essential for deploying large language models (LLMs) on resource-constrained hardware, yet state-of-the-art methods enforce uniform bit-widths across layers, yielding suboptimal accuracy-efficiency trade-offs. We present RAMP (Reinforcement Adaptive Mixed Precision), an off-policy Soft Actor-Critic framework that learns per-layer bit-width assignments to minimize perplexity under a global bit budget. The policy conditions on an 11-dimensional embedding of activation statistics, weight properties, and structural descriptors, enabling zero-shot transfer across model families and scales. To enable stable sub-4-bit quantization, we introduce Scale Folding, a preconditioning technique that migrates activation outliers into weights via per-channel scaling and normalization-layer compensation. A quality-prioritized reward with asymmetric penalties and budget cliffs drives rapid convergence. On Llama-2-7B, RAMP achieves 5.54 perplexity at 3.68 GB (3.65 effective bits), outperforming uniform 4-bit AWQ (5.60 at 3.90 GB) and GPTQ by 6% in size and 1% to 3% in quality. Critically, a policy trained only on Llama-2-7B generalizes zero-shot to Llama-2-13B and Mistral-7B, often surpassing target-specific training, supporting the hypothesis that quantization sensitivity is primarily architectural. The HALO pipeline exports allocations to GGUF format for kernel-free inference on CPUs, GPUs, and edge devices, retaining 99.5% of FP16 commonsense reasoning performance.
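To illustrate the idea behind Scale Folding as described in the abstract, the following is a minimal NumPy sketch, not the authors' implementation. It assumes an RMSNorm layer feeding a single linear layer and borrows a SmoothQuant-style scale, s_j = a_j^alpha / w_j^(1 - alpha), where a_j and w_j are per-channel absolute maxima of activations and weights. The scale formula, the `alpha` parameter, and the names `rms_norm` and `fold_scales` are assumptions introduced here for illustration; the abstract only states that per-channel scaling is combined with normalization-layer compensation.

```python
import numpy as np


def rms_norm(x, gamma, eps=1e-6):
    """RMSNorm with a per-channel gain gamma (the layer that absorbs 1/s)."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * gamma


def fold_scales(gamma, weight, act_absmax, alpha=0.5, eps=1e-8):
    """Shift activation range into the weights of the following linear layer.

    weight has shape (in_features, out_features), so the layer computes
    y = rms_norm(x, gamma) @ weight. The scale formula and alpha are
    assumptions (SmoothQuant-style), not taken from the paper.
    """
    w_absmax = np.abs(weight).max(axis=1)       # per input channel
    s = (act_absmax + eps) ** alpha / (w_absmax + eps) ** (1.0 - alpha)
    folded_gamma = gamma / s                    # normalization-layer compensation
    folded_weight = weight * s[:, None]         # per-channel scaling of weight rows
    return folded_gamma, folded_weight, s


# Hypothetical toy data: channel 3 carries an activation outlier.
rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))
gamma = rng.uniform(0.5, 1.5, size=8)
gamma[3] = 20.0
weight = rng.normal(size=(8, 4))

h = rms_norm(x, gamma)
folded_gamma, folded_weight, s = fold_scales(gamma, weight, np.abs(h).max(axis=0))
h_folded = rms_norm(x, folded_gamma)

# The layer output is unchanged up to floating-point error.
print(np.allclose(h @ weight, h_folded @ folded_weight))
```

Because the inverse scale is absorbed into the normalization gain, no extra operation is needed at inference time, which is consistent with the kernel-free export the abstract describes. With `alpha=0.5` in this sketch, each channel ends up with the same absolute maximum on the activation side and the weight side.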

Paper Structure (98 sections, 24 equations, 15 figures, 27 tables, 3 algorithms)

Introduction
The Memory Wall in Large Language Models
Limitations of Existing Quantization Methods
Uniform Bit-Width Allocation
Lack of Transferability Across Models
Hardware and Deployment Challenges for Mixed Precision
Reframing Quantization as Sequential Decision Making
RAMP: Reinforcement Learning for Adaptive Mixed-Precision Quantization
SAC-Based Bit-Width Policy
Transferable 11-Dimensional Layer Embeddings
Quality-Prioritized Reward Function
Hardware-Aware Export with Scale Folding
Contributions
Background & Related Work
Model Compression Landscape
...and 83 more sections

Figures (15)

Figure 1: Overview of the RAMP pipeline. Stage 1 uses a Soft Actor-Critic agent in a distributed multi-GPU setting to discover a mixed-precision strategy. Stage 2 performs kernel-free compilation via scale folding. Stage 3 quantizes the model layer-by-layer and exports it in GGUF format for deployment.
Figure 2: Reward computation in RAMP. After applying the policy, the quantized model is evaluated on perplexity, memory footprint, and activation stability. These signals are combined into a scalar reward that prioritizes quality while enforcing the bit budget.
Figure 3: Training dynamics showing perplexity, average bit-width, and reward across episodes.
Figure 4: Best-so-far perplexity encountered during search.
Figure 5: Perplexity vs. model size on Llama-2-7B. RAMP dominates uniform 4-bit baselines.
...and 10 more figures
