Rethinking RoPE Scaling in Quantized LLM: Theory, Outlier, and Channel-Band Analysis with Weight Rescaling
Ye Qiao, Haocheng Xu, Xiaofan Zhang, Sitao Huang
TL;DR
This work investigates the detrimental interaction between RoPE-based position interpolation (PI) and post-training quantization (PTQ) when extending LLM context windows. It shows that PI amplifies quantization error through phase aliasing, dynamic-range dilation, anisotropy, and outlier shifts, and it introduces two diagnostics, interpolation pressure and tail-inflation ratios, to guide robust interventions. The authors propose Q-ROAR, a weight-only, band-wise RoPE rescaling method that searches for per-band scales for $W_Q$ and $W_K$, guided by these diagnostics and a tiny long-context development set, without fine-tuning or architectural changes. Empirically, Q-ROAR consistently improves long-context perplexity (by more than 14% relative) on the GovReport and Proof-Pile benchmarks while preserving short-context performance and compatibility with standard LLM stacks. The approach offers a practical, portable way to deploy longer-context LLMs under PTQ and RoPE interpolation across diverse hardware and software environments.
Abstract
Extending the context window of large language models (LLMs) is crucial for tasks with long-distance dependencies. RoPE-based interpolation and extrapolation methods, such as linear scaling and frequency-aware schemes, support longer inputs without retraining, while post-training quantization (PTQ) makes deployment practical. However, we show that combining RoPE position interpolation (PI) with PTQ degrades accuracy due to coupled effects including long-context aliasing, dynamic-range dilation, anisotropy from axis-aligned quantizers vs. rotated RoPE pairs, and outlier shifting that produces position-dependent logit noise. We provide, to the best of our knowledge, the first systematic analysis of the PI+PTQ combination and introduce two practical diagnostics: interpolation pressure (per-band sensitivity to phase scaling) and tail-inflation ratios (outlier shift from short to long contexts). Guided by this analysis, we propose Q-ROAR (Quantization, RoPE-interpolation, and Outlier Aware Rescaling), a weight-only, interpolation-aware stabilization of PI for quantized LLMs. Q-ROAR groups RoPE dimensions into a small number of frequency bands and performs a lightweight search over per-band scales for the Key and Query weights (with an optional symmetric variant to preserve logit scale). The search is guided by our diagnostics and uses a tiny long-context development dataset, requiring no fine-tuning of the model, no architecture or kernel changes, and no additional deployment overhead. Empirically, Q-ROAR reduces the model's perplexity on long-context workloads by more than 14%, while preserving short-context performance, inference throughput, and compatibility with existing LLM system stacks.
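The band-wise rescaling and its symmetric variant can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the standard RoPE frequencies $\theta_i = \text{base}^{-2i/d}$, an interleaved pair layout, linear PI (positions divided by a factor), and equal-size contiguous bands. All function names and the scale values are hypothetical. The sketch only checks that scaling the $W_Q$ rows of a band by $g$ and the matching $W_K$ rows by $1/g$ leaves the full-precision attention logit unchanged, because a scalar applied to a 2-D rotary pair commutes with the rotation.

```python
import math
import random

def rope_frequencies(head_dim, base=10000.0):
    """Standard RoPE frequencies, one per 2-D rotary pair."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def assign_bands(num_pairs, num_bands):
    """Map each rotary pair to a contiguous band (assumed grouping, for illustration)."""
    return [i * num_bands // num_pairs for i in range(num_pairs)]

def scale_rows(W, bands, scales, invert=False):
    """Scale the two output rows of each rotary pair by its band's scale (or its inverse)."""
    scaled = []
    for r, row in enumerate(W):
        g = scales[bands[r // 2]]  # rows 2i and 2i+1 form pair i (interleaved layout)
        if invert:
            g = 1.0 / g
        scaled.append([g * w for w in row])
    return scaled

def rope(vec, pos, freqs, pi_factor=1.0):
    """Rotate each pair by its RoPE angle; linear PI divides the position by pi_factor."""
    out = []
    for i, f in enumerate(freqs):
        angle = (pos / pi_factor) * f
        c, s = math.cos(angle), math.sin(angle)
        a, b = vec[2 * i], vec[2 * i + 1]
        out.extend([a * c - b * s, a * s + b * c])
    return out

def logit(Wq, Wk, x_q, x_k, pos_q, pos_k, freqs, pi_factor=1.0):
    """Single-head attention logit q . k after projection and RoPE."""
    q = rope([sum(w * x for w, x in zip(row, x_q)) for row in Wq], pos_q, freqs, pi_factor)
    k = rope([sum(w * x for w, x in zip(row, x_k)) for row in Wk], pos_k, freqs, pi_factor)
    return sum(a * b for a, b in zip(q, k))

random.seed(0)
head_dim, in_dim = 8, 6
freqs = rope_frequencies(head_dim)
bands = assign_bands(len(freqs), num_bands=2)
scales = [1.0, 0.9]  # hypothetical per-band scales, not values from the paper

Wq = [[random.gauss(0, 1) for _ in range(in_dim)] for _ in range(head_dim)]
Wk = [[random.gauss(0, 1) for _ in range(in_dim)] for _ in range(head_dim)]
x_q = [random.gauss(0, 1) for _ in range(in_dim)]
x_k = [random.gauss(0, 1) for _ in range(in_dim)]

before = logit(Wq, Wk, x_q, x_k, 4000, 17, freqs, pi_factor=4.0)
after = logit(scale_rows(Wq, bands, scales),
              scale_rows(Wk, bands, scales, invert=True),
              x_q, x_k, 4000, 17, freqs, pi_factor=4.0)
print(abs(before - after))  # difference is at floating-point rounding level
```

The invariance shown here holds only in full precision. The sketch does not model quantization, the diagnostics, or the search procedure, and it makes no claim about how or where in the pipeline the scales are chosen.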
