Rethinking RoPE Scaling in Quantized LLM: Theory, Outlier, and Channel-Band Analysis with Weight Rescaling

Ye Qiao, Haocheng Xu, Xiaofan Zhang, Sitao Huang

TL;DR

This work investigates the detrimental interaction between RoPE-based position interpolation (PI) and post-training quantization (PTQ) when extending LLM context. It shows that PI amplifies quantization errors through phase aliasing, dynamic-range dilation, anisotropy, and outlier shifts, and it introduces two diagnostics, interpolation pressure and tail-inflation ratios, to guide robust interventions. The authors propose Q-ROAR, a weight-only, band-wise RoPE rescaling method that searches for per-band scales for $W_Q$ and $W_K$, guided by these diagnostics and a tiny long-context development set, without fine-tuning or architectural changes. Empirically, Q-ROAR achieves consistent long-context perplexity improvements (exceeding 14% relative) on the GovReport and Proof-Pile benchmarks while preserving short-context performance and compatibility with standard LLM stacks. The approach offers a practical, portable fix for deploying longer-context LLMs under PTQ and RoPE interpolation across diverse hardware and software environments.

Abstract

Extending the context window of large language models (LLMs) is crucial for tasks with long-distance dependencies. RoPE-based interpolation and extrapolation methods, such as linear scaling and frequency-aware schemes, support longer inputs without retraining, while post-training quantization (PTQ) makes deployment practical. However, we show that combining RoPE position interpolation (PI) with PTQ degrades accuracy due to coupled effects including long-context aliasing, dynamic-range dilation, anisotropy from axis-aligned quantizers vs. rotated RoPE pairs, and outlier shifting that produces position-dependent logit noise. We provide, to the best of our knowledge, the first systematic analysis of the PI+PTQ approach and introduce two practical diagnostics: interpolation pressure (per-band sensitivity to phase scaling) and tail-inflation ratios (outlier shift from short to long contexts). Building on this analysis, we propose Q-ROAR (Quantization, RoPE-interpolation, and Outlier Aware Rescaling), a weight-only, interpolation-aware stabilization of PI for quantized LLMs. Q-ROAR groups RoPE dimensions into a small number of frequency bands and performs a lightweight search over per-band scales for Key and Query weights (with an optional symmetric variant to preserve logit scale). The search is guided by our diagnostics and uses a tiny long-context development dataset, requiring no fine-tuning of the model, no architecture or kernel changes, and no additional deployment overhead. Empirically, Q-ROAR reduces the model's perplexity on long-context workloads by more than 14%, while preserving short-context performance, inference throughput, and compatibility with existing LLM system stacks.
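
The symmetric variant mentioned above can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the adjacent-pair RoPE layout, the contiguous equal-width banding in `band_of_pair`, the toy weights, and the example scales are all assumptions made for illustration, and the diagnostic-guided search itself is not shown. The sketch only checks one property: if the rows of $W_Q$ in a band are multiplied by $g$ and the matching rows of $W_K$ by $1/g$, the full-precision attention logit under linear position interpolation is unchanged, because RoPE rotates each 2-D pair and a rotation commutes with a scalar applied to both coordinates of that pair.

```python
import math

def rope_frequencies(head_dim, base=10000.0):
    """Standard RoPE inverse frequencies, one per 2-D rotation pair."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def apply_rope(vec, position, freqs, pi_factor=1.0):
    """Rotate adjacent pairs (0,1), (2,3), ... of vec (layout is an assumption).
    pi_factor > 1 models linear position interpolation: position / pi_factor."""
    out = []
    for i, freq in enumerate(freqs):
        angle = (position / pi_factor) * freq
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def band_of_pair(pair_index, num_pairs, num_bands):
    """Hypothetical banding: contiguous, equal-width groups of rotation pairs."""
    return min(pair_index * num_bands // num_pairs, num_bands - 1)

def scale_rows_by_band(weight, band_scales, invert=False):
    """Scale each output row of a projection matrix by its band's scale.
    Both rows of a rotation pair share one scale; invert=True uses 1/g."""
    num_pairs = len(weight) // 2
    scaled = []
    for r, row in enumerate(weight):
        g = band_scales[band_of_pair(r // 2, num_pairs, len(band_scales))]
        if invert:
            g = 1.0 / g
        scaled.append([g * w for w in row])
    return scaled

def matvec(weight, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]

def attention_logit(w_q, w_k, x_query, x_key, pos_q, pos_k, freqs, pi_factor):
    """Unnormalized logit <RoPE(W_Q x_q), RoPE(W_K x_k)> for one head."""
    q = apply_rope(matvec(w_q, x_query), pos_q, freqs, pi_factor)
    k = apply_rope(matvec(w_k, x_key), pos_k, freqs, pi_factor)
    return sum(a * b for a, b in zip(q, k))

# Toy, deterministic setup (all values are illustrative assumptions).
head_dim, d_model = 8, 4
freqs = rope_frequencies(head_dim)
w_q = [[math.sin(r + 2 * c + 1) for c in range(d_model)] for r in range(head_dim)]
w_k = [[math.cos(2 * r + c + 1) for c in range(d_model)] for r in range(head_dim)]
x_query, x_key = [0.5, -1.0, 0.25, 2.0], [1.5, 0.3, -0.7, 0.9]
band_scales = [1.0, 1.25]   # one scale per band; a real search would pick these
pi_factor = 4.0             # e.g. extending the context window by 4x

logit_before = attention_logit(w_q, w_k, x_query, x_key, 6000, 100, freqs, pi_factor)
logit_after = attention_logit(
    scale_rows_by_band(w_q, band_scales),
    scale_rows_by_band(w_k, band_scales, invert=True),
    x_query, x_key, 6000, 100, freqs, pi_factor,
)
print(abs(logit_after - logit_before) < 1e-9)
```

In full precision the two logits agree up to floating-point rounding, so any benefit of such a rescaling can only come from how the rescaled weights interact with the quantizer. That is consistent with the description of the method as weight-only, with no kernel changes.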

Paper Structure

This paper contains 38 sections, 27 equations, 23 figures, 4 tables, 1 algorithm.

Figures (23)

  • Figure 1: Perplexity of quantized Llama-2-7b evaluated on the GovReport Dataset
  • Figure 2: Position Interpolation Methods Comparison. Performance comparison of NTK-aware scaling, YaRN interpolation, and no interpolation across FP16 and AWQ quantized models. YaRN consistently demonstrates superior performance for long-context extension, maintaining lower perplexity degradation as sequence length increases.
  • Figure 3: Detailed YaRN vs NTK Analysis. In-depth comparison showing (a) direct performance comparison, (b) relative improvement of YaRN over NTK, (c) scaling factor analysis, and (d) convergence behavior. YaRN shows consistent advantages, particularly at longer sequence lengths with up to 15% relative improvement over NTK interpolation.
  • Figure 4: Interpolation Method Deep Dive Analysis. Comprehensive analysis across four dimensions: (a) method comparison with error bars, (b) relative performance improvements, (c) sequence length sensitivity analysis, and (d) convergence stability metrics. Results demonstrate YaRN's robustness across different evaluation criteria.
  • Figure 5: Main Quantization Methods Comparison. Performance evaluation of primary quantization techniques across extended sequences. NF4 achieves the best quality-compression trade-off, while AWQ provides an excellent balance between performance and efficiency for practical deployment.
  • ...and 18 more figures