LinearARD: Linear-Memory Attention Distillation for RoPE Restoration

Ning Yang, Hengyu Zhong, Wentao Wang, Baoliang Tian, Haijun Zhang, Jun Wang

Abstract

The extension of context windows in Large Language Models is typically facilitated by scaling positional encodings followed by lightweight Continual Pre-Training (CPT). While effective for processing long sequences, this paradigm often disrupts original model capabilities, leading to performance degradation on standard short-text benchmarks. We propose LinearARD, a self-distillation method that restores Rotary Position Embeddings (RoPE)-scaled students through attention-structure consistency with a frozen native-RoPE teacher. Rather than matching opaque hidden states, LinearARD aligns the row-wise distributions of dense $Q/Q$, $K/K$, and $V/V$ self-relation matrices to directly supervise attention dynamics. To overcome the quadratic memory bottleneck of $n \times n$ relation maps, we introduce a linear-memory kernel. This kernel leverages per-token log-sum-exp statistics and fuses logit recomputation into the backward pass to compute exact Kullback-Leibler divergence and gradients. On LLaMA2-7B extended from 4K to 32K, LinearARD recovers 98.3\% of the short-text performance of state-of-the-art baselines while surpassing them on long-context benchmarks. Notably, our method achieves these results using only \textbf{4.25M} training tokens compared to the \textbf{256M} tokens required by LongReD and CPT. Our code is available at https://github.com/gracefulning/LinearARD.
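The following is a minimal, quadratic-memory reference sketch of the relation-distillation objective described above, written by us for illustration (it is not the released implementation, and the function and argument names are hypothetical): for each of Q, K, and V, form the $n \times n$ self-relation logits $\mathbf{X}\mathbf{X}^{\top}$, apply the same mask to teacher and student, and take the row-wise forward KL with the teacher as reference. The linear-memory kernel that makes this tractable at long context is the paper's contribution and is sketched separately below.

```python
import torch
import torch.nn.functional as F

# Quadratic-memory reference for the relation-distillation loss (our sketch).
# Assumes each row of `mask` has at least one True entry (e.g., a causal mask).
def relation_kl_loss(teacher_qkv, student_qkv, mask):
    """teacher_qkv / student_qkv: dicts with 'q', 'k', 'v' tensors of shape (n, d);
    mask: (n, n) boolean, True where a position may attend."""
    loss = 0.0
    for name in ("q", "k", "v"):
        zt = teacher_qkv[name] @ teacher_qkv[name].T   # teacher self-relation logits, (n, n)
        zs = student_qkv[name] @ student_qkv[name].T   # student self-relation logits, (n, n)
        zt = zt.masked_fill(~mask, float("-inf"))
        zs = zs.masked_fill(~mask, float("-inf"))
        log_pt = F.log_softmax(zt, dim=-1)             # row-wise teacher relation distribution
        log_ps = F.log_softmax(zs, dim=-1)             # row-wise student relation distribution
        # Forward KL(teacher || student), averaged over rows; masked entries would
        # produce 0 * inf = nan, so they are zeroed out explicitly.
        per_entry = (log_pt.exp() * (log_pt - log_ps)).masked_fill(~mask, 0.0)
        loss = loss + per_entry.sum(dim=-1).mean()
    return loss / 3.0
```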

Paper Structure

This paper contains 51 sections, 3 theorems, 19 equations, 4 figures, 8 tables, 2 algorithms.

Key Result

Proposition 3.1

Fix a query--key pair $(i,j)$ and consider the teacher and student probabilities $r_t \triangleq \mathbf{R}_t(i,j)$ and $r_s \triangleq \mathbf{R}_s(i,j)$, and the student logit $z_s \triangleq \mathbf{Z}_s(i,j)$. For $\mathcal{L}_{\text{MSE}} \triangleq \tfrac{1}{2}(r_s-r_t)^2$ and $\mathcal{L}_{\text{KL}} \triangleq r_t \log(r_t/r_s)$, the gradients with respect to the student logit are $\partial \mathcal{L}_{\text{MSE}}/\partial z_s = (r_s - r_t)\,r_s(1-r_s)$ and $\partial \mathcal{L}_{\text{KL}}/\partial z_s = -r_t(1-r_s)$. Consequently, as $r_s \to 0$ with $r_t>0$, the MSE gradient vanishes while the KL gradient approaches $-r_t \neq 0$, so the forward-KL objective keeps supplying a learning signal precisely where the student has collapsed a relation to which the teacher still assigns probability mass.
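A small numeric check makes the contrast in Proposition 3.1 concrete. The values below are hypothetical and only illustrate the closed-form gradients stated above: as $r_s$ shrinks with $r_t$ fixed, the MSE gradient decays toward zero while the KL gradient stays near $-r_t$.

```python
# Numeric illustration of Proposition 3.1 (values hypothetical).
r_t = 0.2                                    # teacher probability for the fixed (i, j) pair
for r_s in (1e-1, 1e-2, 1e-4, 1e-8):         # student probability shrinking toward 0
    # d/dz_s [ 0.5 * (r_s - r_t)^2 ],  using  d r_s / d z_s = r_s * (1 - r_s)  (softmax diagonal)
    grad_mse = (r_s - r_t) * r_s * (1.0 - r_s)
    # d/dz_s [ r_t * log(r_t / r_s) ] = -r_t * (1 - r_s)
    grad_kl = -r_t * (1.0 - r_s)
    print(f"r_s={r_s:.0e}  grad_MSE={grad_mse:+.3e}  grad_KL={grad_kl:+.3e}")
```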

Figures (4)

  • Figure 1: LinearARD pipeline. A frozen native-RoPE teacher provides dense row-wise relation distributions ($Q/Q$, $K/K$, and $V/V$), and the RoPE-scaled student is restored by minimizing relation KL with an exact linear-memory kernel.
  • Figure 2: Standard attention vs. QKV self-relations. (a) Attention matrix $\mathbf{A}$ computed from $\mathbf{Q}\mathbf{K}^{\top}$ (shown for reference), and (b--d) Q/Q, K/K, and V/V relation matrices computed from $\mathbf{Q}\mathbf{Q}^{\top}$, $\mathbf{K}\mathbf{K}^{\top}$, and $\mathbf{V}\mathbf{V}^{\top}$ (Eq. \ref{eq:qkv_relations}). LinearARD distills the row-wise relation distributions in (b--d) by aligning teacher vs. student rows via forward KL (Eq. \ref{eq:kl_def}) under the same mask $\mathbf{M}$.
  • Figure 3: Memory scaling of the linear-memory KL distillation kernel in Sec. \ref{subsec:io_aware} as a function of sequence length (a chunked, log-sum-exp-based sketch of this computation follows this list).
  • Figure 4: Training dynamics of distillation objectives. The relation/attention loss (the optimization target) and the logit distillation loss (monitored but not optimized) are plotted across training steps. The synchronous decline suggests a strong causal link between internal attention drift and output distribution shift.
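The sketch below, referenced from the Figure 3 item, illustrates the per-row log-sum-exp idea behind the linear-memory computation: a first streaming pass accumulates each row's log-normalizer, and a second pass recomputes the logits chunk by chunk to accumulate the exact forward KL, so no $n \times n$ relation map is ever materialized. This is our forward-pass illustration only; the function name and arguments are hypothetical, the mask and attention heads are omitted, and the paper's kernel additionally fuses logit recomputation into the backward pass, which a plain autograd version like this would not do.

```python
import torch

# Two-pass, chunked forward KL between teacher and student relation distributions
# softmax(Xt @ Xt^T) and softmax(Xs @ Xs^T), keeping only O(n * chunk) logits alive.
def chunked_relation_kl(Xt, Xs, chunk=1024):
    n = Xt.shape[0]
    lse_t = torch.full((n,), float("-inf"), device=Xt.device, dtype=Xt.dtype)
    lse_s = torch.full((n,), float("-inf"), device=Xt.device, dtype=Xt.dtype)
    # Pass 1: accumulate per-row log-sum-exp of the logits without storing the n x n map.
    for j in range(0, n, chunk):
        zt = Xt @ Xt[j:j + chunk].T                     # teacher logits, (n, chunk)
        zs = Xs @ Xs[j:j + chunk].T                     # student logits, (n, chunk)
        lse_t = torch.logaddexp(lse_t, torch.logsumexp(zt, dim=1))
        lse_s = torch.logaddexp(lse_s, torch.logsumexp(zs, dim=1))
    # Pass 2: recompute the logits chunk by chunk and accumulate the exact row-wise KL.
    kl = torch.zeros(n, device=Xt.device, dtype=Xt.dtype)
    for j in range(0, n, chunk):
        zt = Xt @ Xt[j:j + chunk].T
        zs = Xs @ Xs[j:j + chunk].T
        log_pt = zt - lse_t[:, None]                    # log teacher probabilities for this chunk
        log_ps = zs - lse_s[:, None]                    # log student probabilities for this chunk
        kl += (log_pt.exp() * (log_pt - log_ps)).sum(dim=1)
    return kl.mean()
```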

Theorems & Definitions (6)

  • Proposition 3.1: Gradient Behavior in Sparse Regimes
  • Theorem 3.2: Linear Memory Complexity
  • Proposition 3.3: Mathematical Exactness
  • Proofs of the three results above (3)