Let the Code LLM Edit Itself When You Edit the Code

Zhenyu He, Jun Zhang, Shengjie Luo, Jingjing Xu, Zhi Zhang, Di He

TL;DR

This work tackles the latency challenge of real-time code editing with LLMs by addressing KV-cache temporal confusion after edits. It introduces Positional Integrity Encoding (PIE), a RoPE-based method that removes the rotary components causing misalignment and reapplies the correct rotations with a single matrix operation to update the cache. PIE closely approximates full-recomputation accuracy while reducing KV-cache encoding overhead by over 85% across 1.3B, 6.7B, and 33B models on multiple languages and editing tasks, including insertion, deletion, and multi-place edits. The approach enables efficient, accurate AI-assisted coding in dynamic editing scenarios and is compatible with existing KV-cache eviction strategies, making it practical for real-world programming workflows and interactive code generation systems.

Abstract

In this work, we investigate a typical scenario in code generation where a developer edits existing code in real time and requests a code assistant, e.g., a large language model, to re-predict the next token or next line on the fly. Naively, the LLM needs to re-encode the entire KV cache to provide an accurate prediction. However, this process is computationally expensive, especially when the sequence is long. Simply encoding the edited subsequence and integrating it into the original KV cache causes a temporal confusion problem, leading to significantly worse performance. We address this efficiency and accuracy trade-off by introducing Positional Integrity Encoding (PIE). Building upon rotary positional encoding, PIE first removes the rotary matrices in the Key cache that introduce temporal confusion and then reapplies the correct rotary matrices. This process ensures that positional relationships between tokens are correct and requires only a single round of matrix multiplication. We validate the effectiveness of PIE through extensive experiments on the RepoBench-C-8k dataset, using DeepSeek-Coder models with 1.3B, 6.7B, and 33B parameters. Our evaluation includes three real-world coding tasks: code insertion, code deletion, and multi-place code editing. Results demonstrate that PIE reduces computational overhead by over 85% compared to the standard full recomputation approach across all model sizes and tasks, while closely approximating the performance of full recomputation.
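The remove-then-reapply step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`rope_rotate`, `shift_cached_keys`), the base of 10000, and the adjacent-pair rotation layout are assumptions made for illustration, and real model code may pair dimensions differently. The sketch relies on the fact that rotary rotations compose, so undoing the rotation for an old position and applying the one for a new position is the same as a single rotation by the position offset.

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    """Rotate one key vector by the RoPE angles for position `pos`.

    Assumption: adjacent pairs (x[2i], x[2i+1]) share the frequency
    theta_i = base ** (-2i / dim).
    """
    dim = len(vec)
    out = []
    for i in range(dim // 2):
        angle = pos * base ** (-2.0 * i / dim)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def shift_cached_keys(cached_keys, delta, base=10000.0):
    """Re-index already-rotated keys by `delta` positions.

    Since R(new) @ R(old)^-1 == R(new - old), one rotation by the offset
    replaces "remove the stale rotation, then reapply the correct one".
    Use delta > 0 after an insertion and delta < 0 after a deletion.
    """
    return [rope_rotate(k, delta, base) for k in cached_keys]

if __name__ == "__main__":
    # Hypothetical cache: three keys stored at positions 5, 6, 7.
    raw = [[0.1 * (i + j) for j in range(8)] for i in range(3)]
    cached = [rope_rotate(k, p) for k, p in zip(raw, [5, 6, 7])]
    # Two tokens are inserted before them, so they move to 7, 8, 9.
    shifted = shift_cached_keys(cached, 2)
    print(len(shifted), len(shifted[0]))
```

The sketch covers only the positional bookkeeping for cached keys. Whether the updated cache matches full recomputation closely enough is the empirical question the paper evaluates.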

Paper Structure (31 sections, 7 equations, 4 figures, 10 tables)

Figures (4)

  • Figure 1: Latency and accuracy comparison of the full recomputation approach and our PIE using DeepSeek-Coder 6.7B on the RepoBench-C-8k (XF-F) Python dataset on a single A100 GPU. Latency measures only the time spent updating the KV cache.
  • Figure 2: Illustration of the KV cache mechanism in both static and real-time editing settings for large language models (LLMs). Top: In the static setting, the model processes a fixed input to generate predictions, leveraging precomputed Key/Value (KV) pairs stored in the cache. Bottom: In the real-time editing setting, the input code is frequently edited, necessitating updates to the KV cache to maintain accurate information to generate the correct next tokens. Our objective is to optimize the efficiency of the green arrow pathway, which represents the process of updating the KV cache in response to code edits.
  • Figure 3: Cosine similarity of key representations across model layers. The plots compare the cosine similarity between ${\bm{K}}_{[j+1:n]}$ and ${\bm{K}}^*_{[j+1:n]}$ (indicating temporal confusion of Conflict Fast Encoding) with the cosine similarity between ${\bm{K}}^{\text{edit}}_{[j+1:n]}$ and ${\bm{K}}^*_{[j+1:n]}$ (showing the effectiveness of PIE).
  • Figure 4: KL divergence of the generated token distributions. The plots compare the KL divergence between the generated token distributions of PIE and Full-recomputation, and the KL divergence between the generated token distributions of Conflict Fast Encoding and Full-recomputation.
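The two diagnostics named in Figures 3 and 4 can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: the helper names are hypothetical, the inputs are plain Python lists, and the direction of the KL divergence (first argument as the reference distribution) is an assumption, since the captions do not state it.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two key vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two next-token distributions over the same vocabulary.

    `eps` guards against log(0); terms with p_i == 0 contribute nothing.
    """
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

if __name__ == "__main__":
    # Hypothetical toy values: a fully recomputed key vs. an updated key,
    # and two next-token distributions over a three-token vocabulary.
    print(cosine_similarity([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]))
    print(kl_divergence([0.7, 0.2, 0.1], [0.6, 0.3, 0.1]))
```

Under this reading, a cosine similarity near 1 and a KL divergence near 0 both indicate that an updated cache behaves like the fully recomputed one.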