Model Editing for LLMs4Code: How Far are We?

Xiaopeng Li, Shangwen Wang, Shasha Li, Jun Ma, Jie Yu, Xiaodong Liu, Jing Wang, Bin Ji, Weimin Zhang

TL;DR

This paper systematically evaluates six state-of-the-art model editing methods on three open-source LLMs4Code using the CLMEEval benchmark, which covers NL2PL (code generation) and PL2NL (code summarization) tasks. It finds that external memorization approaches such as GRACE achieve the best effectiveness and specificity, but that generalization remains a challenge for every evaluated technique, GRACE included. To address this, the authors propose A-GRACE, which augments GRACE with a contrastively trained encoder and significantly improves generalization on both CNLE and CSNE at modest overhead. The work highlights the practical difficulty of updating code knowledge in LLMs4Code and provides a strong baseline for future work on code-focused model editing.

Abstract

Large Language Models for Code (LLMs4Code) have been found to exhibit outstanding performance in the software engineering domain, especially in coding tasks. However, even the most advanced LLMs4Code inevitably contain incorrect or outdated code knowledge. Due to the high cost of training LLMs4Code, it is impractical to re-train the models to fix such problematic code knowledge. Model editing is a new technical field for effectively and efficiently correcting erroneous knowledge in LLMs, and various model editing techniques and benchmarks have been proposed recently. Despite that, a comprehensive study that thoroughly compares and analyzes the performance of state-of-the-art model editing techniques for adapting the knowledge within LLMs4Code across various code-related tasks is notably absent. To bridge this gap, we perform the first systematic study on applying state-of-the-art model editing approaches to repair inaccuracies in LLMs4Code. To that end, we introduce a benchmark named CLMEEval, which consists of two datasets: CoNaLa-Edit (CNLE) with 21K+ code generation samples and CodeSearchNet-Edit (CSNE) with 16K+ code summarization samples. With the help of CLMEEval, we evaluate six advanced model editing techniques on three LLMs4Code: CodeLlama (7B), CodeQwen1.5 (7B), and Stable-Code (3B). Our findings include that the external memorization-based GRACE approach achieves the best knowledge editing effectiveness and specificity (the editing does not influence untargeted knowledge), while generalization (whether the editing can generalize to other semantically identical inputs) is a universal challenge for existing techniques. Furthermore, building on in-depth case analysis, we introduce an enhanced version of GRACE called A-GRACE, which incorporates contrastive learning to better capture the semantics of the inputs.
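
To make the three properties named in the abstract (effectiveness, generalization, specificity) concrete, the following is a minimal sketch of an external-memory editor in the spirit of GRACE. It is an illustration under stated assumptions, not the authors' implementation: `ToyCodebook`, `answer`, the 2-D vectors standing in for encoder outputs, and the radius `epsilon=0.5` are all hypothetical. A real editor would derive keys from a model's hidden states.

```python
import math
from dataclasses import dataclass, field


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


@dataclass
class ToyCodebook:
    """Hypothetical key-value memory: each edit stores (key vector, new output)."""
    epsilon: float                      # assumed retrieval radius around each key
    entries: list = field(default_factory=list)

    def add_edit(self, key, value):
        self.entries.append((tuple(key), value))

    def lookup(self, query):
        """Return the value of the nearest key if it lies within epsilon, else None."""
        best_value, best_dist = None, float("inf")
        for key, value in self.entries:
            dist = euclidean(key, query)
            if dist < best_dist:
                best_value, best_dist = value, dist
        return best_value if best_dist <= self.epsilon else None


def answer(codebook, query_vec, base_output):
    """Use the stored edit when the query hits the memory; otherwise defer to the base model."""
    hit = codebook.lookup(query_vec)
    return hit if hit is not None else base_output


codebook = ToyCodebook(epsilon=0.5)
codebook.add_edit([1.0, 0.0], "edited output")

# Toy embeddings (assumed values, chosen only to illustrate the three cases).
original_input = [1.0, 0.0]   # the edited input itself
rephrase_near = [1.2, 0.1]    # a rephrasing the encoder places close to the key
rephrase_far = [2.0, 1.0]     # a rephrasing the encoder places far from the key
unrelated = [-3.0, 4.0]       # an input the edit should not affect

effectiveness = answer(codebook, original_input, "base output")
generalized = answer(codebook, rephrase_near, "base output")
missed = answer(codebook, rephrase_far, "base output")
specificity = answer(codebook, unrelated, "base output")
```

In this toy setting the edit is effective on the original input and leaves the unrelated input untouched, but it reaches a rephrasing only when the encoder happens to place that rephrasing within the radius. That dependence on key distance is one plausible reading of why the abstract reports generalization as the weak point, and why an encoder trained to pull semantically identical inputs together could help.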

Paper Structure

This paper contains 20 sections, 2 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Data construction process of the CNLE and CSNE datasets
  • Figure 2: Envelope plot of Euclidean Distance between OI-GRI and that between OI-SRI (keys generated by GRACE).
  • Figure 3: Average time cost of selected model editing techniques per edit.
  • Figure 4: Average peak memory cost of selected model editing techniques per edit.
  • Figure 5: Envelope plot of Euclidean Distance between OI-GRI and that between OI-SRI (keys generated by A-GRACE).
  • ...and 1 more figures