aiXcoder-7B-v2: Training LLMs to Fully Utilize the Long Context in Repository-level Code Completion

Jia Li, Hao Zhu, Huanyu Liu, Xianjie Shi, He Zong, Yihong Dong, Kechi Zhang, Siyuan Jiang, Zhi Jin, Ge Li

TL;DR

This paper addresses a weakness in repository-level code completion: LLMs underutilize long-range repository context. It introduces CoLT, a reinforcement learning-based fine-tuning approach that explicitly encourages models to use long contexts, supported by the CoLT-132K dataset. Applying CoLT to aiXcoder-7B yields aiXcoder-7B-v2, which shows substantial improvements in exact-match and BLEU scores. The approach also generalizes to new languages and improves long-context utilization across multiple models. The authors release the dataset and model to support further work on long-context code completion.

Abstract

Repository-level code completion aims to complete code based on the long contexts of the repository. Existing studies extract long contexts from the repository as inputs and leverage Large Language Models (LLMs) to generate code. However, we reveal a severe limitation of LLMs: they may ignore the information within long contexts during code completion. In other words, even when the contexts contain useful information (e.g., relevant APIs or similar code), LLMs may fail to utilize it. We think this limitation is caused by an inherent bias in LLMs, i.e., relying on nearby contexts and ignoring long-range contexts. To address this, we propose a novel fine-tuning approach named CoLT. The core idea of CoLT is to provide explicit supervision signals, which emphasize that long-range contexts may hold relevant information. Specifically, CoLT uses reinforcement learning-based training that explicitly encourages models to utilize the information within long contexts and punishes models for ignoring long contexts. To support CoLT, we release CoLT-132K, a large-scale dataset with 132k samples across four languages, each containing long-context inputs. We apply CoLT to a popular LLM, aiXcoder-7B, and release aiXcoder-7B-v2. We conduct extensive experiments on CoLT-132K and a public benchmark, CrossCodeEval. Our experiments yield three results. (1) Effectiveness. CoLT substantially improves aiXcoder-7B. aiXcoder-7B-v2 outperforms aiXcoder-7B by up to 44% in exact match. aiXcoder-7B-v2 becomes the state-of-the-art 7B model in code completion and even surpasses larger models. (2) Generalizability. The capability learned by CoLT can generalize to new languages. In addition, CoLT is model-agnostic and effectively improves multiple LLMs. (3) Enhanced Context Utilization Capability. CoLT significantly improves the capability of LLMs in utilizing the relevant information within long contexts.
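
The abstract does not spell out the reinforcement learning objective. The caption of Figure 5 mentions "chosen vs. rejected rewards" and "reward accuracy", which is consistent with a pairwise preference setup. The sketch below illustrates that idea with a DPO-style loss. This choice is an assumption for illustration, not the authors' confirmed method, and every name in it (`pairwise_preference_loss`, `reward_accuracy`, `beta`) is hypothetical. Here the "chosen" completion stands for one that uses the long-range context and the "rejected" completion for one that ignores it.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pairwise_preference_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed temperature; not taken from the paper
):
    """DPO-style loss for one (chosen, rejected) completion pair.

    Each argument is the summed log-probability of a completion under the
    trainable policy or a frozen reference model.
    """
    # Implicit rewards: how far the policy has moved from the reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # The loss shrinks as the chosen reward exceeds the rejected reward.
    loss = -log_sigmoid(chosen_reward - rejected_reward)
    return loss, chosen_reward, rejected_reward


def reward_accuracy(reward_pairs) -> float:
    """Fraction of pairs whose chosen reward is above the rejected reward."""
    pairs = list(reward_pairs)
    return sum(1 for chosen, rejected in pairs if chosen > rejected) / len(pairs)
```

Under this sketch, a policy identical to the reference gives a loss of log 2 for every pair. The loss falls below that value once the policy assigns relatively more probability to the context-aware completion than to the context-ignoring one.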

Paper Structure

This paper contains 27 sections, 2 equations, 6 figures, 9 tables, 1 algorithm.

Figures (6)

  • Figure 1: Motivating Example ❶. Two SOTA LLMs fail to invoke the relevant API in the input context, leading to suboptimal results.
  • Figure 2: Motivating Example ❷. Two SOTA LLMs overlook the similar code in the input context, outputting wrong results.
  • Figure 3: The Pipeline for CoLT-132K Data Collection.
  • Figure 4: Three samples in CoLT-132K. The code to be completed is highlighted in yellow.
  • Figure 5: Training curves in SFT and RL. (a) Training and validation losses during SFT. (b) Reward accuracy during RL training. (c) Chosen vs. rejected rewards during RL training.
  • ...and 1 more figure